Applied Machine Learning

GPU Bench & Kernel Hygiene

Profile GPU workloads without writing custom kernels: learn to read nsys (NVIDIA Nsight Systems) traces and make a measured case for hardware changes.

Duration: 3 weeks · intensive weekends
Format: Weekend intensive
Level: Advanced
Tuition (informational): ₩720,000


Program narrative

We stay above the kernel line while still teaching you to read occupancy and memory bandwidth charts. Each participant brings one slow job; we co-write a remediation brief you can hand to infra.

What is included

· Trace capture checklists for Windows and Linux runners
· Roofline sketches for matmul-heavy steps
· Batch size sweeps with power draw notes
· Mixed IO patterns and pinned memory experiments
· Template for requesting more SMs vs. higher memory bandwidth
· Peer review of two anonymized traces
· Cooling and thermal throttling awareness primer
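
As a rough illustration of the roofline sketches listed above, here is a minimal Python sketch, not course material. It assumes an idealized matmul in which each operand and the output cross the memory bus exactly once (real kernels tile and re-read data), and the `Device` figures are hypothetical round numbers rather than the specs of any real GPU. All function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator description (values are assumptions)."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of C[m,n] = A[m,k] @ B[k,n].

    Idealized traffic model: A and B are read once, C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def ridge_point(device: Device) -> float:
    """Intensity at which the memory roof meets the compute roof."""
    return device.peak_flops / device.mem_bandwidth


def attainable_flops(device: Device, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(device.peak_flops, intensity * device.mem_bandwidth)


def classify(device: Device, intensity: float) -> str:
    """Label a step by which roof limits it under this model."""
    return "compute-bound" if intensity >= ridge_point(device) else "memory-bound"


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
    dev = Device(peak_flops=100e12, mem_bandwidth=1e12)
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        ai = matmul_intensity(*shape)
        bound = attainable_flops(dev, ai)
        print(shape, f"{ai:.2f} FLOP/B", f"{bound / 1e12:.2f} TFLOP/s", classify(dev, ai))
```

Under this model the square matmul sits well to the right of the ridge point, while the single-row case (a matrix-vector product in effect) falls under the memory roof. That kind of contrast is what a request for more SMs versus higher memory bandwidth would need to show.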

Outcomes you can demo

· Deliver a two-page perf brief with prioritized fixes
· Identify one mistaken assumption in your prior benchmarking
· Propose an A/B hardware plan with measured uncertainty
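
One way to attach "measured uncertainty" to an A/B hardware comparison is a percentile bootstrap over repeated step timings. The sketch below is an assumption about how that might be done, not the course's prescribed method. The function name, the sample timings, and the choice of mean step time as the statistic are all hypothetical.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for speedup = mean(baseline) / mean(candidate).

    baseline, candidate: lists of step times (same units) from the two setups.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the brief is reproducible
    ratios = []
    for _ in range(n_resamples):
        # Resample each arm independently, with replacement.
        a = rng.choices(baseline, k=len(baseline))
        b = rng.choices(candidate, k=len(candidate))
        ratios.append(statistics.fmean(a) / statistics.fmean(b))
    ratios.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    point = statistics.fmean(baseline) / statistics.fmean(candidate)
    return point, ratios[lo_idx], ratios[hi_idx]


if __name__ == "__main__":
    # Hypothetical step times in milliseconds from two hardware configurations.
    config_a = [10.0, 10.2, 9.9, 10.1, 10.3]
    config_b = [8.0, 8.1, 7.9, 8.2, 8.0]
    point, low, high = bootstrap_speedup_ci(config_a, config_b)
    print(f"speedup {point:.3f}x, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate makes it harder to oversell a difference that sits within run-to-run noise. With only a handful of timings per arm, the interval is itself rough and should be described that way.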

Mentor of record

Noah Park

Spent a decade in HPC scheduling; focuses on ethical narratives around power use.

Participant questions

Do I need CUDA installed locally?

A remote bench machine is provided. Local installs are optional and unsupported beyond baseline instructions.

Is custom kernel writing included?

No. When kernel fusion is unavoidable, we point you to vendor workshops on Triton and CUDA C.

What if my workload is CPU-bound?

We will redirect you to the Data Mesh or MLOps tracks; this cohort assumes GPU-bound stages exist.

Recent participant notes

“GPU Bench & Kernel Hygiene forced us to attach power numbers to every ‘faster’ claim. Finance appreciated the appendix.”

— Rina · ML scientist · 5/5