Applied Machine Learning
GPU Bench & Kernel Hygiene
Profile CUDA-adjacent workloads without writing custom kernels. Learn to read Nsight Systems (nsys) traces and argue for hardware changes.
- Duration: 3 weeks · intensive weekends
- Format: Weekend intensive
- Level: Advanced
- Tuition (informational): ₩720,000
Program narrative
We stay above the kernel line while still teaching you to read occupancy and memory bandwidth charts. Each participant brings one slow job; we co-write a remediation brief you can hand to infra.
What is included
- Trace capture checklists for Windows and Linux runners
- Roofline sketches for matmul-heavy steps
- Batch size sweeps with power draw notes
- Mixed IO patterns and pinned memory experiments
- Template for requesting larger SM counts vs. faster RAM
- Peer review of two anonymized traces
- Cooling and thermal throttling awareness primer
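To make the roofline item above concrete, here is a minimal sketch of the arithmetic behind such a chart for a single matmul. It is an illustration, not course material: the `Device` numbers are hypothetical placeholders rather than the specs of any real GPU, and the byte count assumes each matrix crosses the memory bus exactly once, which real kernels only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; fill in numbers from your own spec sheet."""
    peak_flops: float      # peak compute throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP per byte) of an (m x k) @ (k x n) matmul.

    Assumes A, B and the output C are each moved through memory once.
    """
    flops = 2 * m * n * k                                   # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


def attainable_flops(intensity: float, dev: Device) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(dev.peak_flops, intensity * dev.mem_bandwidth)


def bound(intensity: float, dev: Device) -> str:
    """Classify a step by comparing its intensity with the ridge point."""
    ridge = dev.peak_flops / dev.mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Placeholder device: 100 TFLOP/s and 1 TB/s, so the ridge sits at 100 FLOP/byte.
example_device = Device(peak_flops=100e12, mem_bandwidth=1e12)

square = matmul_intensity(1024, 1024, 1024)   # large square matmul, 2-byte elements
skinny = matmul_intensity(1, 4096, 4096)      # batch-size-1 matrix-vector product

print(f"square: {square:.1f} FLOP/B, {bound(square, example_device)}")
print(f"skinny: {skinny:.2f} FLOP/B, {bound(skinny, example_device)}")
```

Under these assumptions the square matmul lands to the right of the ridge point and the batch-size-1 product lands far to the left, which is the kind of evidence that separates a request for more SMs from a request for faster memory.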
Outcomes you can demo
- Deliver a two-page perf brief with prioritized fixes
- Identify one mistaken assumption in your prior benchmarking
- Propose an A/B hardware plan with measured uncertainty
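One way to attach "measured uncertainty" to an A/B hardware comparison is a percentile bootstrap over repeated step timings. The sketch below is an assumption about how that could look, not the method used in the course: the function name `bootstrap_speedup_ci` and the sample timings are made up for illustration, and it treats timings as independent samples, which warm-up effects and thermal throttling can violate.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline_s, candidate_s, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point, lo, hi) for the speedup mean(baseline) / mean(candidate).

    Hypothetical helper: percentile bootstrap over per-step wall-clock timings
    (in seconds) collected on hardware A (baseline) and hardware B (candidate).
    """
    rng = random.Random(seed)  # fixed seed so the brief is reproducible
    ratios = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping the original sample size.
        b = rng.choices(baseline_s, k=len(baseline_s))
        c = rng.choices(candidate_s, k=len(candidate_s))
        ratios.append(statistics.fmean(b) / statistics.fmean(c))
    ratios.sort()
    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    point = statistics.fmean(baseline_s) / statistics.fmean(candidate_s)
    return point, ratios[lo_idx], ratios[hi_idx]


# Made-up step timings in seconds; replace with values read from your own traces.
baseline = [1.00, 1.10, 0.90, 1.05, 0.95]
candidate = [0.50, 0.55, 0.45, 0.52, 0.48]

point, lo, hi = bootstrap_speedup_ci(baseline, candidate)
print(f"speedup {point:.2f}x, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate keeps a small or noisy sample from being read as a firm speedup claim.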
Mentor of record
Noah Park
Spent a decade in HPC scheduling; focuses on ethical narratives around power use.
Participant questions
Do I need CUDA installed locally?
A remote bench machine is provided. Local installs are optional and unsupported beyond baseline instructions.
Is custom kernel writing included?
No. We reference vendor workshops for Triton/CUDA C when kernel fusion is unavoidable.
What if my workload is CPU-bound?
We will redirect you to the Data Mesh or MLOps tracks; this cohort assumes GPU-bound stages exist.
Recent participant notes
“GPU Bench & Kernel Hygiene forced us to attach power numbers to every ‘faster’ claim. Finance appreciated the appendix.”