Applied Machine Learning

GPU Bench & Kernel Hygiene

Profile CUDA-adjacent workloads without writing custom kernels: learn to read nsys traces and argue for hardware changes.

Duration
3 weeks · intensive weekends
Format
Weekend intensive
Level
Advanced
Tuition (informational)
₩720,000
Program narrative

We stay above the kernel line while still teaching you to read occupancy and memory bandwidth charts. Each participant brings one slow job; we co-write a remediation brief you can hand to infra.

What is included

  • Trace capture checklists for Windows and Linux runners
  • Roofline sketches for matmul-heavy steps
  • Batch size sweeps with power draw notes
  • Mixed IO patterns and pinned memory experiments
  • Template for requesting larger SM counts vs. faster RAM
  • Peer review of two anonymized traces
  • Cooling and thermal throttling awareness primer
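
As a rough illustration of what a roofline sketch for a matmul-heavy step involves, the snippet below estimates arithmetic intensity and the attainable throughput ceiling. It is a minimal sketch, not course material: the peak throughput and bandwidth figures are hypothetical placeholders, and the byte count assumes each matrix crosses DRAM exactly once, which real kernels rarely achieve.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B with A (m x k) and B (k x n).

    Assumption: A and B are each read once and C is written once,
    so this is an upper bound on the intensity a real kernel sees.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_flops, mem_bandwidth * intensity)


# Hypothetical device numbers, chosen only for illustration.
PEAK_FLOPS = 100e12    # FLOP/s
MEM_BANDWIDTH = 1e12   # bytes/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # intensity where the two roofs meet

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
    ai = matmul_intensity(*shape)
    ceiling = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH)
    bound = "compute" if ai >= RIDGE else "memory"
    print(f"{shape}: {ai:.1f} FLOP/byte, ceiling {ceiling / 1e12:.2f} TFLOP/s, {bound}-bound")
```

Under these assumptions, a step that lands left of the ridge point supports a request for faster memory, while one to the right supports a request for more compute.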

Outcomes you can demo

  • Deliver a two-page perf brief with prioritized fixes
  • Identify one mistaken assumption in your prior benchmarking
  • Propose an A/B hardware plan with measured uncertainty
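
One way to attach measured uncertainty to an A/B hardware comparison is a bootstrap confidence interval on the speedup ratio. The sketch below is only an illustration under stated assumptions: the function name `bootstrap_speedup_ci` is hypothetical, the timings are made-up placeholder values rather than course results, and it compares medians with a plain percentile bootstrap.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for median(baseline) / median(candidate).

    Inputs are per-run wall-clock times in the same unit. A ratio above 1.0
    means the candidate configuration is faster.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    ratios = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping its original size.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        ratios.append(statistics.median(b) / statistics.median(c))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_resamples)]
    hi = ratios[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder step times in seconds (illustrative only).
baseline_runs = [10.2, 10.5, 9.9, 10.1, 10.4, 10.3]
candidate_runs = [8.1, 8.4, 7.9, 8.0, 8.3, 8.2]

low, high = bootstrap_speedup_ci(baseline_runs, candidate_runs)
print(f"speedup 95% interval: {low:.2f}x to {high:.2f}x")
```

If the interval includes 1.0, the measurements do not yet separate the two configurations, and the plan should call for more runs before recommending a purchase.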

Mentor of record

Noah Park

Spent a decade in HPC scheduling; focuses on ethical narratives around power use.

Participant questions

Do I need CUDA installed locally?

A remote bench machine is provided. Local installs are optional and unsupported beyond baseline instructions.

Is custom kernel writing included?

No. We reference vendor workshops for Triton/CUDA C when kernel fusion is unavoidable.

What if my workload is CPU-bound?

We will redirect you to the Data Mesh or MLOps tracks; this cohort assumes your workload has at least one GPU-bound stage.

Recent participant notes

“GPU Bench & Kernel Hygiene forced us to attach power numbers to every ‘faster’ claim. Finance appreciated the appendix.”
— Rina · ML scientist · 5/5