A comprehensive, hands-on tutorial for learning CUDA kernel development—from fundamentals to state-of-the-art LLM kernels including FlashAttention, DeepSeek Sparse Attention, and MoE accelerators.
This tutorial is aimed at developers who want to become expert GPU kernel engineers. You should have:
- C++ proficiency: Comfortable with templates and modern C++ (C++17)
- Python knowledge: For Triton and high-level DSL chapters
- Basic GPU understanding: Know what a GPU is and why it's fast for parallel workloads
- Linear algebra: Matrix operations, dot products, attention mechanisms
- NVIDIA GPU (Volta or newer recommended: V100, A100, RTX 30xx/40xx, H100)
- Minimum 8GB VRAM for most examples
- 16GB+ VRAM recommended for advanced chapters
Note: You need an NVIDIA GPU driver installed at the system level. Verify with `nvidia-smi`.
The easiest way to set up the environment is using micromamba or conda:
```bash
# Install micromamba (if not already installed)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# Create and activate environment
micromamba create -f environment.yml
micromamba activate cuda-tutorial

# Verify installation
nvcc --version
python -c "import torch; print(torch.cuda.is_available())"
```

This installs the CUDA toolkit, CMake, Python dependencies, and PyTorch in one step.
Alternative: Manual Installation
If you prefer manual installation or need system-wide CUDA:
```bash
# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

# Verify installation
nvcc --version
```

```bash
# CMake 3.18+
sudo apt-get install cmake ninja-build
```

```bash
python -m venv cuda-tutorial-env
source cuda-tutorial-env/bin/activate
pip install -r requirements.txt
```

For chapters 4-5 (CUTLASS & CuTe):

```bash
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS="80;89;90"
make -j$(nproc)
```

```
cuda-kernel-tutorial/
├── chapters/
│   ├── 01_introduction/     # GPU architecture & first kernels
│   ├── 02_cuda_basics/      # Memory hierarchy & indexing
│   ├── 03_profiling/        # Nsight profiling & optimization
│   ├── 04_cutlass_cute/     # CUTLASS & CuTe/CuteDSL
│   ├── 05_deepgemm/         # FP8 GEMM & MoE patterns
│   ├── 06_advanced_cuda/    # Warp primitives & cooperative groups
│   ├── 07_triton/           # Triton kernel development
│   ├── 08_tilelang/         # TileLang & high-level DSLs
│   ├── 09_sparse_attention/ # FlashAttention & sparse patterns
│   ├── 10_moe_accelerators/ # MoE optimization & SonicMoE
│   ├── 11_capstone/         # Full projects
│   └── 12_llm_serving/      # Mini-SGLang & FlashInfer
├── common/                  # Shared utilities
├── benchmarks/              # Performance benchmarks
└── tests/                   # Test harnesses
```
| Chapter | Topic | Key Concepts |
|---|---|---|
| 01 | GPU & CUDA Introduction | Threads, blocks, grids, memory model |
| 02 | CUDA Basics & Memory | Shared memory, synchronization, indexing |
| 03 | Profiling & Optimization | Nsight Compute, memory coalescing, bank conflicts |
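The "threads, blocks, grids" indexing model from Chapters 01-02 can be previewed in plain Python. This is an illustrative sketch of the launch-configuration arithmetic, not CUDA code; names like `launch_config` are hypothetical:

```python
import math

def launch_config(n, threads_per_block=256):
    """Grid-sizing math: how many blocks are needed to cover n elements."""
    blocks = math.ceil(n / threads_per_block)  # same as (n + t - 1) // t
    return blocks, threads_per_block

def global_thread_id(block_idx, thread_idx, block_dim):
    """The canonical CUDA index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

blocks, tpb = launch_config(1000)
print(blocks)  # 4 blocks of 256 threads cover 1000 elements (24 threads idle)
print(global_thread_id(3, 255, tpb))  # 1023 -- why kernels need a bounds check
```

Because the grid overshoots, every kernel body guards with `if (idx < n)` before touching memory; Chapter 02 covers this pattern in detail.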
| Chapter | Topic | Key Concepts |
|---|---|---|
| 04 | CUTLASS & CuTe | Tensor cores, CuTe layouts, CuteDSL |
| 05 | DeepGEMM | FP8 GEMM, grouped GEMM, MoE patterns |
| Chapter | Topic | Key Concepts |
|---|---|---|
| 06 | Advanced CUDA | Warp primitives, cooperative groups, CUDA graphs |
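As a mental model for the warp primitives in Chapter 06, here is a pure-Python simulation of the shuffle-down tree reduction across a 32-lane warp. Real code uses the `__shfl_down_sync` intrinsic; this sketch only mirrors its data movement:

```python
WARP_SIZE = 32

def warp_reduce_sum(lane_values):
    """Simulate the shuffle-down reduction with offsets 16, 8, 4, 2, 1.
    After the loop, lane 0 holds the sum of all 32 lanes."""
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by lane (i + offset), like
        # v += __shfl_down_sync(0xffffffff, v, offset) in CUDA.
        vals = [vals[i] + vals[i + offset] if i + offset < WARP_SIZE else vals[i]
                for i in range(WARP_SIZE)]
        offset //= 2
    return vals[0]

print(warp_reduce_sum(range(32)))  # 496 == sum(0..31), in 5 steps instead of 31
```

Five shuffle steps replace a 31-iteration serial sum, with no shared memory and no `__syncthreads()`, which is why warp-level reductions are the building block of fast softmax and norm kernels.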
| Chapter | Topic | Key Concepts |
|---|---|---|
| 07 | Triton | Block-based kernels, autotuning, fusion |
| 08 | TileLang | Tile-centric DSL, pipelining |
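Both Triton and TileLang express GEMM as a grid of tile programs, each owning one output tile and looping over K-chunks. A pure-Python sketch of that decomposition (no GPU required; tile sizes here are illustrative):

```python
def tiled_matmul(A, B, M, N, K, TM=2, TN=2, TK=2):
    """C[M,N] = A[M,K] @ B[K,N], computed one (TM x TN) output tile at a
    time, accumulating over K in TK-sized chunks -- the loop structure a
    Triton kernel expresses with tl.program_id and tl.dot."""
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, TM):          # one "program" per output tile
        for n0 in range(0, N, TN):
            for k0 in range(0, K, TK):  # K-loop: accumulate partial products
                for m in range(m0, min(m0 + TM, M)):
                    for n in range(n0, min(n0 + TN, N)):
                        for k in range(k0, min(k0 + TK, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU the two outer loops become the launch grid, and autotuning (Chapter 07) searches over TM/TN/TK to balance occupancy against shared-memory reuse.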
| Chapter | Topic | Key Concepts |
|---|---|---|
| 09 | Sparse Attention | FlashAttention, DeepSeek sparse attention |
| 10 | MoE Accelerators | Tile-aware optimization, SonicMoE |
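The core trick behind FlashAttention (Chapter 09) is the online softmax: scores are processed block by block while a running max and normalizer are maintained, so the full score row is never materialized. A minimal pure-Python sketch of that rescaling step:

```python
import math

def online_softmax(scores, block_size=2):
    """Streaming softmax: one pass keeps (running max m, running
    normalizer l), rescaling l by exp(m_old - m_new) whenever a new
    block raises the max -- the update at the heart of FlashAttention."""
    m, l = float("-inf"), 0.0
    for i in range(0, len(scores), block_size):
        block = scores[i:i + block_size]
        m_new = max(m, max(block))
        # Rescale the old normalizer to the new max, then fold in this block.
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in block)
        m = m_new
    return [math.exp(s - m) / l for s in scores]

probs = online_softmax([1.0, 3.0, 2.0, 0.5])
print(round(sum(probs), 6))  # 1.0 -- matches an ordinary softmax
```

In the real kernel the same rescaling is applied to the partial attention output accumulator, which is what lets attention run in O(block) rather than O(sequence) on-chip memory.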
| Chapter | Topic | Key Concepts |
|---|---|---|
| 12 | LLM Serving Kernels | Mini-SGLang, FlashInfer, KV cache, NCCL |
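Chapter 12's KV-cache management follows the paged idea used by FlashInfer/SGLang-style engines: token positions map through a page table to fixed-size blocks from a shared pool, instead of reserving `max_seq_len` per request. A toy sketch (class and method names here are illustrative, not a real API):

```python
class PagedKVCache:
    """Map each sequence's token positions to fixed-size pages drawn
    from a shared pool, so memory grows page-by-page per request."""
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        pages = self.page_table.setdefault(seq_id, [])
        if pos // self.page_size >= len(pages):  # current page is full
            pages.append(self.free_pages.pop())  # grab a page from the pool
        page = pages[pos // self.page_size]
        return page * self.page_size + pos % self.page_size  # physical slot

cache = PagedKVCache(num_pages=4, page_size=16)
slots = [cache.append_token("req0", p) for p in range(20)]
print(len(cache.page_table["req0"]))  # 2 -- 20 tokens span two 16-slot pages
```

The attention kernel then gathers K/V through this indirection, which is why paged attention kernels take a block table argument rather than a contiguous cache pointer.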
| Chapter | Topic | Key Concepts |
|---|---|---|
| 11 | Capstone Projects | End-to-end LLM inference, kernel comparison |
```bash
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;89;90"
make -j$(nproc)
```

```bash
cd chapters/01_introduction
mkdir build && cd build
cmake ..
make
```

```bash
# C++/CUDA tests
./build/tests/test_all

# Python tests
pytest tests/ -v
```

```bash
# Nsight Compute (kernel analysis)
ncu --set full -o profile ./build/chapters/03_profiling/examples/matmul

# Nsight Systems (timeline)
nsys profile -o timeline ./build/chapters/03_profiling/examples/matmul
```

- Complete beginners: Start with Chapter 01 and proceed sequentially
- Some CUDA experience: Skim 01-02, focus on 03-06
- Experienced developers: Jump to 04-05 for CUTLASS, then 07-10 for modern techniques
- Chapters 01-03: 2-3 hours each
- Chapters 04-06: 4-6 hours each
- Chapters 07-10: 6-8 hours each
- Chapter 12: 8-12 hours (includes external projects)
- Chapter 11: 10+ hours (project-based)
- DeepGEMM - FP8 GEMM from DeepSeek
- SonicMoE - Tile-aware MoE optimization
- FlashAttention - Memory-efficient attention
- TileLang - High-level kernel DSL
- LeetCUDA - 200+ CUDA exercises
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Happy kernel writing! 🚀
Start your journey: Chapter 01 - GPU & CUDA Introduction