uv-xiao/kernel-tute

CUDA Kernel Development Tutorial

A comprehensive, hands-on tutorial for learning CUDA kernel development—from fundamentals to state-of-the-art LLM kernels including FlashAttention, DeepSeek Sparse Attention, and MoE accelerators.

Who Is This For?

This tutorial is designed for developers who want to become expert GPU kernel developers.

Prerequisites

  • C++ proficiency: Comfortable with templates, modern C++ (C++17)
  • Python knowledge: For Triton and high-level DSL chapters
  • Basic GPU understanding: Know what a GPU is and why it's fast for parallel workloads
  • Linear algebra: Matrix operations, dot products, attention mechanisms

Hardware Requirements

  • NVIDIA GPU (Volta or newer recommended: V100, A100, RTX 30xx/40xx, H100); the FP8 examples in Chapter 05 additionally require Ada (sm_89) or Hopper (sm_90) hardware
  • Minimum 8GB VRAM for most examples
  • 16GB+ VRAM recommended for advanced chapters

Software Setup

Note: You need an NVIDIA GPU driver installed at the system level. Verify with nvidia-smi.

Recommended: Micromamba/Conda (One-Command Setup)

The easiest way to set up the environment is using micromamba or conda:

# Install micromamba (if not already installed)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# Create and activate environment
micromamba create -f environment.yml
micromamba activate cuda-tutorial

# Verify installation
nvcc --version
python -c "import torch; print(torch.cuda.is_available())"

This installs CUDA toolkit, CMake, Python dependencies, and PyTorch in one step.

Alternative: Manual Installation

If you prefer manual installation or need system-wide CUDA:

1. CUDA Toolkit

# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

# Verify installation
nvcc --version

2. Build Tools

# CMake 3.18+
sudo apt-get install cmake ninja-build

3. Python Environment

python -m venv cuda-tutorial-env
source cuda-tutorial-env/bin/activate
pip install -r requirements.txt

Optional: CUTLASS

For chapters 4-5 (CUTLASS & CuTe):

git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS="80;89;90"
make -j$(nproc)

Profiling Tools

  • Nsight Compute: Kernel-level profiling
  • Nsight Systems: System-wide profiling

Repository Structure

cuda-kernel-tutorial/
├── chapters/
│   ├── 01_introduction/      # GPU architecture & first kernels
│   ├── 02_cuda_basics/       # Memory hierarchy & indexing
│   ├── 03_profiling/         # Nsight profiling & optimization
│   ├── 04_cutlass_cute/      # CUTLASS & CuTe/CuteDSL
│   ├── 05_deepgemm/          # FP8 GEMM & MoE patterns
│   ├── 06_advanced_cuda/     # Warp primitives & cooperative groups
│   ├── 07_triton/            # Triton kernel development
│   ├── 08_tilelang/          # TileLang & high-level DSLs
│   ├── 09_sparse_attention/  # FlashAttention & sparse patterns
│   ├── 10_moe_accelerators/  # MoE optimization & SonicMoE
│   ├── 11_capstone/          # Full projects
│   └── 12_llm_serving/       # Mini-SGLang & FlashInfer
├── common/                   # Shared utilities
├── benchmarks/               # Performance benchmarks
└── tests/                    # Test harnesses

Curriculum

Part 1: Foundations (Chapters 1-3)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 01 | GPU & CUDA Introduction | Threads, blocks, grids, memory model |
| 02 | CUDA Basics & Memory | Shared memory, synchronization, indexing |
| 03 | Profiling & Optimization | Nsight Compute, memory coalescing, bank conflicts |
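
The thread/block/grid indexing scheme from Chapters 01-02 can be sketched in pure Python. This is an illustrative model, not CUDA code: each "thread" computes its global index from `(blockIdx, blockDim, threadIdx)` exactly as `blockIdx.x * blockDim.x + threadIdx.x` does in a real kernel, and the bounds guard handles over-provisioned launches.

```python
# Pure-Python model of a 1D CUDA launch: iterate over every (block, thread)
# pair and let each "thread" compute its own global index.

def launch_1d(kernel, grid_dim, block_dim, *args):
    """Emulate a 1D launch of grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx   # global index
    if i < len(out):                         # bounds guard, as in real kernels
        out[i] = a[i] + b[i]

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
# Over-provisioned launch: ceil(10 / 4) = 3 blocks of 4 threads = 12 "threads".
launch_1d(vector_add, (n + 3) // 4, 4, a, b, out)
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

The extra two threads in the last block fall through the bounds guard, which is the same pattern a real CUDA kernel uses when the problem size is not a multiple of the block size.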

Part 2: High-Performance Libraries (Chapters 4-5)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 04 | CUTLASS & CuTe | Tensor cores, CuTe layouts, CuteDSL |
| 05 | DeepGEMM | FP8 GEMM, grouped GEMM, MoE patterns |
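
The *grouped GEMM* pattern from Chapter 05 can be sketched in a few lines of pure Python. The shapes and data here are illustrative only: one call multiplies several independently sized problems, as when each MoE expert sees a different number of tokens.

```python
# Pure-Python sketch of grouped GEMM: N independent matmuls in one call.

def matmul(a, b):
    """Naive (M,K) x (K,N) product over lists of lists."""
    K = len(b)
    N = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(K)) for j in range(N)]
            for row in a]

def grouped_gemm(a_groups, b_groups):
    """Each group i computes a_groups[i] @ b_groups[i]. A real grouped-GEMM
    kernel schedules all groups' tiles onto the GPU in a single launch
    instead of looping sequentially like this sketch does."""
    return [matmul(a, b) for a, b in zip(a_groups, b_groups)]

# Two "experts": expert 0 sees 2 tokens, expert 1 sees 1 token (K=2, N=2).
a_groups = [[[1, 2], [3, 4]], [[5, 6]]]
b_groups = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(grouped_gemm(a_groups, b_groups))
# [[[1, 2], [3, 4]], [[11, 11]]]
```

The point of the single-launch formulation on a GPU is that small per-expert problems would otherwise each pay kernel-launch overhead and leave SMs idle.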

Part 3: Advanced Techniques (Chapter 6)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 06 | Advanced CUDA | Warp primitives, cooperative groups, CUDA graphs |
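
The warp-primitive reductions covered in Chapter 06 can be modeled in pure Python. This sketch mimics the classic `__shfl_down_sync` tree reduction: lane `i` reads lane `i + offset`'s register, and the active range halves each step until lane 0 holds the warp's sum. The ascending lane order below reproduces the lockstep semantics, since each source lane has not yet been updated in the current round.

```python
# Pure-Python model of a warp-level tree reduction via shuffle-down.
WARP_SIZE = 32

def warp_reduce_sum(vals):
    """vals[i] models lane i's register; returns lane 0's final value,
    mirroring `for (o = 16; o > 0; o >>= 1) v += __shfl_down_sync(m, v, o);`"""
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(WARP_SIZE):
            src = lane + offset
            if src < WARP_SIZE:   # out-of-range shuffles contribute nothing here
                vals[lane] += vals[src]
        offset //= 2
    return vals[0]

print(warp_reduce_sum(range(32)))   # 496 == sum(0..31)
print(warp_reduce_sum([1] * 32))    # 32
```

On hardware this runs in 5 shuffle steps with no shared memory at all, which is why warp primitives are the preferred final stage of block-level reductions.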

Part 4: High-Level DSLs (Chapters 7-8)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 07 | Triton | Block-based kernels, autotuning, fusion |
| 08 | TileLang | Tile-centric DSL, pipelining |
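
The autotuning idea behind Triton's `@triton.autotune` (Chapter 07) reduces to: benchmark each candidate configuration and cache the winner. The sketch below is a pure-Python stand-in; the `blocked_sum` "kernel" and its configs are invented for illustration and are not Triton APIs.

```python
# Pure-Python sketch of configuration autotuning: time each candidate, keep
# the fastest. Triton does this per input-shape key and caches the result.
import time

def autotune(kernel, configs, *args, reps=3):
    """Run `kernel` under each config and return the fastest one."""
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        t0 = time.perf_counter()
        for _ in range(reps):
            kernel(*args, **cfg)
        t = (time.perf_counter() - t0) / reps
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg

def blocked_sum(data, block_size):
    """Toy 'kernel': reduce the input in tiles of block_size elements."""
    return sum(sum(data[i:i + block_size]) for i in range(0, len(data), block_size))

data = list(range(100_000))
configs = [{"block_size": 64}, {"block_size": 256}, {"block_size": 1024}]
best = autotune(blocked_sum, configs, data)
print(best)  # whichever block size timed fastest on this machine
```

Real autotuners also prune the search space and key the cache on problem shape, since the best tile size for one shape is rarely best for another.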

Part 5: LLM Kernels (Chapters 9-10)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 09 | Sparse Attention | FlashAttention, DeepSeek sparse attention |
| 10 | MoE Accelerators | Tile-aware optimization, SonicMoE |
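
The key trick inside FlashAttention (Chapter 09) is the *online softmax*: process scores one element (or tile) at a time, carrying a running max `m` and normalizer `l`, so the full score row never has to materialize in memory. A minimal pure-Python sketch of the 1D case:

```python
# Online softmax: streaming computation of sum_i softmax(scores)_i * values[i],
# rescaling the running state whenever a new maximum appears.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def online_softmax_weighted_sum(scores, values):
    m = float("-inf")  # running max
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum, in the current scale
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)          # rescale old state to the new max
        l = l * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / l

scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
w = softmax(scores)
ref = sum(wi * vi for wi, vi in zip(w, values))
out = online_softmax_weighted_sum(scores, values)
print(abs(out - ref) < 1e-9)  # True: streaming result matches the reference
```

FlashAttention applies the same rescaling per tile of keys/values, which is what lets it fuse the whole attention computation without writing the N x N score matrix to global memory.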

Part 6: LLM Serving Infrastructure (Chapter 12)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 12 | LLM Serving Kernels | Mini-SGLang, FlashInfer, KV cache, NCCL |
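
The paged KV-cache idea used by serving stacks covered in Chapter 12 can be sketched in pure Python. All names and the page size here are illustrative: each sequence's cache lives in fixed-size pages drawn from a shared pool, so memory grows a page at a time instead of reserving the full context length up front.

```python
# Pure-Python sketch of a paged KV cache: a shared page pool plus a per-sequence
# page table, with a new page claimed only when a page boundary is crossed.

PAGE_SIZE = 4  # tokens per page (real systems use e.g. 16)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # shared page pool
        self.page_tables = {}                     # seq_id -> list of page ids
        self.lengths = {}                         # seq_id -> tokens stored

    def append(self, seq_id, token_kv):
        """Append one token's KV entry (token_kv is a placeholder here)."""
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:  # current page is full (or sequence is new)
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def pages_used(self, seq_id):
        return len(self.page_tables.get(seq_id, []))

cache = PagedKVCache(num_pages=8)
for _ in range(6):                 # sequence "a": 6 tokens -> ceil(6/4) = 2 pages
    cache.append("a", ("k", "v"))
for _ in range(3):                 # sequence "b": 3 tokens -> 1 page
    cache.append("b", ("k", "v"))
print(cache.pages_used("a"), cache.pages_used("b"), len(cache.free_pages))
# 2 1 5
```

The attention kernel then gathers K/V through the page table rather than from one contiguous buffer, which is the extra indirection that paged-attention kernels are built around.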

Part 7: Integration (Chapter 11)

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| 11 | Capstone Projects | End-to-end LLM inference, kernel comparison |

Building & Running

Build All Examples

mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;89;90"
make -j$(nproc)

Build Specific Chapter

cd chapters/01_introduction
mkdir build && cd build
cmake ..
make

Run Tests

# C++/CUDA tests
./build/tests/test_all

# Python tests
pytest tests/ -v

Profiling

# Nsight Compute (kernel analysis)
ncu --set full -o profile ./build/chapters/03_profiling/examples/matmul

# Nsight Systems (timeline)
nsys profile -o timeline ./build/chapters/03_profiling/examples/matmul

Learning Path

Suggested Order

  1. Complete beginners: Start with Chapter 01 and proceed sequentially
  2. Some CUDA experience: Skim 01-02, focus on 03-06
  3. Experienced developers: Jump to 04-05 for CUTLASS, then 07-10 for modern techniques

Time Estimates

  • Chapters 01-03: 2-3 hours each
  • Chapters 04-06: 4-6 hours each
  • Chapters 07-10: 6-8 hours each
  • Chapter 12: 8-12 hours (includes external projects)
  • Chapter 11: 10+ hours (project-based)

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.


Happy kernel writing! 🚀

Start your journey: Chapter 01 - GPU & CUDA Introduction
