A comprehensive, hands-on tutorial for learning CUDA kernel development—from fundamentals to state-of-the-art LLM kernels including FlashAttention, DeepSeek Sparse Attention, and MoE accelerators.
This tutorial is aimed at developers who want to become expert GPU kernel engineers. You should have:
- C++ proficiency: Comfortable with templates and modern C++ (C++17)
- Python knowledge: For Triton and high-level DSL chapters
- Basic GPU understanding: Know what a GPU is and why it's fast for parallel workloads
- Linear algebra: Matrix operations, dot products, attention mechanisms
- NVIDIA GPU (Volta or newer recommended: V100, A100, RTX 30xx/40xx, H100)
- Minimum 8GB VRAM for most examples
- 16GB+ VRAM recommended for advanced chapters
Note: You need an NVIDIA GPU driver installed at the system level. Verify with `nvidia-smi`.
The easiest way to set up the environment is using micromamba or conda:
```bash
# Install micromamba (if not already installed)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# Create and activate environment
micromamba create -f environment.yml
micromamba activate cuda-tutorial

# Verify installation
nvcc --version
python -c "import torch; print(torch.cuda.is_available())"
```

This installs the CUDA toolkit, CMake, Python dependencies, and PyTorch in one step.
Alternative: Manual Installation
If you prefer manual installation or need system-wide CUDA:
```bash
# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

# Verify installation
nvcc --version
```

```bash
# CMake 3.18+
sudo apt-get install cmake ninja-build
```

```bash
python -m venv cuda-tutorial-env
source cuda-tutorial-env/bin/activate
pip install -r requirements.txt
```

For chapters 4-5 (CUTLASS & CuTe):

```bash
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS="80;89;90"
make -j$(nproc)
```

```
cuda-kernel-tutorial/
├── chapters/
│   ├── 01_introduction/     # GPU architecture & first kernels
│   ├── 02_cuda_basics/      # Memory hierarchy & indexing
│   ├── 03_profiling/        # Nsight profiling & optimization
│   ├── 04_cutlass_cute/     # CUTLASS & CuTe/CuteDSL
│   ├── 05_deepgemm/         # FP8 GEMM & MoE patterns
│   ├── 06_advanced_cuda/    # Warp primitives & cooperative groups
│   ├── 07_triton/           # Triton kernel development
│   ├── 08_tilelang/         # TileLang & high-level DSLs
│   ├── 09_sparse_attention/ # FlashAttention & sparse patterns
│   ├── 10_moe_accelerators/ # MoE optimization & SonicMoE
│   ├── 11_capstone/         # Full projects
│   └── 12_llm_serving/      # Mini-SGLang & FlashInfer
├── common/                  # Shared utilities
├── benchmarks/              # Performance benchmarks
└── tests/                   # Test harnesses
```
| Chapter | Topic | Key Concepts |
|---|---|---|
| 01 | GPU & CUDA Introduction | Threads, blocks, grids, memory model |
| 02 | CUDA Basics & Memory | Shared memory, synchronization, indexing |
| 03 | Profiling & Optimization | Nsight Compute, memory coalescing, bank conflicts |
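The "threads, blocks, grids" indexing model from Chapters 01-02 can be previewed in plain Python. This is an illustrative sketch of the launch-configuration arithmetic, not CUDA code; names like `launch_config` are hypothetical:

```python
import math

def launch_config(n, threads_per_block=256):
    """Grid-sizing math: how many blocks are needed to cover n elements."""
    blocks = math.ceil(n / threads_per_block)  # same as (n + t - 1) // t
    return blocks, threads_per_block

def global_thread_id(block_idx, thread_idx, block_dim):
    """The canonical CUDA index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

blocks, tpb = launch_config(1000)
print(blocks)  # 4 blocks of 256 threads cover 1000 elements (24 threads idle)
print(global_thread_id(3, 255, tpb))  # 1023 -- why kernels need a bounds check
```

Because the grid overshoots, every kernel body guards with `if (idx < n)` before touching memory; Chapter 02 covers this pattern in detail.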
| Chapter | Topic | Key Concepts |
|---|---|---|
| 04 | CUTLASS & CuTe | Tensor cores, CuTe layouts, CuteDSL |
| 05 | DeepGEMM | FP8 GEMM, grouped GEMM, MoE patterns |
| Chapter | Topic | Key Concepts |
|---|---|---|
| 06 | Advanced CUDA | Warp primitives, cooperative groups, CUDA graphs |
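As a mental model for the warp primitives in Chapter 06, here is a pure-Python simulation of the shuffle-down tree reduction across a 32-lane warp. Real code uses the `__shfl_down_sync` intrinsic; this sketch only mirrors its data movement:

```python
WARP_SIZE = 32

def warp_reduce_sum(lane_values):
    """Simulate the shuffle-down reduction with offsets 16, 8, 4, 2, 1.
    After the loop, lane 0 holds the sum of all 32 lanes."""
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by lane (i + offset), like
        # v += __shfl_down_sync(0xffffffff, v, offset) in CUDA.
        vals = [vals[i] + vals[i + offset] if i + offset < WARP_SIZE else vals[i]
                for i in range(WARP_SIZE)]
        offset //= 2
    return vals[0]

print(warp_reduce_sum(range(32)))  # 496 == sum(0..31), in 5 steps instead of 31
```

Five shuffle steps replace a 31-iteration serial sum, with no shared memory and no `__syncthreads()`, which is why warp-level reductions are the building block of fast softmax and norm kernels.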
| Chapter | Topic | Key Concepts |
|---|---|---|
| 07 | Triton | Block-based kernels, autotuning, fusion |
| 08 | TileLang | Tile-centric DSL, pipelining |
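Both Triton and TileLang express GEMM as a grid of tile programs, each owning one output tile and looping over K-chunks. A pure-Python sketch of that decomposition (no GPU required; tile sizes here are illustrative):

```python
def tiled_matmul(A, B, M, N, K, TM=2, TN=2, TK=2):
    """C[M,N] = A[M,K] @ B[K,N], computed one (TM x TN) output tile at a
    time, accumulating over K in TK-sized chunks -- the loop structure a
    Triton kernel expresses with tl.program_id and tl.dot."""
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, TM):          # one "program" per output tile
        for n0 in range(0, N, TN):
            for k0 in range(0, K, TK):  # K-loop: accumulate partial products
                for m in range(m0, min(m0 + TM, M)):
                    for n in range(n0, min(n0 + TN, N)):
                        for k in range(k0, min(k0 + TK, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU the two outer loops become the launch grid, and autotuning (Chapter 07) searches over TM/TN/TK to balance occupancy against shared-memory reuse.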
| Chapter | Topic | Key Concepts |
|---|---|---|
| 09 | Sparse Attention | FlashAttention, DeepSeek sparse attention |
| 10 | MoE Accelerators | Tile-aware optimization, SonicMoE |
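The core trick behind FlashAttention (Chapter 09) is the online softmax: scores are processed block by block while a running max and normalizer are maintained, so the full score row is never materialized. A minimal pure-Python sketch of that rescaling step:

```python
import math

def online_softmax(scores, block_size=2):
    """Streaming softmax: one pass keeps (running max m, running
    normalizer l), rescaling l by exp(m_old - m_new) whenever a new
    block raises the max -- the update at the heart of FlashAttention."""
    m, l = float("-inf"), 0.0
    for i in range(0, len(scores), block_size):
        block = scores[i:i + block_size]
        m_new = max(m, max(block))
        # Rescale the old normalizer to the new max, then fold in this block.
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in block)
        m = m_new
    return [math.exp(s - m) / l for s in scores]

probs = online_softmax([1.0, 3.0, 2.0, 0.5])
print(round(sum(probs), 6))  # 1.0 -- matches an ordinary softmax
```

In the real kernel the same rescaling is applied to the partial attention output accumulator, which is what lets attention run in O(block) rather than O(sequence) on-chip memory.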
| Chapter | Topic | Key Concepts |
|---|---|---|
| 12 | LLM Serving Kernels | Mini-SGLang, FlashInfer, KV cache, NCCL |
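Chapter 12's KV-cache management follows the paged idea used by FlashInfer/SGLang-style engines: token positions map through a page table to fixed-size blocks from a shared pool, instead of reserving `max_seq_len` per request. A toy sketch (class and method names here are illustrative, not a real API):

```python
class PagedKVCache:
    """Map each sequence's token positions to fixed-size pages drawn
    from a shared pool, so memory grows page-by-page per request."""
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        pages = self.page_table.setdefault(seq_id, [])
        if pos // self.page_size >= len(pages):  # current page is full
            pages.append(self.free_pages.pop())  # grab a page from the pool
        page = pages[pos // self.page_size]
        return page * self.page_size + pos % self.page_size  # physical slot

cache = PagedKVCache(num_pages=4, page_size=16)
slots = [cache.append_token("req0", p) for p in range(20)]
print(len(cache.page_table["req0"]))  # 2 -- 20 tokens span two 16-slot pages
```

The attention kernel then gathers K/V through this indirection, which is why paged attention kernels take a block table argument rather than a contiguous cache pointer.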
| Chapter | Topic | Key Concepts |
|---|---|---|
| 11 | Capstone Projects | End-to-end LLM inference, kernel comparison |
```bash
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;89;90"
make -j$(nproc)
```

```bash
cd chapters/01_introduction
mkdir build && cd build
cmake ..
make
```

```bash
# C++/CUDA tests
./build/tests/test_all

# Python tests
pytest tests/ -v
```

```bash
# Nsight Compute (kernel analysis)
ncu --set full -o profile ./build/chapters/03_profiling/examples/matmul

# Nsight Systems (timeline)
nsys profile -o timeline ./build/chapters/03_profiling/examples/matmul
```

- Complete beginners: Start with Chapter 01 and proceed sequentially
- Some CUDA experience: Skim 01-02, focus on 03-06
- Experienced developers: Jump to 04-05 for CUTLASS, then 07-10 for modern techniques
- Chapters 01-03: 2-3 hours each
- Chapters 04-06: 4-6 hours each
- Chapters 07-10: 6-8 hours each
- Chapter 12: 8-12 hours (includes external projects)
- Chapter 11: 10+ hours (project-based)
- DeepGEMM - FP8 GEMM from DeepSeek
- SonicMoE - Tile-aware MoE optimization
- FlashAttention - Memory-efficient attention
- TileLang - High-level kernel DSL
- LeetCUDA - 200+ CUDA exercises
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Happy kernel writing! 🚀
Start your journey: Chapter 01 - GPU & CUDA Introduction