Optimized Kernels
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.
A high-throughput and memory-efficient inference and serving engine for LLMs
Efficient Triton Kernels for LLM Training
Dynamic Memory Management for Serving LLMs without PagedAttention
FlashMLA: Efficient Multi-head Latent Attention Kernels
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels




