Stars
📚LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
VPTQ: a flexible, extreme low-bit quantization algorithm.
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient Multi-head Latent Attention Kernels
Train transformer language models with reinforcement learning.
Code for the NeurIPS 2024 paper on QuaRot, an end-to-end 4-bit inference scheme for large language models.
Fast Hadamard transform in CUDA, with a PyTorch interface (see the Hadamard sketch after this list).
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Building a quick conversation-based search demo with Lepton AI.
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (see the INT4 sketch after this list).
Official implementation of Half-Quadratic Quantization (HQQ)
Built on Megatron-DeepSpeed and the HuggingFace Trainer, EasyLLM reorganizes the code logic with a focus on usability while maintaining training efficiency.
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
The Triton TensorRT-LLM Backend
Awesome LLM compression research papers and tools.
TensorRT-LLM provides users with an easy-to-use Python API to define large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
A parser, editor, and profiler for ONNX models.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
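Several entries above center on the Walsh-Hadamard transform: the fast-hadamard-transform repo exposes a CUDA kernel for it, and QuaRot uses Hadamard rotations to suppress activation outliers before 4-bit quantization. As a rough illustration, here is a minimal pure-PyTorch sketch of the O(n log n) butterfly; the function name is my own choice, not the repo's API, and the real CUDA kernel is far more efficient than this reshape-and-stack version.

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    # Unnormalized fast Walsh-Hadamard transform over the last dimension.
    # The length must be a power of two; the butterfly structure needs
    # only O(n log n) adds/subs versus O(n^2) for a dense matmul with H.
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Pair lanes that are h apart and apply the (a+b, a-b) butterfly.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(*x.shape)

# Sanity check against a dense Hadamard matrix built via Kronecker products.
H2 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
H = H2
for _ in range(2):  # H_8 = H_2 (x) H_2 (x) H_2
    H = torch.kron(H, H2)
x = torch.randn(3, 8)
assert torch.allclose(hadamard_transform(x), x @ H, atol=1e-5)
```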
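The FP16xINT4 kernel above (Marlin) and AWQ both build on group-wise weight-only quantization. As a sketch of the data layout involved, not either project's actual algorithm or API, here is a symmetric per-group INT4 quantizer plus a dequantize-then-matmul reference; the function names and the group size of 128 are illustrative assumptions.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    # Symmetric per-group INT4 quantization of a weight matrix (out, in).
    # Every `group_size` consecutive input channels share one scale;
    # values map to integers in [-8, 7] (stored unpacked in int8 here;
    # real kernels pack two 4-bit values per byte).
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

def dequant_matmul(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor,
                   group_size: int = 128) -> torch.Tensor:
    # Reference W4A16-style matmul: dequantize, then a plain GEMM.
    # Fused kernels instead expand the 4-bit weights in registers, so
    # weight traffic from HBM stays at 4 bits per value, not 16.
    out_f, in_f = q.shape
    w = q.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    w = (w * scale.unsqueeze(-1)).reshape(out_f, in_f)
    return x @ w.t()

w = torch.randn(256, 256)           # float32 demo; real kernels use FP16
q, s = quantize_int4_groupwise(w)
y = dequant_matmul(torch.randn(4, 256), q, s)  # approximates x @ w.t()
```

The ~4x speedup at small batch sizes follows from decoding being memory-bandwidth-bound: reading 4-bit instead of 16-bit weights cuts the dominant traffic roughly fourfold.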


