Tracin's repositories

Expert Parallelism Load Balancer

Python 1,343 200 Updated Mar 24, 2025

🙌 OpenHands: AI-Driven Development

Python 67,433 8,394 Updated Feb 3, 2026

📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 9,559 940 Updated Jan 18, 2026

⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.

Cuda 148 7 Updated May 10, 2025

Structured Outputs

Python 13,365 662 Updated Feb 2, 2026

VPTQ: a flexible and extreme low-bit quantization algorithm

Python 674 51 Updated Apr 25, 2025

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

C++ 387 18 Updated Apr 13, 2025
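
The lookup-table idea behind this repo can be sketched in a few lines of NumPy: weights are stored as small integer codes into a shared codebook, and a gather step dequantizes them before the matmul. This is an illustrative sketch, not the repo's fused C++ kernel; the function and variable names are invented for the example.

```python
import numpy as np

def lut_matmul(a, codes, lut):
    """Matmul against lookup-table-quantized weights.

    a:     (m, k) float activations
    codes: (k, n) integer indices into `lut`
    lut:   (L,)   float codebook shared by all weights
    """
    w = lut[codes]   # dequantize: gather float values from the table
    return a @ w     # fast kernels fuse this gather into the GEMM itself

# Usage: a 4-entry codebook standing in for 2-bit weights
lut = np.array([-1.0, -0.5, 0.5, 1.0])
a = np.array([[1.0, 2.0]])
codes = np.array([[0, 3], [2, 1]])
out = lut_matmul(a, codes, lut)
```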

🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"

Python 964 48 Updated Mar 19, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,149 809 Updated Feb 3, 2026
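
"Fine-grained scaling" here means one scale factor per small block of values rather than per tensor. A minimal NumPy sketch of that idea, with int8 standing in for FP8 (names are illustrative, not DeepGEMM's API):

```python
import numpy as np

def quant_blockwise(x, block=4):
    """Per-block absmax quantization: one scale per `block` contiguous values."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(xb / scale).astype(np.int8)
    return q, scale

def dequant_blockwise(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

# Usage: a large outlier in one block no longer crushes the other block's precision
x = np.array([0.1, -0.5, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0], dtype=np.float32)
q, s = quant_blockwise(x)
x_hat = dequant_blockwise(q, s)
```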

DeepEP: an efficient expert-parallel communication library

Cuda 8,954 1,083 Updated Feb 3, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 12,443 977 Updated Jan 20, 2026

Train transformer language models with reinforcement learning.

Python 17,256 2,464 Updated Feb 3, 2026

High-Performance FP32 GEMM on CUDA devices

Cuda 117 8 Updated Jan 21, 2025

Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference for large language models.

Python 480 57 Updated Nov 26, 2024

Fast Hadamard transform in CUDA, with a PyTorch interface

C 280 50 Updated Oct 19, 2025
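
A Hadamard transform of length n can be computed in O(n log n) with a butterfly recursion. A plain NumPy sketch of the same algorithm the CUDA kernel implements (unnormalized convention; illustrative only, not the repo's interface):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized).
    Input length must be a power of two."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[0]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):          # butterfly over pairs of half-blocks
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

y = fwht([1.0, 0.0, 0.0, 0.0])  # -> [1, 1, 1, 1]
```

Since H squared equals n times the identity, applying the transform twice returns n times the input, which makes a handy correctness check.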

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

Python 675 68 Updated Nov 19, 2025

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 4,956 337 Updated Jan 18, 2026

Building a quick conversation-based search demo with Lepton AI.

TypeScript 8,121 1,024 Updated Dec 2, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 1,005 86 Updated Sep 4, 2024
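
The INT4 side of such a kernel starts with storage: two 4-bit values are packed into each byte, and the GEMM unpacks them on the fly. A minimal NumPy sketch of just the pack/unpack step (function names invented; the real kernel fuses unpacking into the matmul):

```python
import numpy as np

def pack_int4(w):
    """Pack an even-length array of values in [0, 15] two-per-byte."""
    w = np.asarray(w, dtype=np.uint8)
    assert w.size % 2 == 0, "need an even number of nibbles"
    return (w[0::2] | (w[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover the original nibble stream from packed bytes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F          # low nibble
    out[1::2] = (packed >> 4) & 0x0F   # high nibble
    return out

w = np.array([3, 10, 15, 0], dtype=np.uint8)
packed = pack_int4(w)  # 4 nibbles -> 2 bytes
```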

Official implementation of Half-Quadratic Quantization (HQQ)

Python 912 88 Updated Dec 18, 2025

Built upon Megatron-DeepSpeed and the HuggingFace Trainer, EasyLLM reorganizes the code logic with a focus on usability while preserving training efficiency.

Python 49 9 Updated Sep 18, 2024

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"

Python 396 35 Updated Feb 24, 2024

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)

Python 2,699 294 Updated Aug 14, 2024

The Triton TensorRT-LLM Backend

919 135 Updated Feb 3, 2026

Awesome LLM compression research papers and tools.

1,771 117 Updated Nov 10, 2025

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,794 2,067 Updated Feb 3, 2026

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Python 2,907 221 Updated Sep 30, 2023

Offline quantization tools for deployment.

Python 142 19 Updated Dec 28, 2023

A parser, editor and profiler tool for ONNX models.

Python 478 67 Updated Nov 3, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,431 292 Updated Jul 17, 2025