Stars
📚LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
VPTQ: a flexible, extreme low-bit quantization algorithm.
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient Multi-head Latent Attention Kernels
Train transformer language models with reinforcement learning.
Code for the NeurIPS 2024 paper on QuaRot, an end-to-end 4-bit inference scheme for large language models.
Fast Hadamard transform in CUDA, with a PyTorch interface (see the Hadamard sketch after this list).
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Building a quick conversation-based search demo with Lepton AI.
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (see the INT4 sketch after this list).
Official implementation of Half-Quadratic Quantization (HQQ)
Built on Megatron-DeepSpeed and the HuggingFace Trainer, EasyLLM reorganizes the code logic with a focus on usability while maintaining training efficiency.
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
The Triton TensorRT-LLM Backend
Awesome LLM compression research papers and tools.
TensorRT-LLM provides users with an easy-to-use Python API to define large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
A parser, editor, and profiler for ONNX models.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
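Several entries above center on the Walsh-Hadamard transform: the fast-hadamard-transform repo exposes a CUDA kernel for it, and QuaRot uses Hadamard rotations to suppress activation outliers before 4-bit quantization. As a rough illustration, here is a minimal pure-PyTorch sketch of the O(n log n) butterfly; the function name is my own choice, not the repo's API, and the real CUDA kernel is far more efficient than this reshape-and-stack version.

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    # Unnormalized fast Walsh-Hadamard transform over the last dimension.
    # The length must be a power of two; the butterfly structure needs
    # only O(n log n) adds/subs versus O(n^2) for a dense matmul with H.
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Pair lanes that are h apart and apply the (a+b, a-b) butterfly.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(*x.shape)

# Sanity check against a dense Hadamard matrix built via Kronecker products.
H2 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
H = H2
for _ in range(2):  # H_8 = H_2 (x) H_2 (x) H_2
    H = torch.kron(H, H2)
x = torch.randn(3, 8)
assert torch.allclose(hadamard_transform(x), x @ H, atol=1e-5)
```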
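The FP16xINT4 kernel above (Marlin) and AWQ both build on group-wise weight-only quantization. As a sketch of the data layout involved, not either project's actual algorithm or API, here is a symmetric per-group INT4 quantizer plus a dequantize-then-matmul reference; the function names and the group size of 128 are illustrative assumptions.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    # Symmetric per-group INT4 quantization of a weight matrix (out, in).
    # Every `group_size` consecutive input channels share one scale;
    # values map to integers in [-8, 7] (stored unpacked in int8 here;
    # real kernels pack two 4-bit values per byte).
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

def dequant_matmul(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor,
                   group_size: int = 128) -> torch.Tensor:
    # Reference W4A16-style matmul: dequantize, then a plain GEMM.
    # Fused kernels instead expand the 4-bit weights in registers, so
    # weight traffic from HBM stays at 4 bits per value, not 16.
    out_f, in_f = q.shape
    w = q.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    w = (w * scale.unsqueeze(-1)).reshape(out_f, in_f)
    return x @ w.t()

w = torch.randn(256, 256)           # float32 demo; real kernels use FP16
q, s = quantize_int4_groupwise(w)
y = dequant_matmul(torch.randn(4, 256), q, s)  # approximates x @ w.t()
```

The ~4x speedup at small batch sizes follows from decoding being memory-bandwidth-bound: reading 4-bit instead of 16-bit weights cuts the dominant traffic roughly fourfold.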


