📄 ASPLOS'26 Paper (to appear; contact: ruibo.fan@connect.hkust-gz.edu.cn) | Code
Authors: Ruibo Fan¹, Xiangrui Yu¹, Xinglin Pan¹, Zeyu Li¹, Weile Luo¹, Qiang Wang², Wei Wang³, Xiaowen Chu¹,³
¹ HKUST(GZ), ² HIT(SZ), ³ HKUST
ZipServ is the first lossless compression framework co-designed for high-performance LLM inference on GPUs. Through hardware-aware compression algorithms and a fused kernel design, ZipServ achieves significant inference acceleration while maintaining bit-exact numerical precision, addressing the performance bottleneck that traditional lossless compression methods face on GPUs.
Note: ZipServ is optimized for GDDR memory-based GPUs (e.g., RTX 4090, L40S, RTX 5090, RTX 6000 Pro) where memory bandwidth is the primary bottleneck. Performance gains may vary on HBM-based datacenter GPUs.
- Lossless Compression: Up to 30% model size reduction with bit-exact numerical precision
- Significant Speedup: Up to 2.21× kernel-level speedup over cuBLAS, 1.22× average end-to-end speedup over vLLM
- Hardware Co-design: Optimized for NVIDIA Tensor Core architectures
- First Practical System: The first demonstration that lossless compression can simultaneously provide storage savings and inference acceleration
Addressing the fundamental mismatch between traditional entropy coding (e.g., Huffman, ANS) and the GPU SIMT execution model, ZipServ proposes TCA-TBE encoding (a decoding sketch follows the list below):
- Fixed-Length Format: Exploits the highly skewed distribution of BFloat16 weight exponents via triple bitmap encoding
- Constant-Time Parallel Decoding: Achieves branch-free parallel decoding through lightweight bitwise operations, eliminating control-flow divergence
- Tensor Core Alignment: Hierarchical tiling organization (8×8, 16×64, 64×64) seamlessly aligns with Tensor Core computation patterns
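
The exact TCA-TBE bit layout is defined in the paper; the following is a minimal CUDA sketch of the branch-free decode idea under an assumed layout: three bitmap planes give each weight a 3-bit class index into a small per-tile exponent table, while the sign and mantissa travel verbatim in a fixed-size payload byte. All names (`decode_bf16`, `exp_table`, `payload`) are illustrative, not ZipServ's actual API.

```cuda
// Sketch of branch-free TCA-TBE decoding (layout is an assumption, not the
// paper's exact format). Three bitmap planes give every weight a 3-bit class
// index into a small per-tile exponent table; sign and mantissa are stored
// verbatim in a one-byte payload. Every lane executes the identical
// instruction sequence, so there is no control-flow divergence.
__device__ __forceinline__ unsigned short decode_bf16(
    unsigned b0, unsigned b1, unsigned b2,       // bitmap words covering this lane
    const unsigned char* __restrict__ exp_table, // 8-entry exponent dictionary
    unsigned char payload,                       // packed sign(1) + mantissa(7)
    int lane)                                    // bit position within the bitmaps
{
    // Assemble the 3-bit class index from the three bitmap planes.
    unsigned idx = ((b0 >> lane) & 1u)
                 | (((b1 >> lane) & 1u) << 1)
                 | (((b2 >> lane) & 1u) << 2);
    unsigned exp      = exp_table[idx];          // constant-time table lookup
    unsigned sign     = (payload >> 7) & 1u;
    unsigned mantissa = payload & 0x7Fu;
    // Reassemble the BFloat16 bit pattern: sign | exponent(8) | mantissa(7).
    return (unsigned short)((sign << 15) | (exp << 7) | mantissa);
}
```

Because the format is fixed-length, each thread can compute the address of its compressed weight directly, with no sequential prefix scan as in Huffman or ANS decoding.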
Addressing the redundant memory traffic caused by traditional decoupled pipelines (decompress to global memory, then compute), ZipServ designs a fused decompression-GEMM kernel (sketched after this list):
- "Load-Compressed, Compute-Decompressed": Compressed weights are loaded directly into registers and decoded on-the-fly
- Eliminate Intermediate Buffers: Decompressed results feed directly into Tensor Cores without going through global memory
- Maximize Compute Intensity: Overlaps data movement with computation to fully utilize memory bandwidth
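
To illustrate the fused pipeline, here is a heavily simplified CUDA sketch (one warp, one 16×16 output tile) that reuses the hypothetical `decode_bf16` helper from the sketch above. The real kernel uses the hierarchical 8×8 / 16×64 / 64×64 tiling and software pipelining described earlier; `CompressedTile` and all parameter names are assumptions for illustration only.

```cuda
#include <cuda_bf16.h>
#include <mma.h>
using namespace nvcuda;

struct CompressedTile {              // hypothetical packed 16x16 weight tile
    unsigned      bitmaps[3][8];     // three bitmap planes, 32 weights per word
    unsigned char exp_table[8];      // per-tile exponent dictionary
    unsigned char payload[256];      // sign+mantissa byte per weight
};

// Launch with a single warp (32 threads). A is assumed pre-tiled: 256 bf16
// activation values per k-tile, row-major with leading dimension 16.
__global__ void fused_decode_gemm(const CompressedTile* __restrict__ W,
                                  const __nv_bfloat16* __restrict__ A,
                                  float* __restrict__ C, int k_tiles)
{
    __shared__ __nv_bfloat16 w_smem[256];  // decoded weights never touch global memory

    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int kt = 0; kt < k_tiles; ++kt) {
        const CompressedTile& t = W[kt];
        // "Load-compressed, compute-decompressed": each thread decodes its
        // share of the weights in registers and stages them in shared memory.
        for (int i = threadIdx.x; i < 256; i += blockDim.x) {
            unsigned short bits = decode_bf16(
                t.bitmaps[0][i >> 5], t.bitmaps[1][i >> 5], t.bitmaps[2][i >> 5],
                t.exp_table, t.payload[i], i & 31);
            w_smem[i] = __ushort_as_bfloat16(bits);
        }
        __syncwarp();

        // Decoded tile feeds the Tensor Cores directly; no intermediate buffer.
        wmma::load_matrix_sync(a_frag, A + kt * 256, 16);
        wmma::load_matrix_sync(b_frag, w_smem, 16);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        __syncwarp();
    }
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```

The key property this sketch shows: the only global-memory reads for the weight matrix are of the compressed bytes, so the kernel's effective weight bandwidth scales with the compression ratio.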
- CUDA Toolkit: 11.8 or later
- GPU: NVIDIA Ampere architecture or newer (Compute Capability ≥ 8.0)
- Compiler: g++ with C++14 support
# Initialize environment
source Init.sh
# Build core library
cd build && make
# Build benchmarks
cd ../kernel_benchmark && source test_env && make
# Matrix multiplication benchmark
cd kernel_benchmark
./test_mm M K N SplitK
# Example: ./test_mm 4096 4096 128 1

If you use ZipServ in your research, please cite our ASPLOS'26 paper:
@inproceedings{zipserv2026,
title={ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression},
author={Fan, Ruibo and Yu, Xiangrui and Pan, Xinglin and Li, Zeyu and Luo, Weile and Wang, Qiang and Wang, Wei and Chu, Xiaowen},
booktitle={Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26)},
year={2026}
}

This project is licensed under the Apache License 2.0. See LICENSE for details.
Note: This is research prototype code. Please conduct thorough testing and validation before production use.