Official PyTorch implementation of EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
- [2024/08] Added support for Mistral-Large-Instruct quantization. W2g64 Mistral-Large-Instruct compressed to 35 GB with only 4% accuracy loss.
- [2024/07] New feature: transfer EfficientQAT quantized models to GPTQ v2 and BitBLAS formats, loadable through GPTQModel.
- [2024/07] Initial release of EfficientQAT, pushing the limits of uniform (INT) quantization efficiently.
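The 35 GB figure above can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes Mistral-Large-Instruct has roughly 123B parameters and that w2g64 stores a 16-bit scale and 16-bit zero-point per 64-weight group (both are assumptions here, not details from this README):

```python
# Illustrative size estimate for a w2g64 model (assumed: ~123B params,
# 16-bit scale + 16-bit zero-point stored per group of 64 weights).
params = 123e9
bits_per_weight = 2 + (16 + 16) / 64   # 2-bit weights + per-group metadata = 2.5
size_gib = params * bits_per_weight / 8 / 2**30
print(f"{size_gib:.1f} GiB")  # -> 35.8 GiB, consistent with the ~35 GB above
```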
- Clone the repository:
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
- Set up the environment:
conda create -n efficientqat python=3.11
conda activate efficientqat
pip install -r requirements.txt
We provide pre-quantized EfficientQAT models. For details, see the full model table in the original README.
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
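Both phases optimize a uniform, group-wise quantizer: weights are mapped to low-bit integers with one scale and zero-point per group (e.g. w2g64 means 2-bit weights with groups of 64). The sketch below illustrates the fake-quantize round trip with min/max-derived parameters; EfficientQAT instead *learns* the scale and zero-point during training, so this is only a simplified illustration:

```python
# Illustrative group-wise uniform fake quantization (not the training code).
def fake_quant(weights, bits=2, group_size=4):
    qmax = 2 ** bits - 1
    out = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / qmax or 1.0           # one scale per group
        zero = round(-lo / scale)                 # integer zero-point
        for w in g:
            q = max(0, min(qmax, round(w / scale) + zero))  # low-bit code
            out.append((q - zero) * scale)        # dequantized weight
    return out

w = [0.1, -0.4, 0.3, 0.2, 1.5, -1.2, 0.0, 0.7]
print(fake_quant(w, bits=2, group_size=4))
```

With 2 bits there are only four codes per group, so the reconstruction is coarse; Block-AP and E2E-QP exist precisely to recover the accuracy this rounding destroys.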
Modify the --model path in the script, then run:
examples/block_ap/LlamaForCasualLM/w2g64.bat
Modify the --quant_model_path in the script, then run:
For RedPajama dataset:
examples/e2e_qp/Llama-2-7b/w2g64-redpajama.bat
For Alpaca dataset:
examples/e2e_qp/Llama-2-7b/w2g64-alpaca.bat
- Download pre-quantized models:
pip install huggingface_hub
huggingface-cli download ChenMnZ/Llama-2-7b-EfficientQAT-w2g64 --local-dir ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64
- Evaluate:
@echo off
set CUDA_VISIBLE_DEVICES=0
python main_block_ap.py ^
--resume_quant ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64 ^
--net Llama-2 ^
--wbits 2 ^
--group_size 64 ^
--output_dir ./output/inference_results/ ^
--eval_ppl ^
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
Install gptqmodel:
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
bash install.sh
Transfer options:
- To GPTQ format:
examples/model_transfer/efficientqat_to_gptq/LlamaForCasualLM.bat
- To BitBLAS format:
examples/model_transfer/efficientqat_to_bitblas/LlamaForCasualLM.bat
- Convert fp32 to half-precision:
examples/model_transfer/fp32_to_16/LlamaForCasualLM.bat
Example for GPTQ or BitBLAS formats:
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ"
# or "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)
model = GPTQModel.from_quantized(quant_dir)
inputs = tokenizer("Model quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0]))
If you find this work useful, please cite:
@article{efficientqat,
title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Qiao, Yu and Luo, Ping},
journal={arXiv preprint arXiv:2407.11062},
year={2024}
}