TEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, resulting in up to a 1.53-1.8x speedup in single-batch decoding.
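Concretely, the sparsity is magnitude-based: entries of each hidden activation whose absolute value falls below a calibrated threshold are zeroed, so the matching weight columns contribute nothing to the matrix multiply. A minimal PyTorch sketch of the thresholding idea (conceptual only, not the repository's fused kernel):

```python
import torch

def sparsify(x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero out activations whose magnitude falls below `threshold` (conceptual TEAL-style thresholding)."""
    return torch.where(x.abs() > threshold, x, torch.zeros_like(x))

# Toy example: a cutoff at the 50th percentile of |x| zeroes roughly half of the entries.
x = torch.randn(1, 4096)
t = x.abs().quantile(0.5).item()
print((sparsify(x, t) == 0).float().mean())  # ~0.5
```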
The current release supports:
- FP16 inference for Llama-2/3 models using uniform sparsities
- Accuracy evaluation for Llama-2/3 and Mistral models using uniform and block-wise greedy sparsities
- Open the repository in the devcontainer.
- Download model weights and convert them to the gpt-fast format (`scripts/prepare.sh`):

  ```bash
  repo_id=meta-llama/Llama-2-7b-hf \
  python scripts/download.py \
      --repo_id $repo_id \
      --path $SAVE_PATH && \
  python scripts/convert_hf_checkpoint.py \
      --checkpoint_dir $SAVE_PATH/$repo_id
  ```

- Repeat the previous step with a different `repo_id` if you want to use other models.
- Supported models:
  - meta-llama/Llama-2-7b-hf
  - meta-llama/Llama-2-13b-hf
  - meta-llama/Llama-2-70b-hf
  - meta-llama/Meta-Llama-3-8B
  - meta-llama/Meta-Llama-3-70B
  - mistralai/Mistral-7B-v0.1

  (The models after the first are claimed to be supported, but the exact repository IDs above are a best guess and may need verification.)

For easy usage, we provide calibrated thresholds for Llama-2/3 and Mistral models in the `models/` folder.
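For intuition, a calibrated threshold is essentially a quantile of the activation-magnitude distribution gathered offline: choosing a sparsity level p means taking the p-th quantile of |activation| for that projection as the cutoff. A hedged sketch of the histogram-to-threshold step (the binning below is illustrative, not the on-disk layout of the provided files):

```python
import torch

def calibrate_threshold(magnitudes: torch.Tensor, sparsity: float, bins: int = 512) -> float:
    """Return a cutoff such that roughly `sparsity` of the |activation| samples fall below it.

    `magnitudes` stands in for activation magnitudes collected on a calibration set;
    the histogram format here is illustrative, not this repository's file format.
    """
    hist, edges = torch.histogram(magnitudes.float(), bins=bins)
    cdf = torch.cumsum(hist, dim=0) / hist.sum()
    idx = int(torch.searchsorted(cdf, sparsity))
    return float(edges[min(idx + 1, bins)])

# Toy usage: |N(0, 1)| samples at 50% sparsity give a cutoff near 0.674 (the half-normal median).
mags = torch.randn(100_000).abs()
print(calibrate_threshold(mags, 0.5))
```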
- Navigate to gpt-fast:

  ```bash
  cd gpt-fast
  ```

- Run dense inference (`scripts/base_run.sh`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python generate.py \
      --compile \
      --checkpoint_path $SAVE_PATH/$repo_id/model.pth \
      --interactive
  ```

- Run sparse inference! (`scripts/run.sh`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python generate.py \
      --compile \
      --checkpoint_path $SAVE_PATH/$repo_id/model.pth \
      --hist_path ../models/$repo_id/histograms \
      --sparsity 0.5 \
      --interactive
  ```

  To benchmark inference speed, remove `--interactive`.
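The decoding speedup comes from skipping memory traffic: with y = Wx and many entries of x zeroed, only the columns of W that multiply surviving activations need to be read. The repository implements this as a fused Triton kernel; the dense-indexing PyTorch sketch below only illustrates the arithmetic, not the memory savings:

```python
import torch

def sparse_gemv(W: torch.Tensor, x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Compute y = W @ sparsify(x) using only the columns of W whose activations survive.

    Boolean indexing copies W[:, idx] here, so this is illustration only; the fused Triton
    kernel avoids loading the skipped columns of W in the first place.
    """
    idx = x.abs() > threshold
    return W[:, idx] @ x[idx]

W = torch.randn(4096, 4096)
x = torch.randn(4096)
t = x.abs().quantile(0.5).item()
y_sparse = sparse_gemv(W, x, t)
y_dense = W @ torch.where(x.abs() > t, x, torch.zeros_like(x))
assert torch.allclose(y_sparse, y_dense, atol=1e-3)
```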
Please treat the current inference implementation as just a proof of concept! There are a few limitations:

- Only FP16 is supported, as Triton does not currently support BF16 `atomic_add`.
- Block-wise greedy sparsities are not currently supported (expect this very soon!).
- Quantized sparse kernels are not currently supported (though we would love a PR!).
- Speculative decoding is untested.
For accuracy evaluation:

- Navigate to TEAL:

  ```bash
  cd TEAL
  ```

- Construct histograms for threshold calibration (`scripts/grab_acts.bash`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/grab_acts.py \
      --model_name $SAVE_PATH/$repo_id \
      --output_path $OUTPUT_PATH/$repo_id
  ```
- Run the perplexity test (`scripts/ppl_test.bash`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/ppl_test.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --sparsity 0.5
  ```
- (Optional) Run block-wise greedy optimization (`scripts/greedyopt.bash`), then re-run the perplexity test with `--greedy_flag` (a conceptual sketch of the greedy loop follows this list):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/greedyopt.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --target_sparsity 0.9 \
      --base_step_size 0.05 \
      --last_fraction 0.25

  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/ppl_test.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --sparsity 0.5 \
      --greedy_flag
  ```
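At a high level, the block-wise greedy optimization raises per-block sparsity in small increments, at each step bumping the block whose increase hurts an error proxy the least, until the average reaches the target. The loop below is a hedged illustration of that idea with a hypothetical `block_error` oracle; it is not `teal/greedyopt.py`, which additionally uses `--base_step_size` and `--last_fraction` to shape the schedule:

```python
from typing import Callable, List

def greedy_sparsity_allocation(
    num_blocks: int,
    target_sparsity: float,
    step_size: float,
    block_error: Callable[[int, float], float],
) -> List[float]:
    """Greedily distribute sparsity across blocks until the mean reaches the target.

    `block_error(i, s)` is a hypothetical oracle for the error proxy (e.g. activation
    reconstruction error) of block i at sparsity s.
    """
    sparsities = [0.0] * num_blocks
    while sum(sparsities) / num_blocks < target_sparsity - 1e-9:
        # Pick the block whose next increment increases the error proxy the least.
        best = min(
            (i for i in range(num_blocks) if sparsities[i] + step_size <= 1.0 + 1e-9),
            key=lambda i: block_error(i, sparsities[i] + step_size) - block_error(i, sparsities[i]),
        )
        sparsities[best] += step_size
    return sparsities

# Toy usage with a made-up quadratic error proxy: later blocks tolerate sparsity better.
err = lambda i, s: (1.0 - i / 10) * s ** 2
print(greedy_sparsity_allocation(num_blocks=4, target_sparsity=0.6, step_size=0.05, block_error=err))
```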
If you find TEAL useful, please consider citing:

```bibtex
@misc{liu2024trainingfreeactivationsparsitylarge,
      title={Training-Free Activation Sparsity in Large Language Models},
      author={James Liu and Pragaash Ponnusamy and Tianle Cai and Han Guo and Yoon Kim and Ben Athiwaratkun},
      year={2024},
      eprint={2408.14690},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.14690},
}
```
