TEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, resulting in up to a 1.53-1.8x speedup in single-batch decoding.
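Concretely, the sparsity is magnitude-based: entries of each hidden activation whose absolute value falls below a calibrated threshold are zeroed, so the matching weight columns contribute nothing to the matrix multiply. A minimal PyTorch sketch of the thresholding idea (conceptual only, not the repository's fused kernel):

```python
import torch

def sparsify(x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero out activations whose magnitude falls below `threshold` (conceptual TEAL-style thresholding)."""
    return torch.where(x.abs() > threshold, x, torch.zeros_like(x))

# Toy example: a cutoff at the 50th percentile of |x| zeroes roughly half of the entries.
x = torch.randn(1, 4096)
t = x.abs().quantile(0.5).item()
print((sparsify(x, t) == 0).float().mean())  # ~0.5
```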
The current release supports:
- FP16 inference for Llama-2/3 models using uniform sparsities
- Accuracy evaluation for Llama-2/3 and Mistral models using uniform and block-wise greedy sparsities
- Open the repository in the devcontainer.
- Download model weights and convert them to the gpt-fast format (`scripts/prepare.sh`):

  ```bash
  repo_id=meta-llama/Llama-2-7b-hf \
  python scripts/download.py \
      --repo_id $repo_id \
      --path $SAVE_PATH && \
  python scripts/convert_hf_checkpoint.py \
      --checkpoint_dir $SAVE_PATH/$repo_id
  ```

- Repeat the previous step with a different `repo_id` if you want to use other models.
- Supported models:
  - meta-llama/Llama-2-7b-hf
  - meta-llama/Llama-2-13b-hf
  - meta-llama/Llama-2-70b-hf
  - meta-llama/Meta-Llama-3-8B
  - meta-llama/Meta-Llama-3-70B
  - mistralai/Mistral-7B-v0.1

  (The models after the first are claimed to be supported, but the exact repository IDs above are a best guess and may need verification.)

For easy usage, we provide calibrated thresholds for Llama-2/3 and Mistral models in the `models/` folder.
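For intuition, a calibrated threshold is essentially a quantile of the activation-magnitude distribution gathered offline: choosing a sparsity level p means taking the p-th quantile of |activation| for that projection as the cutoff. A hedged sketch of the histogram-to-threshold step (the binning below is illustrative, not the on-disk layout of the provided files):

```python
import torch

def calibrate_threshold(magnitudes: torch.Tensor, sparsity: float, bins: int = 512) -> float:
    """Return a cutoff such that roughly `sparsity` of the |activation| samples fall below it.

    `magnitudes` stands in for activation magnitudes collected on a calibration set;
    the histogram format here is illustrative, not this repository's file format.
    """
    hist, edges = torch.histogram(magnitudes.float(), bins=bins)
    cdf = torch.cumsum(hist, dim=0) / hist.sum()
    idx = int(torch.searchsorted(cdf, sparsity))
    return float(edges[min(idx + 1, bins)])

# Toy usage: |N(0, 1)| samples at 50% sparsity give a cutoff near 0.674 (the half-normal median).
mags = torch.randn(100_000).abs()
print(calibrate_threshold(mags, 0.5))
```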
- Navigate to gpt-fast:

  ```bash
  cd gpt-fast
  ```

- Run dense inference (`scripts/base_run.sh`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python generate.py \
      --compile \
      --checkpoint_path $SAVE_PATH/$repo_id/model.pth \
      --interactive
  ```

- Run sparse inference! (`scripts/run.sh`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python generate.py \
      --compile \
      --checkpoint_path $SAVE_PATH/$repo_id/model.pth \
      --hist_path ../models/$repo_id/histograms \
      --sparsity 0.5 \
      --interactive
  ```

  To benchmark inference speed, remove `--interactive`.
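The decoding speedup comes from skipping memory traffic: with y = Wx and many entries of x zeroed, only the columns of W that multiply surviving activations need to be read. The repository implements this as a fused Triton kernel; the dense-indexing PyTorch sketch below only illustrates the arithmetic, not the memory savings:

```python
import torch

def sparse_gemv(W: torch.Tensor, x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Compute y = W @ sparsify(x) using only the columns of W whose activations survive.

    Boolean indexing copies W[:, idx] here, so this is illustration only; the fused Triton
    kernel avoids loading the skipped columns of W in the first place.
    """
    idx = x.abs() > threshold
    return W[:, idx] @ x[idx]

W = torch.randn(4096, 4096)
x = torch.randn(4096)
t = x.abs().quantile(0.5).item()
y_sparse = sparse_gemv(W, x, t)
y_dense = W @ torch.where(x.abs() > t, x, torch.zeros_like(x))
assert torch.allclose(y_sparse, y_dense, atol=1e-3)
```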
Please treat the current inference implementation as just a proof of concept! There are a few limitations:

- Only FP16 is supported, as Triton does not currently support BF16 `atomic_add`.
- Block-wise greedy sparsities are not currently supported (expect this very soon!).
- Quantized sparse kernels are not currently supported (though we would love a PR!).
- Speculative decoding is untested.
For accuracy evaluation:

- Navigate to TEAL:

  ```bash
  cd TEAL
  ```

- Construct histograms for threshold calibration (`scripts/grab_acts.bash`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/grab_acts.py \
      --model_name $SAVE_PATH/$repo_id \
      --output_path $OUTPUT_PATH/$repo_id
  ```
- Run the perplexity test (`scripts/ppl_test.bash`):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/ppl_test.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --sparsity 0.5
  ```
- (Optional) Run block-wise greedy optimization (`scripts/greedyopt.bash`), then re-run the perplexity test with `--greedy_flag` (a conceptual sketch of the greedy loop follows this list):

  ```bash
  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/greedyopt.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --target_sparsity 0.9 \
      --base_step_size 0.05 \
      --last_fraction 0.25

  CUDA_VISIBLE_DEVICES=0 repo_id=meta-llama/Llama-2-7b-hf \
  python teal/ppl_test.py \
      --model_name $SAVE_PATH/$repo_id \
      --teal_path $OUTPUT_PATH/$repo_id \
      --sparsity 0.5 \
      --greedy_flag
  ```
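At a high level, the block-wise greedy optimization raises per-block sparsity in small increments, at each step bumping the block whose increase hurts an error proxy the least, until the average reaches the target. The loop below is a hedged illustration of that idea with a hypothetical `block_error` oracle; it is not `teal/greedyopt.py`, which additionally uses `--base_step_size` and `--last_fraction` to shape the schedule:

```python
from typing import Callable, List

def greedy_sparsity_allocation(
    num_blocks: int,
    target_sparsity: float,
    step_size: float,
    block_error: Callable[[int, float], float],
) -> List[float]:
    """Greedily distribute sparsity across blocks until the mean reaches the target.

    `block_error(i, s)` is a hypothetical oracle for the error proxy (e.g. activation
    reconstruction error) of block i at sparsity s.
    """
    sparsities = [0.0] * num_blocks
    while sum(sparsities) / num_blocks < target_sparsity - 1e-9:
        # Pick the block whose next increment increases the error proxy the least.
        best = min(
            (i for i in range(num_blocks) if sparsities[i] + step_size <= 1.0 + 1e-9),
            key=lambda i: block_error(i, sparsities[i] + step_size) - block_error(i, sparsities[i]),
        )
        sparsities[best] += step_size
    return sparsities

# Toy usage with a made-up quadratic error proxy: later blocks tolerate sparsity better.
err = lambda i, s: (1.0 - i / 10) * s ** 2
print(greedy_sparsity_allocation(num_blocks=4, target_sparsity=0.6, step_size=0.05, block_error=err))
```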
If you find TEAL useful, please consider citing:

```bibtex
@misc{liu2024trainingfreeactivationsparsitylarge,
      title={Training-Free Activation Sparsity in Large Language Models},
      author={James Liu and Pragaash Ponnusamy and Tianle Cai and Han Guo and Yoon Kim and Ben Athiwaratkun},
      year={2024},
      eprint={2408.14690},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.14690},
}
```
