Training-Free Activation Sparsity in Large Language Models

[Paper][Blog]

TEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, yielding up to a 1.53-1.8x speedup in single-batch decoding.
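
Under the hood, TEAL zeroes activations whose magnitude falls below a calibrated per-tensor threshold, so the matching weight rows/columns can be skipped during decoding. A minimal PyTorch sketch of the thresholding idea (the helper name and threshold value are illustrative, not TEAL's actual API):

import torch

def sparsify(x: torch.Tensor, t: float) -> torch.Tensor:
    # Zero out entries whose magnitude falls below the calibrated threshold t;
    # downstream matvecs can then skip the corresponding weight columns.
    return torch.where(x.abs() >= t, x, torch.zeros_like(x))

x = torch.randn(4096)
print((sparsify(x, t=0.674) == 0).float().mean().item())  # ~0.5 for N(0, 1) inputs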

The current release supports:

  • FP16 inference for Llama-2/3 models using uniform sparsities
  • Accuracy evaluation for Llama-2/3 and Mistral models using uniform and block-wise greedy sparsities

Prerequisites

  1. Open the repository in the devcontainer.

  2. Download model weights and convert them to gpt-fast format (scripts/prepare.sh):

repo_id=meta-llama/Llama-2-7b-hf
python scripts/download.py \
--repo_id $repo_id \
--path $SAVE_PATH && \
python scripts/convert_hf_checkpoint.py \
--checkpoint_dir $SAVE_PATH/$repo_id
  3. Repeat step 2 with a different repo_id to use other models.
  • Supported models:

    • meta-llama/Llama-2-7b-hf

    (The models below are claimed to be supported, but the exact repository IDs are unverified best guesses.)

    • meta-llama/Llama-2-13b-hf
    • meta-llama/Llama-2-70b-hf
    • meta-llama/Meta-Llama-3-8B
    • meta-llama/Meta-Llama-3-70B
    • mistralai/Mistral-7B-v0.1

Inference Usage

For convenience, we provide calibrated thresholds for Llama-2/3 and Mistral models in the models/ folder.

  1. Navigate to gpt-fast:
cd gpt-fast
  2. Run dense inference (scripts/base_run.sh):
repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python generate.py \
--compile \
--checkpoint_path $SAVE_PATH/$repo_id/model.pth \
--interactive
  3. Run sparse inference! (scripts/run.sh):
repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python generate.py \
--compile \
--checkpoint_path $SAVE_PATH/$repo_id/model.pth \
--hist_path ../models/$repo_id/histograms \
--sparsity 0.5 \
--interactive

To benchmark inference speed, remove --interactive.
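
For example, to benchmark sparse decoding:

repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python generate.py \
--compile \
--checkpoint_path $SAVE_PATH/$repo_id/model.pth \
--hist_path ../models/$repo_id/histograms \
--sparsity 0.5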

Please treat the current inference implementation as just a proof of concept! There are a few limitations:

  • Only FP16 is supported, as Triton does not currently support BF16 atomic_add (see the kernel sketch after this list).
  • Block-wise greedy sparsities are not currently supported (expect to have this very soon!).
  • Quantized sparse kernels are not currently supported (though, would love a PR!).
  • Speculative decoding is untested.
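
On the first limitation: a kernel that splits each output's dot product across program instances must combine partial sums with atomic adds, and Triton exposes atomic_add for FP16 but not BF16. Below is a minimal, illustrative split-K sparse GEMV fragment showing where the atomic add arises (a sketch of the general pattern, not TEAL's actual kernel):

import triton
import triton.language as tl

@triton.jit
def sparse_gemv_kernel(x_ptr, w_ptr, y_ptr, t, K, BLOCK_K: tl.constexpr):
    # One program per (output row, K-block); y must be zero-initialized.
    pid_n = tl.program_id(0)
    pid_k = tl.program_id(1)
    offs = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
    mask = offs < K
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    x = tl.where(tl.abs(x) >= t, x, 0.0)  # TEAL-style magnitude thresholding
    w = tl.load(w_ptr + pid_n * K + offs, mask=mask, other=0.0)
    partial = tl.sum(x * w, axis=0)
    # K-blocks race on y[pid_n], hence the atomic add (FP16-only in Triton).
    tl.atomic_add(y_ptr + pid_n, partial)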

Accuracy Usage

  1. Navigate to TEAL:
cd TEAL
  2. Construct histograms for threshold calibration (scripts/grab_acts.bash; a calibration sketch follows this list):
repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python teal/grab_acts.py \
--model_name $SAVE_PATH/$repo_id \
--output_path $OUTPUT_PATH/$repo_id
  3. Run the perplexity test (scripts/ppl_test.bash):
repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \
--model_name $SAVE_PATH/$repo_id \
--teal_path $OUTPUT_PATH/$repo_id \
--sparsity 0.5
  4. (Optional) Run block-wise greedy optimization (scripts/greedyopt.bash), then re-run the perplexity test with --greedy_flag:
repo_id=meta-llama/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python teal/greedyopt.py \
--model_name $SAVE_PATH/$repo_id \
--teal_path $OUTPUT_PATH/$repo_id \
--target_sparsity 0.9 \
--base_step_size 0.05 \
--last_fraction 0.25
CUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \
--model_name $SAVE_PATH/$repo_id \
--teal_path $OUTPUT_PATH/$repo_id \
--sparsity 0.5 \
--greedy_flag
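
For reference, grab_acts.py builds histograms of activations, and a uniform threshold for a target sparsity level can be read off the distribution of magnitudes. A minimal NumPy sketch of that calibration step (array names and the Gaussian toy data are illustrative, not the repo's actual file format):

import numpy as np

def calibrate_threshold(counts, bin_edges, sparsity):
    # Smallest magnitude t such that a `sparsity` fraction of activations
    # falls below t (and would therefore be zeroed at inference time).
    cdf = np.cumsum(counts) / counts.sum()
    idx = np.searchsorted(cdf, sparsity)
    return bin_edges[idx + 1]

acts = np.abs(np.random.randn(100_000))         # stand-in activation magnitudes
counts, edges = np.histogram(acts, bins=256)
print(calibrate_threshold(counts, edges, 0.5))  # ~0.674 for N(0, 1)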

Citation

If you find TEAL useful, please consider citing:

@misc{liu2024trainingfreeactivationsparsitylarge,
      title={Training-Free Activation Sparsity in Large Language Models}, 
      author={James Liu and Pragaash Ponnusamy and Tianle Cai and Han Guo and Yoon Kim and Ben Athiwaratkun},
      year={2024},
      eprint={2408.14690},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.14690}, 
}
