This repository provides pre-built wheels for flash-attention.
Since building flash-attention takes a long time and is resource-intensive, I also build and provide wheels for CUDA and PyTorch combinations that are not officially distributed.
The GitHub Actions workflow used for building can be found here.
The built packages are available on the releases page.
This repository uses a self-hosted runner and AWS CodeBuild for building the wheels. If you find this project helpful, please consider sponsoring to help maintain the infrastructure!
Special thanks to @KiralyCraft for providing the computing resources used to build the wheels!
- Select the versions of Python, CUDA, PyTorch, and flash_attn.

  ```bash
  flash_attn-[flash_attn Version]+cu[CUDA Version]torch[PyTorch Version]-cp[Python Version]-cp[Python Version]-linux_x86_64.whl

  # Example: Python 3.12, CUDA 12.4, PyTorch 2.5, and flash_attn 2.6.3
  flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
  ```

- Find the corresponding version of the wheel on the Packages page and the releases page.
- Direct install, or download and local install:
```bash
# Direct install
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl

# Download and local install
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
pip install ./flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
```
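To double-check that a wheel matches your environment, you can compare the versions encoded in the filename with what is installed locally, and confirm the import afterwards (a minimal sketch, assuming PyTorch is already installed):

```bash
# Match these against the wheel filename: cp312 = Python 3.12, cu124 = CUDA 12.4, torch2.5
python -V
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# After installing, confirm that flash_attn imports and reports its version
python -c "import flash_attn; print(flash_attn.__version__)"
```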
> [!NOTE]
> Since v0.7.0, wheels are built for the manylinux_2_28 platform. These wheels are not compatible with old glibc versions (<= 2.17); they require glibc >= 2.28.
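You can check the glibc version on your system with `ldd`:

```bash
# The first line of output includes the glibc version, e.g. 2.35
ldd --version
```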
> [!NOTE]
> Since v0.5.0, wheels are built with a local version label indicating the CUDA and PyTorch versions.
> For example, `pip list` now shows `flash_attn==2.8.3+cu130torch2.9` instead of `flash_attn==2.8.3`.
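If you want to pin an exact build in a requirements file, one option is a PEP 508 direct URL (a sketch reusing the example wheel URL from above):

```bash
# Append a direct-URL requirement; pip will install that exact wheel
echo 'flash_attn @ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl' >> requirements.txt
```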
See `./docs/packages.md` for the full list of available packages.
The history of this repository is available here.
If you use this repository in your research and find it helpful, please cite it!
```bibtex
@misc{flash-attention-prebuild-wheels,
  author = {Morioka, Junya},
  year = {2025},
  title = {mjun0812/flash-attention-prebuild-wheels},
  url = {https://github.com/mjun0812/flash-attention-prebuild-wheels},
  howpublished = {https://github.com/mjun0812/flash-attention-prebuild-wheels},
}
```

Thanks to everyone who has supported this project:

- @okaris : Sponsored me!
- @xhiroga : Sponsored me!
- cjustus613 : Buy me a coffee!
- @KiralyCraft : Provided computing resources!
- @kun432 : Buy me a coffee!
- @wodeyuzhou : Sponsored me!
- Gabr1e1 : Buy me a coffee!
Please also cite the original FlashAttention papers:

```bibtex
@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}

@inproceedings{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```

If you cannot find the version you are looking for, you can fork this repository and build a wheel on GitHub Actions:
- Fork this repository.
- Edit the Python script `create_matrix.py` to set the versions you want to build. You can use GitHub-hosted runners or self-hosted runners with the settings below.
- Add a tag matching `v*.*.*` to trigger the build workflow: `git tag v*.*.* && git push --tags` (a concrete example follows this list).
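For example, to trigger a build for a hypothetical release v1.0.0:

```bash
# Create the tag locally and push it; pushing the tag starts the build workflow
git tag v1.0.0
git push --tags
```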
Note that some version combinations may fail to build.

For some version combinations, wheels cannot be built on GitHub-hosted runners because of the job time limit. For these versions, you can use self-hosted runners.
Obtain a runner registration token:

```bash
gh api \
  -X POST \
  /repos/[OWNER]/[REPOSITORY]/actions/runners/registration-token
```
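If you only need the token value itself, `gh api` can filter the JSON response (a small convenience sketch):

```bash
# Print only the "token" field from the registration-token response
gh api -X POST /repos/[OWNER]/[REPOSITORY]/actions/runners/registration-token --jq '.token'
```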
Clone the repository and navigate to the `self-hosted-runner` directory:

```bash
git clone https://github.com/mjun0812/flash-attention-prebuild-wheels.git
cd flash-attention-prebuild-wheels/self-hosted-runner
```

Create environment files from the template, one per architecture you want to build:
```bash
# For x86_64
cp env.template env
# For ARM64
cp env.template env.arm
```

Edit the environment file(s) to set the required variables:
```bash
# GitHub Personal Access Token
PERSONAL_ACCESS_TOKEN=[GitHub Personal Access Token]
# or a registration token for the GitHub Actions runner
REGISTRY_TOKEN=[Runner Registry Token]
# Optional
RUNNER_LABELS=Linux,self-hosted
```

If you are using a fork of this repository, edit the `compose.yml` file:
```yaml
services:
  runner:
    platform: linux/amd64
    privileged: true
    restart: always
    env_file:
      - .env
    environment:
      REPOSITORY_URL: https://github.com/[YOUR_USERNAME]/flash-attention-prebuild-wheels
      RUNNER_NAME: self-hosted-runner
      RUNNER_GROUP: default
      TARGET_ARCH: x64
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GH_RUNNER_VERSION: 2.329.0
        TARGET_ARCH: x64
        PLATFORM: linux/amd64
    volumes:
      - fa-self:/var/lib/docker

  runner-arm:
    platform: linux/arm64
    privileged: true
    restart: always
    env_file:
      - .env.arm
    environment:
      REPOSITORY_URL: https://github.com/[YOUR_USERNAME]/flash-attention-prebuild-wheels
      RUNNER_NAME: self-hosted-runner-arm
      RUNNER_GROUP: default
      TARGET_ARCH: arm64
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GH_RUNNER_VERSION: 2.329.0
        TARGET_ARCH: arm64
        PLATFORM: linux/arm64
    volumes:
      - fa-self-arm:/var/lib/docker
```

Build and run the Docker container(s):
```bash
# x86_64 runner
docker compose build runner
docker compose up -d runner

# ARM64 runner (optional)
docker compose build runner-arm
docker compose up -d runner-arm
```
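Once a container is up, you can confirm that the runner registered with GitHub by following its logs (`runner` is the service name from the compose file above):

```bash
# A successfully registered runner typically logs "Listening for Jobs"
docker compose logs -f runner
```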
This repository builds wheels across multiple platforms and environments:

| Platform | Runner Type | Container Image |
|---|---|---|
| Linux x86_64 | GitHub-hosted (`ubuntu-22.04`) | - |
| Linux x86_64 | Self-hosted | `ubuntu:22.04` or `manylinux_2_28_x86_64` |
| Linux ARM64 | GitHub-hosted (`ubuntu-22.04-arm`) | - |
| Windows x86_64 | GitHub-hosted (`windows-2022`) | - |
| Windows x86_64 | AWS CodeBuild | - |
