forked from harubaru/convogpt
Closed
Labels: enhancement (New feature or request)
Description
For our use case of fine-tuning LMs on sequences of up to 2048 tokens, flash attention might get us a ~2-4x speedup and a VRAM usage reduction of up to 10x. Sounds pretty amazing, so I'd like to give it a shot. Some code inspiration:
- diffusers PR 532: shows how to use flash attention through xformers, probably the least painful way to go about it (see the sketch after this list)
- GPT-NeoX PR 725: flash attention implementation in EleutherAI's fork of Megatron
- The official FlashAttention repo
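
To make the xformers route concrete, here is a minimal sketch of calling its memory-efficient / flash attention kernel directly on GPT-style tensors. The batch size, head count, head dim, dtype, and causal-mask choice are illustrative assumptions, not code from this repo:

```python
# Minimal sketch: xformers' memory-efficient / flash attention kernel,
# using the [batch, seq_len, n_heads, head_dim] layout it expects.
# Sizes are made up for illustration; 2048 matches our target context length.
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 2, 2048, 16, 64

q = torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# LowerTriangularMask provides the causal masking a decoder-only LM needs.
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    p=0.0,  # attention dropout probability
)

# Back to [batch, seq_len, hidden] before the output projection.
out = out.reshape(batch, seq_len, n_heads * head_dim)
print(out.shape)  # torch.Size([2, 2048, 1024])
```

Wiring this into our models would mean swapping the existing attention forward for a call like the one above; the diffusers PR does essentially that via a monkey-patchable attention processor.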
Status: ✅ Done