forked from harubaru/convogpt
Closed
Labels: enhancement (New feature or request)
Description
For our use case of fine-tuning LMs on sequences of up to 2048 tokens, flash attention might get us a ~2-4x speedup and a VRAM usage reduction of up to 10x. Sounds pretty amazing, so I'd like to give it a shot. Some code inspiration:
- diffusers PR 532: shows how to use flash attention through xformers, probably the least painful way to go about it (see the sketch after this list)
- GPT-NeoX PR 725: flash attention implementation in EleutherAI's fork of Megatron
- The official FlashAttention repo
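
To make the xformers route concrete, here is a minimal sketch of calling its memory-efficient / flash attention kernel directly on GPT-style tensors. The batch size, head count, head dim, dtype, and causal-mask choice are illustrative assumptions, not code from this repo:

```python
# Minimal sketch: xformers' memory-efficient / flash attention kernel,
# using the [batch, seq_len, n_heads, head_dim] layout it expects.
# Sizes are made up for illustration; 2048 matches our target context length.
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 2, 2048, 16, 64

q = torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# LowerTriangularMask provides the causal masking a decoder-only LM needs.
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    p=0.0,  # attention dropout probability
)

# Back to [batch, seq_len, hidden] before the output projection.
out = out.reshape(batch, seq_len, n_heads * head_dim)
print(out.shape)  # torch.Size([2, 2048, 1024])
```

Wiring this into our models would mean swapping the existing attention forward for a call like the one above; the diffusers PR does essentially that via a monkey-patchable attention processor.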
Status: ✅ Done