llama.cu

A pure CUDA implementation of the LLaMA model for high-performance inference and educational purposes. Supports LLaMA 1, 2, and 3 architectures.

This repository demonstrates how to run LLaMA inference using CUDA C++, making it ideal for learning GPU acceleration techniques and understanding transformer internals with minimal dependencies.

Features

  • Pure CUDA Implementation – Direct CUDA kernels for maximum performance without heavy ML frameworks
  • Optimized Matrix Operations – Custom CUDA kernels for matrix multiplication and attention mechanisms (a simplified matrix-vector kernel is sketched after this list)
  • Educational – Clean, readable CUDA code with inline documentation for learning GPU programming
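
To make the matrix-multiplication bullet concrete, here is a minimal, illustrative sketch of a naive row-per-thread matrix-vector kernel, the kind of operation that dominates single-batch transformer inference. The kernel and its names (matvec_naive, W, x, out) are hypothetical and not taken from this repository's source; the real kernels are more heavily optimized.

// Illustrative sketch, not code from this repo: one thread computes one
// output row of out = W * x, with W stored row-major as (d x n).
__global__ void matvec_naive(const float* W, const float* x, float* out,
                             int n, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= d) return;
    float acc = 0.0f;
    for (int j = 0; j < n; j++) {
        acc += W[row * n + j] * x[j];
    }
    out[row] = acc;
}

// Hypothetical launch: one thread per output row.
// int threads = 256;
// matvec_naive<<<(d + threads - 1) / threads, threads>>>(d_W, d_x, d_out, n, d);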

Usage

make
./llama stories15M.bin

The example above uses one of the small TinyStories checkpoints trained by Andrej Karpathy (such as stories15M.bin) for demonstration.

Building

Requires NVIDIA CUDA Toolkit (11.0 or later):

make

Or using CMake:

mkdir build && cd build
cmake ..
make

TODO

  • Implement FP16 (float16) version for better memory efficiency and performance
  • Add Flash Attention for faster attention computation (the underlying online-softmax idea is sketched below)
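
As a rough illustration of the Flash Attention item, the core trick is an online softmax: scores are folded into a running max and running sum so attention can be computed in one pass over the keys without materializing the full score matrix. The sketch below is a simplified, single-thread device function with hypothetical names (attend_one_query, K, V, head_dim); a real Flash Attention kernel tiles keys and values through shared memory and parallelizes the work across a thread block.

// Illustrative sketch only: online-softmax attention for one query vector.
__device__ void attend_one_query(const float* q, const float* K,
                                 const float* V, float* out,
                                 int seq_len, int head_dim) {
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    for (int i = 0; i < head_dim; i++) out[i] = 0.0f;

    for (int t = 0; t < seq_len; t++) {
        // score = dot(q, K[t]) / sqrt(head_dim)
        float s = 0.0f;
        for (int i = 0; i < head_dim; i++) s += q[i] * K[t * head_dim + i];
        s *= rsqrtf((float)head_dim);

        // Rescale the running accumulator whenever the max changes, so the
        // softmax normalization stays correct without storing all scores.
        float m_new = fmaxf(m, s);
        float scale = expf(m - m_new);
        float p = expf(s - m_new);
        for (int i = 0; i < head_dim; i++)
            out[i] = out[i] * scale + p * V[t * head_dim + i];
        l = l * scale + p;
        m = m_new;
    }
    for (int i = 0; i < head_dim; i++) out[i] /= l;   // final normalization
}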

Related Work

If you're interested in LLaMA implementations in other languages, the projects credited in the Acknowledgments below are good starting points.

Acknowledgments

Inspired by llama2.c, llama3.cuda and the broader LLaMA community. This project aims to provide a GPU-accelerated alternative for educational purposes.

License

MIT
