llama.cu

A pure CUDA implementation of the LLaMA model for high-performance inference and educational purposes. Supports LLaMA 1, 2, and 3 architectures.

This repository demonstrates how to run LLaMA inference using CUDA C++, making it ideal for learning GPU acceleration techniques and understanding transformer internals with minimal dependencies.

Features

  • Pure CUDA Implementation – Direct CUDA kernels for maximum performance without heavy ML frameworks
  • Optimized Matrix Operations – Custom CUDA kernels for matrix multiplication and attention mechanisms (a simplified matrix-vector kernel is sketched after this list)
  • Educational – Clean, readable CUDA code with inline documentation for learning GPU programming
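
To make the matrix-multiplication bullet concrete, here is a minimal, illustrative sketch of a naive row-per-thread matrix-vector kernel, the kind of operation that dominates single-batch transformer inference. The kernel and its names (matvec_naive, W, x, out) are hypothetical and not taken from this repository's source; the real kernels are more heavily optimized.

// Illustrative sketch, not code from this repo: one thread computes one
// output row of out = W * x, with W stored row-major as (d x n).
__global__ void matvec_naive(const float* W, const float* x, float* out,
                             int n, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= d) return;
    float acc = 0.0f;
    for (int j = 0; j < n; j++) {
        acc += W[row * n + j] * x[j];
    }
    out[row] = acc;
}

// Hypothetical launch: one thread per output row.
// int threads = 256;
// matvec_naive<<<(d + threads - 1) / threads, threads>>>(d_W, d_x, d_out, n, d);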

Usage

make
./llama stories15M.bin

The example above uses one of the small TinyStories checkpoints trained by Andrej Karpathy (such as stories15M.bin) for demonstration.

Building

Requires NVIDIA CUDA Toolkit (11.0 or later):

make

Or using CMake:

mkdir build && cd build
cmake ..
make

TODO

  • Implement FP16 (float16) version for better memory efficiency and performance
  • Add Flash Attention for faster attention computation (the underlying online-softmax idea is sketched below)
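
As a rough illustration of the Flash Attention item, the core trick is an online softmax: scores are folded into a running max and running sum so attention can be computed in one pass over the keys without materializing the full score matrix. The sketch below is a simplified, single-thread device function with hypothetical names (attend_one_query, K, V, head_dim); a real Flash Attention kernel tiles keys and values through shared memory and parallelizes the work across a thread block.

// Illustrative sketch only: online-softmax attention for one query vector.
__device__ void attend_one_query(const float* q, const float* K,
                                 const float* V, float* out,
                                 int seq_len, int head_dim) {
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    for (int i = 0; i < head_dim; i++) out[i] = 0.0f;

    for (int t = 0; t < seq_len; t++) {
        // score = dot(q, K[t]) / sqrt(head_dim)
        float s = 0.0f;
        for (int i = 0; i < head_dim; i++) s += q[i] * K[t * head_dim + i];
        s *= rsqrtf((float)head_dim);

        // Rescale the running accumulator whenever the max changes, so the
        // softmax normalization stays correct without storing all scores.
        float m_new = fmaxf(m, s);
        float scale = expf(m - m_new);
        float p = expf(s - m_new);
        for (int i = 0; i < head_dim; i++)
            out[i] = out[i] * scale + p * V[t * head_dim + i];
        l = l * scale + p;
        m = m_new;
    }
    for (int i = 0; i < head_dim; i++) out[i] /= l;   // final normalization
}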

Related Work

If you're interested in LLaMA implementations in other languages, the projects credited in the Acknowledgments below are good starting points.

Acknowledgments

Inspired by llama2.c, llama3.cuda and the broader LLaMA community. This project aims to provide a GPU-accelerated alternative for educational purposes.

License

MIT
