
AttnBench

Benchmarking dense and sparse attention mechanisms using Swift and MLX on Apple Silicon.

Overview

AttnBench provides comprehensive benchmarks comparing attention variants used in transformer models, from standard dense attention to sparse/efficient mechanisms, with the goal of understanding how Apple Silicon (M1/M2/M3/M4) performs across different attention patterns.

Dense Attention (O(N²) complexity)

  • MHA (Multi-Head Attention): Standard attention where each head has its own K and V projections
  • GQA (Grouped Query Attention): K and V are shared across groups of query heads
  • MQA (Multi-Query Attention): All query heads share a single K and V head
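
The three dense variants above differ only in how many distinct key/value heads exist and how query heads are grouped onto them. A minimal sketch of that mapping in plain Swift (kvHeadIndex is an illustrative helper, not code from the repo's MLX implementation):

// Map a query head to the key/value head it reads from.
// MHA: kvHeads == heads; GQA: 1 < kvHeads < heads; MQA: kvHeads == 1.
func kvHeadIndex(forQueryHead q: Int, heads: Int, kvHeads: Int) -> Int {
    precondition(heads % kvHeads == 0, "query heads must split evenly into KV groups")
    return q / (heads / kvHeads)
}

// 16 query heads:
// MHA (kvHeads = 16) -> [0, 1, 2, ..., 15]   (one KV head per query head)
// GQA (kvHeads = 4)  -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
// MQA (kvHeads = 1)  -> all zeros            (a single shared KV head)
let gqaMapping = (0..<16).map { kvHeadIndex(forQueryHead: $0, heads: 16, kvHeads: 4) }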

Sparse/Efficient Attention (Sub-quadratic complexity)

  • Sliding Window Attention (SWA): O(N·W) complexity - only attends to a local window around each position
  • Block-Sparse Attention: Divides the attention matrix into blocks, computing only local + global patterns
  • Linear Attention: O(N) complexity via the kernel trick: Q(K^TV) instead of (QK^T)V
  • Causal Linear Attention: O(N) causal attention using cumulative sums

Why This Matters

Research Value

Most "sliding window" implementations in high-level frameworks are not truly sparse: they mask the full N×N score matrix, so the compute is still O(N²). This benchmark implements genuinely efficient attention using tensor slicing and gather operations.

Key questions this benchmark helps answer:

  1. At what sequence length does SWA become faster than MHA on M-series chips?
  2. Do M4 AMX units prefer dense matmuls or batched small matmuls?
  3. What is the fixed overhead of linear attention, and at what sequence length do its savings over quadratic attention pay off?
  4. How well does MLX handle gather/slice operations vs dense compute?

Novel Implementations

Mechanism          Complexity      MLX Challenge
Sliding Window     O(N·W)          Tests gather/slice performance vs dense on AMX
Block-Sparse       O(N·B + N·G)    Tests batched small matmuls vs one giant matmul
Linear Attention   O(N·D²)         Tests numerical stability with feature maps
Causal Linear      O(N·D²)         Tests cumsum scan operation efficiency

Requirements

  • macOS 13.3 or later
  • Apple Silicon (M1/M2/M3/M4)
  • Xcode (required for Metal shader compilation)
  • CMake and Ninja (for building)

Installation

Install Build Tools

brew install cmake ninja

Install Xcode

Xcode is required to compile Metal shaders. Install it from the Mac App Store, then:

sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
xcodebuild -downloadComponent MetalToolchain

Building

mkdir build && cd build
cmake .. -G Ninja
ninja

Running

./build/AttnBench

Sample Output

name,b,n,dModel,heads,kvHeads,iters,msPerIter
MHA,1,128,1024,16,16,50,1.4981
GQA(kv=4),1,128,1024,16,4,50,1.1278
MQA(kv=1),1,128,1024,16,1,50,1.3023
SWA(w=32),1,128,1024,16,16,50,2.1456
SWA(w=64),1,128,1024,16,16,50,2.8934
BlockSparse(bs=32),1,128,1024,16,16,50,1.7823
LinearAttn,1,128,1024,16,16,50,0.9234
CausalLinearAttn,1,128,1024,16,16,50,1.4521
...

Output Columns

Column      Description
name        Attention mechanism type
b           Batch size
n           Sequence length
dModel      Model dimension
heads       Number of query heads
kvHeads     Number of key/value heads
iters       Number of iterations
msPerIter   Milliseconds per iteration
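
Because the output is plain CSV, rows can be post-processed directly. A minimal sketch of a row parser for this column layout (BenchmarkRow and parseRow are illustrative helpers, not part of the repo):

// One row of the benchmark CSV, matching the columns above.
struct BenchmarkRow {
    let name: String
    let batch: Int, seqLen: Int, dModel: Int, heads: Int, kvHeads: Int, iters: Int
    let msPerIter: Double
}

func parseRow(_ line: String) -> BenchmarkRow? {
    let f = line.split(separator: ",").map(String.init)
    guard f.count == 8,
          let b = Int(f[1]), let n = Int(f[2]), let dModel = Int(f[3]),
          let heads = Int(f[4]), let kvHeads = Int(f[5]),
          let iters = Int(f[6]), let ms = Double(f[7]) else { return nil }
    return BenchmarkRow(name: f[0], batch: b, seqLen: n, dModel: dModel,
                        heads: heads, kvHeads: kvHeads, iters: iters, msPerIter: ms)
}

// Example: parseRow("MHA,1,128,1024,16,16,50,1.4981")?.msPerIter == 1.4981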

Attention Mechanisms Explained

Sliding Window Attention (SWA)

Instead of computing the full N×N attention matrix:

Standard: scores[i, j] for all i, j ∈ [0, N)  → O(N²)

SWA only computes scores within a window:

SWA: scores[i, j] for j ∈ [i - W/2, i + W/2]  → O(N·W)

Implementation: Uses padding + slicing to construct overlapping windows, avoiding the full matrix computation. This tests whether MLX's gather operations are efficient enough on Apple Silicon.
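
For reference, here is the windowed computation written out for a single head in plain Swift. This is a sketch of the math only, not the MLX slicing implementation in Sources/AttnBenchLib:

import Foundation

// Reference sliding-window attention for a single head, using plain Swift
// arrays rather than MLX tensors. Each query position i only scores keys in
// [i - w/2, i + w/2], so the work is O(N·W·D) instead of O(N²·D).
func slidingWindowAttention(q: [[Double]], k: [[Double]], v: [[Double]], window w: Int) -> [[Double]] {
    let n = q.count, d = q[0].count, dv = v[0].count
    let scale = 1.0 / sqrt(Double(d))
    var out = [[Double]](repeating: [Double](repeating: 0, count: dv), count: n)
    for i in 0..<n {
        let lo = max(0, i - w / 2)
        let hi = min(n - 1, i + w / 2)
        // Scores only for the keys inside the window.
        var scores = [Double]()
        for j in lo...hi {
            var dot = 0.0
            for a in 0..<d { dot += q[i][a] * k[j][a] }
            scores.append(dot * scale)
        }
        // Softmax over the window (subtract the max for numerical stability).
        let m = scores.max()!
        let weights = scores.map { exp($0 - m) }
        let z = weights.reduce(0, +)
        // Weighted sum of the windowed values.
        for (idx, j) in (lo...hi).enumerated() {
            for c in 0..<dv { out[i][c] += (weights[idx] / z) * v[j][c] }
        }
    }
    return out
}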

Block-Sparse Attention

Divides the sequence into blocks and computes:

  1. Local attention: Each block attends only to itself
  2. Global attention: First G tokens attend to/from all positions (sink tokens)

┌───┬───┬───┬───┐
│ ■ │   │   │ ■ │  ← Global tokens attend everywhere
├───┼───┼───┼───┤
│ ■ │ ■ │   │   │  ← Local block attention
├───┼───┼───┼───┤
│ ■ │   │ ■ │   │
├───┼───┼───┼───┤
│ ■ │   │   │ ■ │
└───┴───┴───┴───┘

Implementation: Reshapes into blocks [B, H, N/Block, Block, D] and performs block-wise matmuls. Tests whether M-series chips prefer fewer large kernels or many small ones.
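
A sketch of which score blocks are actually computed under this pattern (plain Swift; computedBlockPairs is an illustrative helper describing the pattern, not code from the repo). Each computed pair corresponds to one small Block×Block matmul:

// Which (queryBlock, keyBlock) pairs are computed under the local + global
// pattern above. With N/B blocks, only the diagonal (local) blocks plus the
// first block row and column (global / sink tokens) are materialised,
// instead of all (N/B)² pairs.
func computedBlockPairs(sequenceLength n: Int, blockSize b: Int) -> [(Int, Int)] {
    let numBlocks = n / b
    var pairs = [(Int, Int)]()
    for qb in 0..<numBlocks {
        for kb in 0..<numBlocks {
            let isLocal = (qb == kb)            // each block attends to itself
            let isGlobal = (qb == 0 || kb == 0) // first block holds the global tokens
            if isLocal || isGlobal { pairs.append((qb, kb)) }
        }
    }
    return pairs
}

// Example: N = 128, B = 32 gives 4 blocks and 10 computed pairs instead of 16.
let computed = computedBlockPairs(sequenceLength: 128, blockSize: 32)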

Linear Attention

Rewrites softmax attention using kernel feature maps:

Standard: Attn = softmax(QK^T / √d) @ V          → O(N²)
Linear:   Attn ≈ φ(Q) @ (φ(K)^T @ V) / norm     → O(N·D²)

The key insight: computing φ(K)^T @ V first gives [D, D] instead of [N, N].

Feature map: Uses ELU(x) + 1 to ensure non-negative values (required for valid attention weights).
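
A plain Swift sketch of that reordering for a single head (eluPlusOne and linearAttention are illustrative names; the repo's MLX version operates on batched tensors):

import Foundation

// ELU(x) + 1 feature map: strictly positive, so the resulting attention
// weights are valid without a softmax.
func eluPlusOne(_ x: [Double]) -> [Double] {
    x.map { $0 > 0 ? $0 + 1 : exp($0) }
}

// Non-causal linear attention for a single head, plain Swift reference.
// The D×D matrix φ(K)ᵀV and the length-D vector Σφ(k) are built once in
// O(N·D²); each query then costs O(D²), so the total is O(N·D²) rather than O(N²·D).
func linearAttention(q: [[Double]], k: [[Double]], v: [[Double]]) -> [[Double]] {
    let n = q.count, d = k[0].count, dv = v[0].count
    var kv = [[Double]](repeating: [Double](repeating: 0, count: dv), count: d) // φ(K)ᵀ V
    var kSum = [Double](repeating: 0, count: d)                                 // Σ φ(k)
    for j in 0..<n {
        let phiK = eluPlusOne(k[j])
        for a in 0..<d {
            kSum[a] += phiK[a]
            for c in 0..<dv { kv[a][c] += phiK[a] * v[j][c] }
        }
    }
    var out = [[Double]](repeating: [Double](repeating: 0, count: dv), count: n)
    for i in 0..<n {
        let phiQ = eluPlusOne(q[i])
        var norm = 0.0
        for a in 0..<d { norm += phiQ[a] * kSum[a] }
        for c in 0..<dv {
            var acc = 0.0
            for a in 0..<d { acc += phiQ[a] * kv[a][c] }
            out[i][c] = acc / norm
        }
    }
    return out
}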

Causal Linear Attention

For autoregressive (causal) attention, each position only attends to past positions:

output[i] = φ(q[i]) @ ( Σⱼ≤ᵢ φ(k[j]) ⊗ v[j] )  /  ( φ(q[i]) · Σⱼ≤ᵢ φ(k[j]) )

Implementation: Uses cumsum to accumulate KV outer products efficiently. Tests MLX's scan operation performance.
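
The same idea written out as a running state in plain Swift (illustrative single-head sketch; the repo accumulates the outer products with cumsum over the sequence axis instead of a loop):

import Foundation

// Causal linear attention for a single head, plain Swift reference.
// A running sum over positions j ≤ i of φ(k_j) ⊗ v_j (and of φ(k_j) for the
// normaliser) plays the role of the cumsum: one O(N·D²) pass, with each
// position seeing only past state.
func causalLinearAttention(q: [[Double]], k: [[Double]], v: [[Double]]) -> [[Double]] {
    let n = q.count, d = k[0].count, dv = v[0].count
    var state = [[Double]](repeating: [Double](repeating: 0, count: dv), count: d) // Σ φ(k_j) ⊗ v_j
    var kSum = [Double](repeating: 0, count: d)                                    // Σ φ(k_j)
    var out = [[Double]](repeating: [Double](repeating: 0, count: dv), count: n)
    for i in 0..<n {
        let phiK = k[i].map { $0 > 0 ? $0 + 1 : exp($0) } // same elu(x) + 1 feature map
        for a in 0..<d {
            kSum[a] += phiK[a]
            for c in 0..<dv { state[a][c] += phiK[a] * v[i][c] }
        }
        let phiQ = q[i].map { $0 > 0 ? $0 + 1 : exp($0) }
        var norm = 0.0
        for a in 0..<d { norm += phiQ[a] * kSum[a] }
        for c in 0..<dv {
            var acc = 0.0
            for a in 0..<d { acc += phiQ[a] * state[a][c] }
            out[i][c] = acc / norm
        }
    }
    return out
}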

Running Tests

Tests are run via xcodebuild (not CMake):

xcodebuild test -scheme AttnBench-Package -destination 'platform=OS X'

Project Structure

attn-bench/
├── CMakeLists.txt              # CMake build configuration
├── Package.swift               # Swift Package Manager manifest
├── LICENSE                     # MIT License
├── Sources/
│   ├── AttnBench/
│   │   └── AttnBench.swift     # Main executable entry point
│   └── AttnBenchLib/
│       └── Attention.swift     # Core attention implementations
│           ├── Dense: MHA, GQA, MQA
│           ├── SlidingWindowAttention
│           ├── BlockSparseAttention
│           ├── LinearAttention
│           └── CausalLinearAttention
└── Tests/
    └── AttnBenchTests/
        └── AttnBenchTests.swift # Unit tests (30+ tests)

Configuration

Modify benchmark parameters in Sources/AttnBench/AttnBench.swift:

let b = 1                              // Batch size
let dModel = 1024                      // Model dimension
let heads = 16                         // Number of attention heads
let seqs = [128, 256, 512, 1024]       // Sequence lengths to test
let iters = 50                         // Iterations per benchmark
let warmup = 10                        // Warmup iterations

// Sparse attention parameters
let windowSizes = [32, 64, 128]        // Sliding window sizes
let blockSizes = [32, 64]              // Block sizes for block-sparse

Interpreting Results

Expected Patterns

  1. Short sequences (N < 256): Dense attention (MHA) is usually fastest due to M-series AMX efficiency with large matrices
  2. Long sequences (N > 512): Sparse methods should start winning
  3. Linear attention: Constant-ish time regardless of N, but with D² overhead
  4. Sliding window: Should show clear O(N·W) scaling

Key Metrics to Watch

  • Crossover point: Where SWA becomes faster than MHA
  • Block size sensitivity: Optimal block size for block-sparse
  • Linear attention overhead: Fixed cost that determines when it's worthwhile

License

MIT

Acknowledgments

  • MLX Swift - Apple's machine learning framework for Apple Silicon
  • Inspired by research on efficient attention mechanisms for long-context LLMs
