Benchmarking dense and sparse attention mechanisms using Swift and MLX on Apple Silicon.
AttnBench provides comprehensive benchmarks comparing the attention variants used in transformer models, from standard dense attention to sparse and efficient mechanisms, to characterize how Apple Silicon (M1/M2/M3/M4) performs across different attention patterns.
- MHA (Multi-Head Attention): Standard attention where each head has its own K and V projections
- GQA (Grouped Query Attention): K and V are shared across groups of query heads
- MQA (Multi-Query Attention): All query heads share a single K and V head (see the head-mapping sketch after this list)
- Sliding Window Attention (SWA): O(N·W) complexity - only attends to a local window around each position
- Block-Sparse Attention: Divides the attention matrix into blocks, computing only local + global patterns
- Linear Attention: O(N) complexity via the kernel trick: Q(K^TV) instead of (QK^T)V
- Causal Linear Attention: O(N) causal attention using cumulative sums
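The three dense variants differ only in how many query heads share each K/V head. A minimal plain-Swift sketch of that mapping (illustrative only, not code from the benchmark; it uses the default 16 query heads):

```swift
// Which K/V head each query head reads from, for the three dense variants.
let heads = 16

for kvHeads in [16, 4, 1] {                 // MHA, GQA(kv=4), MQA(kv=1)
    let groupSize = heads / kvHeads         // query heads sharing one K/V head
    let mapping = (0..<heads).map { $0 / groupSize }
    print("kvHeads=\(kvHeads): \(mapping)")
}
// kvHeads=16 → 0...15                 (every query head has its own K/V head: MHA)
// kvHeads=4  → 0,0,0,0,1,1,1,1,...    (groups of 4 query heads share one K/V head: GQA)
// kvHeads=1  → all zeros              (one K/V head serves every query head: MQA)
```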
Most "sliding window" implementations in high-level frameworks are fake - they just mask the full N×N matrix (still O(N²) compute). This benchmark implements true efficient attention using tensor slicing and gather operations.
Key questions this benchmark helps answer:
- At what sequence length does SWA become faster than MHA on M-series chips?
- Do M4 AMX units prefer dense matmuls or batched small matmuls?
- What's the overhead of linear attention vs the quadratic savings crossover?
- How well does MLX handle gather/slice operations vs dense compute?
| Mechanism | Complexity | MLX Challenge |
|---|---|---|
| Sliding Window | O(N·W) | Tests gather/slice performance vs dense on AMX |
| Block-Sparse | O(N·B + N·G) | Tests batched small matmuls vs one giant matmul |
| Linear Attention | O(N·D²) | Tests numerical stability with feature maps |
| Causal Linear | O(N·D²) | Tests cumsum scan operation efficiency |
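To see where these complexities cross over on paper, here is a rough back-of-the-envelope multiply count per head, counting one matmul per method and treating D as the per-head dimension (dModel / heads = 64 under the default configuration below); this is a sketch, not a measurement:

```swift
// Rough per-head multiply counts for the score/value stage only; projections,
// softmax, and memory traffic are ignored, so real crossovers will differ.
let n = 1024          // sequence length
let headDim = 64      // dModel / heads = 1024 / 16
let window = 64       // sliding window size

let dense   = n * n * headDim          // O(N² · d): full QKᵀ score matrix
let sliding = n * window * headDim     // O(N · W · d): scores only inside the window
let linear  = n * headDim * headDim    // O(N · d²): build KᵀV once, apply per query

print(dense, sliding, linear)          // 67108864, 4194304, 4194304
// On paper the windowed/linear variants are ~16x cheaper at N = 1024, but gather,
// reshape, and small-matmul overheads decide the crossover that AttnBench measures.
```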
- macOS 13.3 or later
- Apple Silicon (M1/M2/M3/M4)
- Xcode (required for Metal shader compilation)
- CMake and Ninja (for building)
Install CMake and Ninja with Homebrew:

```
brew install cmake ninja
```

Xcode is required to compile Metal shaders. Install it from the Mac App Store, then:

```
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
xcodebuild -downloadComponent MetalToolchain
```

Build with CMake and Ninja:

```
mkdir build && cd build
cmake .. -G Ninja
ninja
```

Run the benchmark:

```
./build/AttnBench
```

Results are printed as CSV:

```
name,b,n,dModel,heads,kvHeads,iters,msPerIter
MHA,1,128,1024,16,16,50,1.4981
GQA(kv=4),1,128,1024,16,4,50,1.1278
MQA(kv=1),1,128,1024,16,1,50,1.3023
SWA(w=32),1,128,1024,16,16,50,2.1456
SWA(w=64),1,128,1024,16,16,50,2.8934
BlockSparse(bs=32),1,128,1024,16,16,50,1.7823
LinearAttn,1,128,1024,16,16,50,0.9234
CausalLinearAttn,1,128,1024,16,16,50,1.4521
...
```
| Column | Description |
|---|---|
| name | Attention mechanism type |
| b | Batch size |
| n | Sequence length |
| dModel | Model dimension |
| heads | Number of query heads |
| kvHeads | Number of key/value heads |
| iters | Number of iterations |
| msPerIter | Milliseconds per iteration |
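For post-processing, here is a hypothetical helper (not part of the repo; `BenchRow` and `parseBenchCSV` are illustrative names) that parses the CSV into typed rows, e.g. for plotting msPerIter against sequence length:

```swift
import Foundation

// Parse AttnBench's CSV output into typed rows.
struct BenchRow {
    let name: String
    let b, n, dModel, heads, kvHeads, iters: Int
    let msPerIter: Double
}

func parseBenchCSV(_ text: String) -> [BenchRow] {
    text.split(separator: "\n")
        .dropFirst()                                // skip the header line
        .compactMap { line -> BenchRow? in
            let f = line.split(separator: ",").map(String.init)
            guard f.count == 8,
                  let b = Int(f[1]), let n = Int(f[2]), let dModel = Int(f[3]),
                  let heads = Int(f[4]), let kvHeads = Int(f[5]),
                  let iters = Int(f[6]), let ms = Double(f[7]) else { return nil }
            return BenchRow(name: f[0], b: b, n: n, dModel: dModel, heads: heads,
                            kvHeads: kvHeads, iters: iters, msPerIter: ms)
        }
}
```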
Instead of computing the full N×N attention matrix:
Standard: scores[i, j] for all i, j ∈ [0, N) → O(N²)
SWA only computes scores within a window:
SWA: scores[i, j] for j ∈ [i - W/2, i + W/2] → O(N·W)
Implementation: Uses padding + slicing to construct overlapping windows, avoiding the full matrix computation. This tests whether MLX's gather operations are efficient enough on Apple Silicon.
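For reference, here is a scalar, loop-based sketch of single-head, non-causal sliding-window attention in plain Swift. It is not the MLX padding/slicing code from the repo; it just makes the O(N·W·D) work pattern explicit:

```swift
import Foundation

// Each query i scores only keys j in [i - W/2, i + W/2], so work is O(N·W·D).
func slidingWindowAttention(q: [[Double]], k: [[Double]], v: [[Double]],
                            window: Int) -> [[Double]] {
    let n = q.count, d = q[0].count
    let half = window / 2
    var out = [[Double]](repeating: [Double](repeating: 0, count: d), count: n)

    for i in 0..<n {
        let lo = max(0, i - half), hi = min(n - 1, i + half)

        // Scaled dot-product scores, computed only inside the window.
        let raw = (lo...hi).map { j in
            zip(q[i], k[j]).reduce(0.0) { $0 + $1.0 * $1.1 } / sqrt(Double(d))
        }

        // Softmax over the window only.
        let m = raw.max()!
        let exps = raw.map { exp($0 - m) }
        let z = exps.reduce(0, +)

        // Weighted sum of the windowed values.
        for (e, j) in zip(exps, lo...hi) {
            let w = e / z
            for c in 0..<d { out[i][c] += w * v[j][c] }
        }
    }
    return out
}
```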
Divides the sequence into blocks and computes:
- Local attention: Each block attends only to itself
- Global attention: First G tokens attend to/from all positions (sink tokens)
```
┌───┬───┬───┬───┐
│ ■ │ ■ │ ■ │ ■ │  ← Global tokens attend everywhere
├───┼───┼───┼───┤
│ ■ │ ■ │   │   │  ← Local block attention + global column
├───┼───┼───┼───┤
│ ■ │   │ ■ │   │
├───┼───┼───┼───┤
│ ■ │   │   │ ■ │
└───┴───┴───┴───┘
```
Implementation: Reshapes into blocks [B, H, N/Block, Block, D] and performs block-wise matmuls. Tests whether M-series chips prefer fewer large kernels or many small ones.
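To make the pattern concrete, the following illustrative loop (hypothetical sizes, not code from the repo) enumerates which (query block, key block) tiles actually get a matmul when the first block holds the global/sink tokens:

```swift
// Enumerate the block pairs computed by a local + global pattern.
// Illustrative sizes only; the benchmark sweeps blockSizes = [32, 64].
let n = 128, blockSize = 32
let numBlocks = n / blockSize

var computed = 0
for qb in 0..<numBlocks {
    for kb in 0..<numBlocks {
        let local  = (qb == kb)              // each block attends to itself
        let global = (qb == 0 || kb == 0)    // first block = global/sink tokens
        if local || global { computed += 1 }
    }
}
print("\(computed) of \(numBlocks * numBlocks) blocks computed")
// 10 of 16 at these tiny sizes; for realistic N/blockSize the computed fraction
// shrinks roughly as 3/numBlocks, giving the O(N·B + N·G) cost in the table above.
```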
Rewrites softmax attention using kernel feature maps:
Standard: Attn = softmax(QK^T / √d) @ V → O(N²)
Linear: Attn ≈ φ(Q) @ (φ(K)^T @ V) / norm → O(N·D²)
The key insight: computing φ(K)^T @ V first gives [D, D] instead of [N, N].
Feature map: Uses ELU(x) + 1 to ensure non-negative values (required for valid attention weights).
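A scalar sketch of the factorization for a single head, using the elu(x) + 1 feature map described above; plain Swift loops stand in for the vectorized MLX implementation:

```swift
import Foundation

// elu(x) + 1: x + 1 for x > 0, exp(x) for x <= 0 — always positive.
func eluPlusOne(_ x: Double) -> Double { x > 0 ? x + 1 : exp(x) }

func linearAttention(q: [[Double]], k: [[Double]], v: [[Double]]) -> [[Double]] {
    let n = q.count, d = q[0].count
    let phiQ = q.map { $0.map(eluPlusOne) }
    let phiK = k.map { $0.map(eluPlusOne) }

    // kv[a][b] = Σ_j φ(k_j)[a] · v_j[b]  → a D×D matrix, built in O(N·D²).
    var kv = [[Double]](repeating: [Double](repeating: 0, count: d), count: d)
    var kSum = [Double](repeating: 0, count: d)   // Σ_j φ(k_j), for normalization
    for j in 0..<n {
        for a in 0..<d {
            kSum[a] += phiK[j][a]
            for b in 0..<d { kv[a][b] += phiK[j][a] * v[j][b] }
        }
    }

    // out_i = φ(q_i)ᵀ · kv / (φ(q_i) · Σφ(k)) — no N×N matrix is ever formed.
    var out = [[Double]](repeating: [Double](repeating: 0, count: d), count: n)
    for i in 0..<n {
        let norm = zip(phiQ[i], kSum).reduce(0.0) { $0 + $1.0 * $1.1 }
        for b in 0..<d {
            var acc = 0.0
            for a in 0..<d { acc += phiQ[i][a] * kv[a][b] }
            out[i][b] = acc / norm
        }
    }
    return out
}
```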
For autoregressive (causal) attention, each position only attends to past positions:
output[i] = φ(q[i])ᵀ · Σⱼ≤ᵢ (φ(k[j]) ⊗ v[j]) / norm
Implementation: Uses cumsum to accumulate KV outer products efficiently. Tests MLX's scan operation performance.
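A matching sketch of the causal variant: running sums of φ(k[j]) ⊗ v[j] and φ(k[j]) ensure position i only sees j ≤ i, which is what a cumsum over the sequence axis computes in vectorized form. It reuses eluPlusOne from the sketch above:

```swift
// Causal linear attention via running (prefix) sums; reuses eluPlusOne(_:).
func causalLinearAttention(q: [[Double]], k: [[Double]], v: [[Double]]) -> [[Double]] {
    let n = q.count, d = q[0].count
    let phiQ = q.map { $0.map(eluPlusOne) }
    let phiK = k.map { $0.map(eluPlusOne) }

    var kv = [[Double]](repeating: [Double](repeating: 0, count: d), count: d)
    var kSum = [Double](repeating: 0, count: d)
    var out = [[Double]](repeating: [Double](repeating: 0, count: d), count: n)

    for i in 0..<n {
        // Accumulate the i-th outer product before reading, so sums cover j ≤ i.
        for a in 0..<d {
            kSum[a] += phiK[i][a]
            for b in 0..<d { kv[a][b] += phiK[i][a] * v[i][b] }
        }
        let norm = zip(phiQ[i], kSum).reduce(0.0) { $0 + $1.0 * $1.1 }
        for b in 0..<d {
            var acc = 0.0
            for a in 0..<d { acc += phiQ[i][a] * kv[a][b] }
            out[i][b] = acc / norm
        }
    }
    return out
}
```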
Tests are run via xcodebuild (not CMake):
```
xcodebuild test -scheme AttnBench-Package -destination 'platform=OS X'
```

Repository layout:

```
attn-bench/
├── CMakeLists.txt                  # CMake build configuration
├── Package.swift                   # Swift Package Manager manifest
├── LICENSE                         # MIT License
├── Sources/
│   ├── AttnBench/
│   │   └── AttnBench.swift         # Main executable entry point
│   └── AttnBenchLib/
│       └── Attention.swift         # Core attention implementations
│           ├── Dense: MHA, GQA, MQA
│           ├── SlidingWindowAttention
│           ├── BlockSparseAttention
│           ├── LinearAttention
│           └── CausalLinearAttention
└── Tests/
    └── AttnBenchTests/
        └── AttnBenchTests.swift    # Unit tests (30+ tests)
```
Modify benchmark parameters in Sources/AttnBench/AttnBench.swift:
```swift
let b = 1                            // Batch size
let dModel = 1024                    // Model dimension
let heads = 16                       // Number of attention heads
let seqs = [128, 256, 512, 1024]     // Sequence lengths to test
let iters = 50                       // Iterations per benchmark
let warmup = 10                      // Warmup iterations

// Sparse attention parameters
let windowSizes = [32, 64, 128]      // Sliding window sizes
let blockSizes = [32, 64]            // Block sizes for block-sparse
```

What to expect:

- Short sequences (N < 256): Dense attention (MHA) is usually fastest due to M-series AMX efficiency with large matrices
- Long sequences (N > 512): Sparse methods should start winning
- Linear attention: Time grows only linearly with N (roughly flat next to the quadratic methods), but carries a fixed D² per-token overhead
- Sliding window: Should show clear O(N·W) scaling
- Crossover point: Where SWA becomes faster than MHA
- Block size sensitivity: Optimal block size for block-sparse
- Linear attention overhead: Fixed cost that determines when it's worthwhile
References:

- Sliding Window Attention
- Block-Sparse Attention (MIT-HAN Lab)
- Flash Linear Attention
- Linear Transformers (Katharopoulos et al.)
License: MIT

Acknowledgments:

- MLX Swift - Apple's machine learning framework for Apple Silicon
- Inspired by research on efficient attention mechanisms for long-context LLMs