LLM inference simulator for analyzing serving systems. Simulates GPU clusters serving LLM inference workloads with realistic performance modeling.
- Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
- Multiple Scheduling Policies: FCFS, Priority, SJF, and more
- Chunked Prefill: Simulates realistic request interleaving
- KV Cache Management: Models GPU memory and KV cache utilization (see the sizing sketch after this list)
- Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
- WebAssembly Support: Run simulations in the browser via WASM
- CLI Tool: Standalone binary for command-line usage
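The KV cache modeling above comes down to the standard per-token footprint of a grouped-query-attention transformer: a K and a V tensor per layer, sized by the KV heads. The helper below is a minimal, hypothetical sketch of that arithmetic (not the simulator's internal API), using the Llama-3-70B numbers from the example config further down.

```rust
/// Illustrative only: per-token KV cache size for a grouped-query-attention model.
/// K and V each store `num_kv_heads * head_dim` values per layer.
fn kv_bytes_per_token(
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    bytes_per_param: u64,
) -> u64 {
    2 * num_layers * num_kv_heads * head_dim * bytes_per_param
}

fn main() {
    // Llama-3-70B-style numbers from the example config below:
    // 80 layers, 8 KV heads, head_dim = hidden_dim / num_heads = 8192 / 64 = 128, fp16.
    let per_token = kv_bytes_per_token(80, 8, 128, 2);
    println!("{} KiB per token", per_token / 1024); // 320 KiB per token
}
```

At roughly 320 KiB per token, a single 8K-token sequence occupies about 2.5 GiB of an 80 GiB card, which is why KV cache utilization is tracked as its own metric.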
inference-lab uses discrete-event simulation to model the behavior of a
multi-GPU node serving LLM inference requests with the vLLM library. It
contains a facsimile of the vLLM queueing, scheduling, and execution logic,
with only the actual model inference replaced by a performance model based on
the supplied GPU specs and model architecture.
Within each simulation step, the simulator:
- Processes any newly arrived requests, adding them to the scheduling queue.
- Schedules requests to serve based on the selected scheduling policy.
- Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute that workload on the specified hardware (see the timing sketch after this list).
- Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
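One common way to turn the compute and bandwidth figures from the third step into a step duration is a roofline-style estimate: the batch takes the longer of its compute-bound and memory-bound times. The helper below is an illustrative sketch of that idea, not necessarily inference-lab's exact formula, using the H100 numbers from the example config.

```rust
/// Illustrative roofline-style step-time estimate (hypothetical helper, not
/// the real inference-lab API): a batch is bound by whichever of compute or
/// memory bandwidth takes longer on the given hardware.
fn step_time_seconds(
    batch_flops: f64,       // total FLOPs for the scheduled prefill/decode work
    batch_bytes_moved: f64, // weights + KV cache bytes read for this step
    peak_flops: f64,        // e.g. 1.513e15 for an H100 (from the config)
    memory_bandwidth: f64,  // e.g. 3.35e12 B/s for an H100
) -> f64 {
    let compute_time = batch_flops / peak_flops;
    let memory_time = batch_bytes_moved / memory_bandwidth;
    compute_time.max(memory_time)
}

fn main() {
    // A decode-heavy step: little compute, but the full 70B fp16 weights are read.
    let weights_bytes = 70e9 * 2.0;
    let t = step_time_seconds(1e12, weights_bytes, 1.513e15, 3.35e12);
    println!("decode step ≈ {:.2} ms", t * 1e3); // memory-bound: ~41.8 ms
}
```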
Caveats:
- This assumes perfectly optimized GPU execution, ignoring kernel launch overheads, poorly optimized kernels, application overhead, thermals, etc.
- We simulate tensor parallel execution, but don't model multi-GPU communication overheads.
```bash
# Rust library
cargo add inference-lab

# JavaScript / WASM package
npm install @doublewordai/inference-lab

# CLI tool
cargo install inference-lab
```

Note: The CLI tool is only available if you install it using `cargo install inference-lab` (see above).
```bash
# Run with default configuration
inference-lab --config configs/config.toml

# Example output shows TTFT, E2E latency, throughput, and utilization metrics
```

```rust
use inference_lab::simulation::Simulator;
use inference_lab::config::SimulationConfig;

let config = SimulationConfig::from_file("config.toml")?;
let mut simulator = Simulator::new(config);
let results = simulator.run();

println!("Mean TTFT: {:.2}ms", results.ttft_mean * 1000.0);
println!("P99 E2E: {:.2}ms", results.e2e_p99 * 1000.0);
println!("Throughput: {:.1} tok/s", results.throughput);
```

```javascript
import init, { run_simulation } from '@doubleword/inference-lab';
await init();
const config = {
hardware: {
name: "H100",
compute_flops: 1.513e15,
memory_bandwidth: 3.35e12,
memory_capacity: 85899345920,
bytes_per_param: 2
},
model: {
name: "Llama-3-70B",
num_parameters: 70000000000,
num_layers: 80,
hidden_dim: 8192,
num_heads: 64,
num_kv_heads: 8,
max_seq_len: 8192
},
scheduler: {
max_num_batched_tokens: 8192,
max_num_seqs: 256,
policy: "fcfs",
enable_chunked_prefill: true,
block_size: 16
},
workload: {
arrival_pattern: "poisson",
arrival_rate: 5.0,
num_requests: 400,
seed: 42,
input_len_dist: {
type: "lognormal",
mean: 6.9,
std_dev: 0.7
},
output_len_dist: {
type: "lognormal",
mean: 5.3,
std_dev: 0.8
}
}
};
const results = run_simulation(JSON.stringify(config));
console.log('TTFT P50:', results.metrics.ttft_p50);
console.log('Throughput:', results.metrics.output_tokens_per_sec);
```

Configuration files use TOML format and specify:
- Hardware: GPU specs (FLOPS, bandwidth, VRAM)
- Model: LLM architecture (parameters, layers, heads)
- Scheduler: Policies, max tokens, chunked prefill settings
- Workload: Request arrival patterns and distributions
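A sketch of what such a file can look like, mirroring the JavaScript config above; the key names here are inferred from that example, so treat the shipped files in configs/ as the authoritative schema.

```toml
# Illustrative config sketch; see configs/config.toml for the authoritative schema.
[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 400
seed = 42
input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }
output_len_dist = { type = "lognormal", mean = 5.3, std_dev = 0.8 }
```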
Example configurations are in the configs/ directory:
- `config.toml` - Default H100 + Llama-3-70B setup
- `test_blog.toml` - Closed-loop benchmark (64 users)
- `qwen3_30b_a3b.toml` - Qwen model configuration
```bash
cargo build --release
./target/release/inference-lab --config configs/config.toml
```

```bash
npm run build
# Outputs to pkg/ directory
```

```bash
# Publish to npm (requires authentication)
npm run build
npm publish --access public

# Publish Rust crate
cargo publish
```

```
inference-lab/
├── src/
│   ├── simulation/   # Core simulator logic
│   ├── scheduler/    # Scheduling policies (FCFS, Priority, SJF)
│   ├── compute/      # Performance calculations
│   ├── kv_cache/     # KV cache management
│   ├── request/      # Request generation and tracking
│   ├── metrics/      # Performance metrics collection
│   ├── config/       # Configuration structures
│   ├── lib.rs        # Library root
│   ├── main.rs       # CLI entry point
│   └── wasm.rs       # WebAssembly bindings
├── configs/          # Example configurations
├── Cargo.toml        # Rust package manifest
└── package.json      # npm package manifest
```
The simulator tracks:
- TTFT (Time to First Token): Prefill latency
- E2E (End-to-End): Total request latency
- TPOT (Time Per Output Token): Decode latency per token (see the sketch below)
- Throughput: Tokens generated per second
- Utilization: Compute and memory bandwidth usage
- KV Cache: Memory utilization over time
Results include percentiles (p50, p90, p95, p99) and means.
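Of these, TPOT is conventionally derived from TTFT and E2E by averaging the decode time over the tokens after the first. The helper below shows that standard definition as a minimal sketch; it is not necessarily the simulator's exact bookkeeping.

```rust
/// Standard TPOT definition: decode latency averaged over the generated tokens
/// after the first one. (Illustrative helper, not the inference-lab API.)
fn tpot_seconds(e2e: f64, ttft: f64, output_tokens: u64) -> Option<f64> {
    if output_tokens > 1 {
        Some((e2e - ttft) / (output_tokens as f64 - 1.0))
    } else {
        None // a single-token response has no decode phase to average over
    }
}

fn main() {
    // e.g. a request with 1.2 s TTFT, 9.2 s end-to-end, and 200 output tokens:
    if let Some(tpot) = tpot_seconds(9.2, 1.2, 200) {
        println!("TPOT ≈ {:.1} ms/token", tpot * 1e3); // ≈ 40.2 ms/token
    }
}
```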
License: MIT