GAMMA: Game Analyzing Model Methods Attentively
An interactive game that teaches you how LLMs work by letting you predict what they'll say next.
Play in your browser - No installation required!
See AGENTS.md for the active code-writing agent profile.
The project has evolved to provide tools for experimenting with and benchmarking local models in a variety of ways.
Try to guess which word the AI will choose next. See the probabilities in real-time. Learn how temperature, top-k, and sampling actually work by playing with them.
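Under the hood this is ordinary next-token sampling: the logits are divided by the temperature, optionally restricted to the top-k candidates, and renormalized into a probability distribution. A minimal sketch of that math (illustrative only, not GAMMA's actual implementation):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.9, top_k: int = 40) -> int:
    """Standard temperature + top-k sampling over one vector of next-token logits."""
    scaled = logits / max(temperature, 1e-6)           # <1 sharpens the distribution, >1 flattens it
    if 0 < top_k < len(scaled):
        cutoff = np.sort(scaled)[-top_k]                # keep only the k highest-scoring tokens
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())               # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))   # draw one token id from the distribution
```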
Watch multiple models collaborate on the same response, swapping control dynamically based on confidence, patterns, or strategy.
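One simple way such a handoff can be decided (a sketch of the idea, not GAMMA's internal logic): every model scores the next token, and control passes to another model only when the active model's confidence (the probability of its top token) drops below a threshold.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pick_controller(per_model_logits: list[np.ndarray], active: int, threshold: float = 0.5) -> int:
    """Keep the active model while it is confident; otherwise swap to the most confident one."""
    confidence = [softmax(l).max() for l in per_model_logits]   # top-token probability per model
    if confidence[active] >= threshold:
        return active
    return int(np.argmax(confidence))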
Describe what you want to do, and GAMMA generates the command (either with a local model or an agentic CLI such as Claude Code).
"I want to play with Gemma 2B using temperature 0.9"
python gamma.py game --engine pytorch --model google/gemma-2-2b-it --temperature 0.9
"Compare Qwen and DeepSeek on a coding prompt"
python gamma.py game --comparison \
--comparison-models \
ollama:qwen3-coder:30b \
ollama:deepseek-r1:32b \
--prompt "Write a Python function to calculate fibonacci""Meld Gemma models with dynamic blending"
python tools/run_mind_meld_cli.py gemma-1b gemma-2b --blend dynamic
"Run the creative preset with a custom prompt"
python tools/run_mind_meld_cli.py --preset creative --prompt "Once upon a time"
# Install
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt # Base dependencies
# Install engine of your choice:
pip install -r requirements-pytorch.txt # HuggingFace/Transformers (recommended)
pip install -r requirements-llamacpp.txt # GGUF models
pip install -r requirements-mlx.txt # Apple Silicon (fastest on Mac)
pip install -r requirements-vllm.txt # High-throughput (NVIDIA GPU)
# Play
python gamma.py game
See Engine Documentation for all engine options including CUDA/ROCm.
GAMMA also auto-detects your Ollama models and HuggingFace cache.
See Game Documentation for more details.
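Detection of this kind typically just reads the standard local caches. The sketch below shows one way it can be done with huggingface_hub's cache scanner and Ollama's local HTTP API; the paths and endpoint are those tools' defaults, not necessarily how GAMMA implements it.

```python
import json
import urllib.request
from huggingface_hub import scan_cache_dir   # ships with the base HF dependencies

def list_local_models() -> tuple[list[str], list[str]]:
    """Return cached HuggingFace repo ids and locally pulled Ollama model names."""
    hf_models = [repo.repo_id for repo in scan_cache_dir().repos]   # reads ~/.cache/huggingface/hub
    try:
        # Ollama's default local API; equivalent to what `ollama list` prints
        with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as resp:
            ollama_models = [m["name"] for m in json.load(resp)["models"]]
    except OSError:
        ollama_models = []   # Ollama not installed or not running
    return hf_models, ollama_models
```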
GAMMA supports multiple inference engines for running local LLMs. Engine status:
See also: Model Formats & Engines
| Engine | Backend | Models | Notes |
|---|---|---|---|
| PyTorch | MPS (Mac) / CUDA | HuggingFace models | Full feature support, recommended for HF models |
| MLX | Metal (Apple Silicon) | MLX-optimized models | Fastest on M1/M2/M3 Macs, ~2x faster than PyTorch MPS |
| LlamaCpp | Metal / CUDA / CPU | GGUF quantized models | Great for quantized models, low memory usage |
| Ollama | llama.cpp | Ollama library | Easy setup, auto-detects installed models |
| Engine | Backend | Status |
|---|---|---|
| JAX/Flax | CPU / TPU | JIT tracing issues with some models |
| vLLM | CUDA | Requires NVIDIA GPU with CUDA; not supported on macOS or ROCm in GAMMA |
| ONNX Runtime | CPU / CUDA / CoreML | Requires ONNX-exported models |
| TensorFlow | CPU / GPU | Limited model support |
# Apple Silicon Mac (fastest)
python gamma.py game --engine mlx --model mlx-community/gemma-2-2b-it-4bit
# Any Mac/Linux with PyTorch
python gamma.py game --engine pytorch --model google/gemma-2-2b-it
# Quantized GGUF models (low memory)
python gamma.py game --engine llamacpp --model models/model.gguf
# Ollama models (use GGUF with llama.cpp for logits)
python gamma.py game --engine llamacpp --model /path/to/ollama-model.gguf
| Engine | Model | Tokens/sec | Latency p50 |
|---|---|---|---|
| MLX | gemma-2-2b-it-4bit | 10.8 | 92ms |
| PyTorch | phi-2 (2.7B) | 5.8 | 146ms |
| LlamaCpp | qwen2-0.5b-q4 | 4.4 | 174ms |
See Engine Documentation and Core Documentation for details.
The game, comparison, and mind-meld modes require real logits (full token probability distributions). Wrapper engines do not expose logits via HTTP APIs, so the CLI will refuse them.
Engines without logits:
- openai
- huggingface_inference
- ollama
If you are using an OpenAI-compatible vLLM server, you still need the native
vllm engine to access logits.
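The difference is in what each backend can return. A locally loaded model exposes the full logit vector at every step, which is exactly what the probability display needs; chat-style HTTP APIs return generated text and at most a few top log-probabilities. A minimal transformers sketch of pulling the full distribution from a local model (illustrative; roughly what a local engine such as the PyTorch one has access to):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"            # any causal LM from the engine tables above
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # one score per vocabulary token
probs = torch.softmax(logits, dim=-1)        # the full next-token distribution, not just a sample
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx)!r}: {p.item():.3f}")
```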
KV cache sharing (Mind Meld) prefers direct transfer when prompt prefixes
match; otherwise it replays the missing suffix through the target model to
rebuild a correct cache. Replay aligns full-token prefixes to avoid tokenizer
boundary drift. KV cache translation remains experimental and is only attempted
when --allow-kv-cache-translation is set; safety checks will skip translation
unless --force-kv-cache-translation is provided, and it still falls back to
replay if translation is incompatible or fails.
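Conceptually, replay only needs to know how many whole tokens the cached prompt and the new prompt share; everything after that point is re-run through the target model. A simplified sketch of the prefix alignment (hypothetical helper, not GAMMA's code):

```python
def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Count the leading token ids two tokenizations have in common (whole tokens only)."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# If the shared prefix covers the whole cached prompt, the KV cache can transfer directly.
# Otherwise only the tokens after the shared prefix are replayed through the target model,
# which is lossless but costs one extra forward pass over that suffix.
```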
# Interactive menu (recommended)
python gamma.py game
# Quick game with defaults
python gamma.py game --engine llamacpp --model models/model.gguf
# Chat
python gamma.py game --chat --model qwen3-coder:30b
# Compare models
python gamma.py game --comparison \
--comparison-models model1 model2
# Mind meld (new CLI with presets and aliases)
python tools/run_mind_meld_cli.py --preset creative
python tools/run_mind_meld_cli.py gemma-1b gemma-2b --blend dynamic --steps 50
python tools/run_mind_meld_cli.py gemma-1b@Optimist gemma-2b@Skeptic --preset debate
# Other common options
--help # Detailed explanation of commands
--temperature 0.7 # Sampling randomness (0.1-2.0)
--top-k 40 # Top-K filtering
--top-p 0.95 # Nucleus sampling
--sampling-strategy sample # sample or argmax/greedy
--steps 50 # Max generation steps
--show-attention # Show attention heatmaps
--verbose # Detailed explanations
--prompt-chat-template # Use chat template for --prompt/--initial-prompt (auto for instruct models)
--no-prompt-chat-template # Force raw --prompt (skip chat template)
--prompt-system "TEXT" # System prompt for chat templates
--no-default-system # Disable the default system prompt
--no-step-delay # Mind Meld: disable per-step delay
--summary-only # Mind Meld: show only final output and brief stats (no live per-round stats)
--max-sentences N # Mind Meld: stop after N sentences in the generated output
--shared-chat-template # Mind Meld: reuse one chat template across models (auto-enabled when templates differ; disable with --no-shared-chat-template)
--stop-text "TEXT" # Mind Meld: stop when generated output contains TEXT (repeatable; common chat end markers are used automatically when templates are applied)
--translate-logits # Mind Meld: translate logits into the next model's vocab during swaps (experimental)
--order-neutral # Mind Meld: alias for --use-weighted-average to reduce swap-order sensitivity
--soft-swap # Mind Meld: blend all models each step but keep swap cadence by boosting the active model
--soft-swap-weight W # Mind Meld: weight multiplier for the active model in --soft-swap (default 1.5)
--force-kv-cache-translation # Mind Meld: force KV cache translation even when safety checks fail (unsafe)
--repetition-penalty 1.1 # Reduce repeated tokens during sampling (>1.0); see the sketch below
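For reference, the --top-p and --repetition-penalty flags correspond to two more standard logit adjustments applied before sampling. A minimal sketch of both (standard formulations, not GAMMA's exact code):

```python
import numpy as np

def adjust_logits(logits: np.ndarray, generated_ids: list[int],
                  top_p: float = 0.95, repetition_penalty: float = 1.1) -> np.ndarray:
    """Apply a repetition penalty, then nucleus (top-p) filtering, to one vector of logits."""
    out = logits.astype(float).copy()
    for tok in set(generated_ids):                      # discourage tokens that were already generated
        out[tok] = out[tok] / repetition_penalty if out[tok] > 0 else out[tok] * repetition_penalty
    probs = np.exp(out - out.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                     # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    keep = np.searchsorted(cumulative, top_p) + 1       # smallest set whose mass reaches top_p
    out[order[keep:]] = -np.inf                         # everything outside the nucleus is excluded
    return out
```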
- Mind Meld: Multi-model collaboration system
- Benchmarks: Performance testing and DREAM suite
- Comparison: Model comparison tools
- Utilities: Profiling, caching, optimization
- Integrations: OpenAI API, LangChain compatibility
MIT - See LICENSE