Skip to main content

Performance Benchmarks

Comprehensive performance analysis of Osaurus compared to other local LLM solutions on Apple Silicon.

Methodology

All benchmarks were conducted under controlled conditions to ensure fair comparison:

  • Hardware: Apple M2 Pro, 32GB RAM
  • macOS Version: 15.5
  • Model: Llama 3.2 3B Instruct (4-bit quantization)
  • Prompt: "Explain quantum computing in simple terms"
  • Settings: Default configuration for each server
  • Measurements: 20-run average, excluding first run

Key Metrics

Time to First Token (TTFT)

The time from request to first token generation—critical for perceived responsiveness.

ServerTTFT (ms)Relative
Ollama33ms1.0×
Osaurus87ms2.6×
LM Studio113ms3.4×

Throughput

Characters generated per second during streaming.

ServerThroughputRelative
LM Studio588 chars/s1.06×
Osaurus554 chars/s1.0×
Ollama430 chars/s0.78×

End-to-End Latency

Total time from request to completion.

ServerTotal TimeRelative
LM Studio1.22s1.0×
Osaurus1.24s1.02×
Ollama1.62s1.33×

Optimization Tips

Based on benchmark results:

  1. For Speed

    • Use 4-bit quantization
    • Choose smaller models (2-3B)
    • Limit context to 2048 tokens
    • Disable logging in production
  2. For Quality

    • Use 8-bit quantization when possible
    • Select 7B+ models
    • Allow full context windows
    • Enable temperature sampling
  3. For Efficiency

    • Keep 1-2 models loaded
    • Use model aliases for quick switching
    • Monitor memory pressure
    • Restart periodically for long-running servers

Conclusions

Osaurus demonstrates:

  • Competitive Performance — Matches or exceeds alternatives in key metrics
  • Efficient Memory Usage — Lower RAM footprint than competitors
  • Consistent Latency — Predictable performance under load
  • Native Optimization — Leverages Apple Silicon effectively

The benchmarks show Osaurus is particularly well-suited for:

  • Production deployments requiring consistent performance
  • Memory-constrained environments
  • High-throughput applications
  • Native macOS integrations

Memory Quality — LoCoMo Benchmark

We evaluate memory quality using the LoCoMo benchmark (ACL 2024) via EasyLocomo. LoCoMo tests how well systems recall facts, events, and relationships from multi-session conversations spanning weeks to months.

Osaurus uses a no-context evaluation mode where the LLM receives no conversation transcript — only the memory context assembled by the retrieval system. The X-Osaurus-Agent-Id header routes each question to the correct agent's memory store. This tests pure memory retrieval quality rather than full-context recall.

LoCoMo Leaderboard

SystemF1 Score
MemU92.09%
CORE88.24%
Human baseline~88%
Memobase85% (temporal)
Mem066.9%
Osaurus (Gemini 2.5 Flash)57.08%
OpenAI Memory52.9%
GPT-3.5-turbo-16K (no memory)37.8%
GPT-4-turbo (no memory)~32%

Osaurus Breakdown by Category

CategoryCountF1 Score
Open-domain84161.44%
Adversarial44690.36%
Multi-hop28241.94%
Temporal32123.16%
Single-hop9622.10%
Overall1,98657.08%

Running the Benchmark

# 1. Set up EasyLocomo (clones repo, applies patch, creates venv)
make bench-setup

# 2. Configure .env in benchmarks/EasyLocomo/
echo 'OPENAI_API_KEY=osaurus' > benchmarks/EasyLocomo/.env
echo 'OPENAI_API_BASE=http://localhost:1337/v1' >> benchmarks/EasyLocomo/.env

# 3. Ingest LoCoMo data (full extraction — takes several hours, only needed once)
make bench-ingest

# 4. Fast chunk re-ingestion (no LLM calls — use after code changes)
make bench-ingest-chunks

# 5. Run evaluation
make bench-run
tip

You may want to temporarily increase token budgets in the memory configuration file (~/.osaurus/config/memory.json) before running benchmarks. The default production budgets are tuned for everyday use, not maximal recall.


To contribute benchmarks, join the Discord community.