Performance Benchmarks

Name: Osaurus
Author: Osaurus

Comprehensive performance analysis of Osaurus compared to other local LLM solutions on Apple Silicon.

Methodology

All benchmarks were conducted under controlled conditions to ensure fair comparison:

Hardware: Apple M2 Pro, 32GB RAM
macOS Version: 15.5
Model: Llama 3.2 3B Instruct (4-bit quantization)
Prompt: "Explain quantum computing in simple terms"
Settings: Default configuration for each server
Measurements: 20-run average, excluding first run

Key Metrics

Time to First Token (TTFT)

The time from request to first token generation—critical for perceived responsiveness.

Server	TTFT (ms)	Relative
Ollama	33ms	1.0×
Osaurus	87ms	2.6×
LM Studio	113ms	3.4×

Throughput

Characters generated per second during streaming.

Server	Throughput	Relative
LM Studio	588 chars/s	1.06×
Osaurus	554 chars/s	1.0×
Ollama	430 chars/s	0.78×

End-to-End Latency

Total time from request to completion.

Server	Total Time	Relative
LM Studio	1.22s	1.0×
Osaurus	1.24s	1.02×
Ollama	1.62s	1.33×

Optimization Tips

Based on benchmark results:

For Speed
- Use 4-bit quantization
- Choose smaller models (2-3B)
- Limit context to 2048 tokens
- Disable logging in production
For Quality
- Use 8-bit quantization when possible
- Select 7B+ models
- Allow full context windows
- Enable temperature sampling
For Efficiency
- Keep 1-2 models loaded
- Use model aliases for quick switching
- Monitor memory pressure
- Restart periodically for long-running servers

Conclusions

Osaurus demonstrates:

Competitive Performance — Matches or exceeds alternatives in key metrics
Efficient Memory Usage — Lower RAM footprint than competitors
Consistent Latency — Predictable performance under load
Native Optimization — Leverages Apple Silicon effectively

The benchmarks show Osaurus is particularly well-suited for:

Production deployments requiring consistent performance
Memory-constrained environments
High-throughput applications
Native macOS integrations

Memory Quality — LoCoMo Benchmark

We evaluate memory quality using the LoCoMo benchmark (ACL 2024) via EasyLocomo. LoCoMo tests how well systems recall facts, events, and relationships from multi-session conversations spanning weeks to months.

Osaurus uses a no-context evaluation mode where the LLM receives no conversation transcript — only the memory context assembled by the retrieval system. The X-Osaurus-Agent-Id header routes each question to the correct agent's memory store. This tests pure memory retrieval quality rather than full-context recall.

LoCoMo Leaderboard

System	F1 Score
MemU	92.09%
CORE	88.24%
Human baseline	~88%
Memobase	85% (temporal)
Mem0	66.9%
Osaurus (Gemini 2.5 Flash)	57.08%
OpenAI Memory	52.9%
GPT-3.5-turbo-16K (no memory)	37.8%
GPT-4-turbo (no memory)	~32%

Osaurus Breakdown by Category

Category	Count	F1 Score
Open-domain	841	61.44%
Adversarial	446	90.36%
Multi-hop	282	41.94%
Temporal	321	23.16%
Single-hop	96	22.10%
Overall	1,986	57.08%

Running the Benchmark

# 1. Set up EasyLocomo (clones repo, applies patch, creates venv)
make bench-setup

# 2. Configure .env in benchmarks/EasyLocomo/
echo 'OPENAI_API_KEY=osaurus' > benchmarks/EasyLocomo/.env
echo 'OPENAI_API_BASE=http://localhost:1337/v1' >> benchmarks/EasyLocomo/.env

# 3. Ingest LoCoMo data (full extraction — takes several hours, only needed once)
make bench-ingest

# 4. Fast chunk re-ingestion (no LLM calls — use after code changes)
make bench-ingest-chunks

# 5. Run evaluation
make bench-run

tip

You may want to temporarily increase token budgets in the memory configuration file (~/.osaurus/config/memory.json) before running benchmarks. The default production budgets are tuned for everyday use, not maximal recall.

To contribute benchmarks, join the Discord community.

Methodology​

Key Metrics​

Time to First Token (TTFT)​

Throughput​

End-to-End Latency​

Optimization Tips​

Conclusions​

Memory Quality — LoCoMo Benchmark​

LoCoMo Leaderboard​

Osaurus Breakdown by Category​

Running the Benchmark​