Verifier-driven framework for improving local LLM reliability with adaptive inference.
ReliabilityLoop evaluates whether model outputs actually work (not just look plausible), then applies runtime strategies to improve reliability under cost and latency constraints.
- Executable reliability checks: JSON schema, SQL execution, Python unit tests
- Policy routing: choose `baseline_first` or `contract_first` per task type
- Adaptive compute: per-task `best-of-k` and per-task token budgets
- Verified memory reuse: reuse proven outputs from earlier runs (`wins.jsonl`)
- Reproducible artifacts: every run outputs `summary.json`, `leaderboard.md`, `samples.jsonl`, `wins.jsonl`
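
To illustrate how an executable check and best-of-k can fit together, here is a minimal sketch. It is not ReliabilityLoop's implementation: the function names `sql_executes` and `best_of_k`, the toy schema, and the hard-coded candidate queries are all assumptions made for the example. The idea is that a candidate counts only if it runs and returns the expected rows, and sampling stops at the first candidate that passes.

```python
import sqlite3
from typing import Callable, Iterable, Optional


def sql_executes(query: str, schema: str, expected_rows: list) -> bool:
    """Hypothetical verifier: run the query on a fresh in-memory DB and compare rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        rows = conn.execute(query).fetchall()
        return rows == expected_rows
    except sqlite3.Error:
        return False  # a query that fails to execute does not count as working
    finally:
        conn.close()


def best_of_k(
    candidates: Iterable[str], verify: Callable[[str], bool], k: int
) -> Optional[str]:
    """Return the first of up to k candidates that passes the verifier, else None."""
    for i, candidate in enumerate(candidates):
        if i >= k:
            break
        if verify(candidate):
            return candidate
    return None


SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (name) VALUES ('ada'), ('grace');
"""

# Stand-ins for model samples (assumed for illustration).
samples = [
    "SELECT nam FROM users ORDER BY id",   # looks plausible, but the column does not exist
    "SELECT name FROM users ORDER BY id",  # executes and returns the expected rows
]


def check(query: str) -> bool:
    return sql_executes(query, SCHEMA, [("ada",), ("grace",)])


winner = best_of_k(samples, check, k=2)
print(winner)
```

With `k=1` the same call returns `None`, because only the first (broken) sample is tried; that is the trade-off a per-task `best-of-k` setting controls.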
Canonical split: eval/reliability_v1_60.jsonl
- 20 JSON tasks
- 20 SQL tasks
- 20 code tasks
Spec: eval/RELIABILITY_V1_SPEC.md
From examples/leaderboard_60_baseline.md on qwen2.5-coder:0.5b:
| model | policy reliability | json | sql | code | policy latency (s) |
|---|---|---|---|---|---|
| qwen2.5-coder:0.5b | 0.867 | 1.000 | 1.000 | 0.600 | 2.428 |
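
The overall figure is consistent with a pooled pass rate over the three categories. The short check below assumes that policy reliability is computed this way, with 20 tasks per category as in the canonical split; the spec file remains the authoritative definition.

```python
# Per-category pass rates taken from the leaderboard row above.
pass_rates = {"json": 1.000, "sql": 1.000, "code": 0.600}
tasks_per_category = 20  # from the canonical split: 20 JSON, 20 SQL, 20 code

# Assumption: overall reliability = total passed tasks / total tasks.
passed = sum(rate * tasks_per_category for rate in pass_rates.values())
total = tasks_per_category * len(pass_rates)
overall = passed / total

print(f"{overall:.3f}")  # 0.867
```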

Install into a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"

Run the canonical split:

    reliabilityloop reliability \
      --backend ollama \
      --model qwen2.5-coder:0.5b \
      --prompts-file eval/reliability_v1_60.jsonl \
      --limit 60 \
      --max-tokens 96 \
      --policy-json contract_first \
      --policy-sql baseline_first \
      --policy-code baseline_first

Outputs are saved to `eval/reliability_runs/<timestamp>/`.
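
After a run, it can be handy to find the newest run directory and confirm that the expected artifacts are present. The sketch below is an assumption-laden helper, not part of the tool: `latest_run` and `missing_artifacts` are hypothetical names, and the newest run is chosen by modification time so that no particular timestamp format is assumed.

```python
from pathlib import Path
from typing import Optional

# Artifact names listed in this README.
EXPECTED = ("summary.json", "leaderboard.md", "samples.jsonl", "wins.jsonl")


def latest_run(runs_root: Path) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none."""
    if not runs_root.is_dir():
        return None
    runs = [p for p in runs_root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def missing_artifacts(run_dir: Path) -> list:
    """List expected artifact files that are absent from run_dir."""
    return [name for name in EXPECTED if not (run_dir / name).is_file()]


run = latest_run(Path("eval/reliability_runs"))
if run is None:
    print("no runs found")
else:
    print(run, "missing:", missing_artifacts(run))
```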

Set per-task token budgets and best-of-k explicitly:

    reliabilityloop reliability \
      --backend ollama \
      --model qwen2.5-coder:0.5b \
      --prompts-file eval/reliability_v1_60.jsonl \
      --limit 60 \
      --max-tokens 96 \
      --max-tokens-json 256 \
      --best-of-k 1 \
      --policy-json contract_first \
      --policy-sql baseline_first \
      --policy-code baseline_first

Reuse verified outputs from an earlier run:

    # first run
    RUN_A=$(reliabilityloop reliability \
      --backend ollama \
      --model qwen2.5-coder:0.5b \
      --prompts-file eval/reliability_v1_60.jsonl \
      --limit 60 | sed -n 's/^- outdir: //p')

    # second run with memory
    reliabilityloop reliability \
      --backend ollama \
      --model qwen2.5-coder:0.5b \
      --prompts-file eval/reliability_v1_60.jsonl \
      --limit 60 \
      --memory-file "$RUN_A/wins.jsonl" \
      --memory-top-k 2

- RELEASE.md
- CHANGELOG.md

MIT