ReliabilityLoop

Verifier-driven framework for improving local LLM reliability with adaptive inference.

ReliabilityLoop evaluates whether model outputs actually work (not just look plausible), then applies runtime strategies to improve reliability under cost and latency constraints.

Why ReliabilityLoop

Executable reliability checks: JSON schema, SQL execution, Python unit tests
Policy routing: choose baseline_first or contract_first per task type
Adaptive compute: per-task best-of-k and per-task token budgets
Verified memory reuse: reuse proven outputs from earlier runs (wins.jsonl)
Reproducible artifacts: every run outputs summary.json, leaderboard.md, samples.jsonl, wins.jsonl
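
To make "executable reliability checks" concrete, here is a minimal sketch of what the three kinds of verifier could look like. It is not ReliabilityLoop's own implementation: the function names `verify_json`, `verify_sql`, and `verify_code` are hypothetical, the JSON check uses a simple key/type map rather than a full JSON Schema validator, and the SQL check assumes an in-memory SQLite database.

```python
import json
import sqlite3


def verify_json(output: str, required_keys: dict[str, type]) -> bool:
    """Pass only if the output parses as a JSON object with the expected keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in required_keys.items()
    )


def verify_sql(query: str, setup_sql: str, expected_rows: list[tuple]) -> bool:
    """Pass only if the query runs on a scratch database and returns the expected rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # build the fixture tables
        rows = conn.execute(query).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()
    return rows == expected_rows


def verify_code(source: str, test_source: str) -> bool:
    """Pass only if the generated code defines what the test asserts on.

    exec() on model output is unsafe outside a sandbox; this is illustration only.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True
```

Each verifier returns a plain boolean, so a reliability score for a task family can be computed as the fraction of outputs that pass.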

Benchmark Scope (v1)

Canonical split: eval/reliability_v1_60.jsonl

20 JSON tasks
20 SQL tasks
20 code tasks

Spec: eval/RELIABILITY_V1_SPEC.md

Baseline Result (Example)

From examples/leaderboard_60_baseline.md on qwen2.5-coder:0.5b:

| model | policy reliability | json | sql | code | policy latency (s) |
|---|---|---|---|---|---|
| `qwen2.5-coder:0.5b` | 0.867 | 1.000 | 1.000 | 0.600 | 2.428 |

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Quick Start

reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --max-tokens 96 \
  --policy-json contract_first \
  --policy-sql baseline_first \
  --policy-code baseline_first

Outputs are saved to eval/reliability_runs/<timestamp>/.
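
A small helper can find the newest run directory and confirm that the four artifacts are present. This is a sketch under assumptions: the helper names `latest_run` and `missing_artifacts` are hypothetical, and "newest" is judged by directory modification time rather than by parsing the timestamp in the directory name, whose exact format is not specified here.

```python
from pathlib import Path
from typing import Optional

# Artifact names as listed in this README.
EXPECTED_ARTIFACTS = ("summary.json", "leaderboard.md", "samples.jsonl", "wins.jsonl")


def latest_run(root: Path) -> Optional[Path]:
    """Return the most recently modified run directory under root, or None."""
    runs = [p for p in root.iterdir() if p.is_dir()] if root.is_dir() else []
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def missing_artifacts(run_dir: Path) -> list[str]:
    """List the expected artifact files that are absent from run_dir."""
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]
```

For example, `missing_artifacts(latest_run(Path("eval/reliability_runs")))` should return an empty list after a completed run, assuming at least one run directory exists.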

Adaptive Inference Example

reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --max-tokens 96 \
  --max-tokens-json 256 \
  --best-of-k 1 \
  --policy-json contract_first \
  --policy-sql baseline_first \
  --policy-code baseline_first
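
The idea behind the flags above can be sketched in a few lines: a per-task token budget overrides the default, and best-of-k stops sampling as soon as a candidate passes its verifier. This is an illustration only, not the tool's internal logic. The names `token_budget` and `best_of_k` are hypothetical, and the override and early-stopping behaviour are assumptions about how such flags are typically resolved.

```python
from typing import Callable, Optional

# Mirrors --max-tokens 96 and --max-tokens-json 256 from the command above.
DEFAULT_MAX_TOKENS = 96
MAX_TOKENS_BY_TASK = {"json": 256}


def token_budget(task_type: str) -> int:
    """Use the per-task budget when one is set, else fall back to the default."""
    return MAX_TOKENS_BY_TASK.get(task_type, DEFAULT_MAX_TOKENS)


def best_of_k(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    k: int,
) -> tuple[Optional[str], int]:
    """Sample up to k candidates; return the first verified one and the attempts used."""
    for attempt in range(1, k + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt  # stop early to save compute
    return None, k  # nothing passed within the budget
```

With `k=1`, as in the command above, this reduces to a single sample followed by one verification.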

Verified Memory Example

# first run
RUN_A=$(reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 | sed -n 's/^- outdir: //p')

# second run with memory
reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --memory-file "$RUN_A/wins.jsonl" \
  --memory-top-k 2
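
One way to picture verified memory reuse is as a lookup over wins.jsonl that returns the k stored entries most similar to a new prompt. The sketch below is an assumption-heavy illustration, not the tool's retrieval method: the `prompt` field name is hypothetical (the real wins.jsonl schema is not documented here), and word-level Jaccard overlap stands in for whatever similarity measure is actually used.

```python
import json
from pathlib import Path


def load_wins(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    left, right = set(a.lower().split()), set(b.lower().split())
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def top_k_wins(prompt: str, wins: list[dict], k: int) -> list[dict]:
    """Return the k stored wins whose prompts overlap most with the new prompt."""
    # "prompt" is an assumed field name; adjust to the real schema.
    ranked = sorted(wins, key=lambda w: jaccard(prompt, w["prompt"]), reverse=True)
    return ranked[:k]
```

Because every entry in wins.jsonl previously passed a verifier, the retrieved entries can serve as known-good examples for the new prompt.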

Hugging Face Dataset

https://huggingface.co/datasets/ranausmans/reliabilityloop-v1

Release Docs

RELEASE.md
CHANGELOG.md

License

MIT
