Verifier-driven framework for local LLM reliability and adaptive inference

ranausmanai/reliabilityloop

ReliabilityLoop

Verifier-driven framework for improving local LLM reliability with adaptive inference.

ReliabilityLoop evaluates whether model outputs actually work (not just look plausible), then applies runtime strategies to improve reliability under cost and latency constraints.

ReliabilityLoop Leaderboard Preview

Why ReliabilityLoop

  • Executable reliability checks: JSON schema, SQL execution, Python unit tests
  • Policy routing: choose baseline_first or contract_first per task type
  • Adaptive compute: per-task best-of-k and per-task token budgets
  • Verified memory reuse: reuse proven outputs from earlier runs (wins.jsonl)
  • Reproducible artifacts: every run outputs summary.json, leaderboard.md, samples.jsonl, wins.jsonl
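The three executable checks listed above can be sketched roughly as follows. This is a minimal illustration, not ReliabilityLoop's actual verifier code: the function names `check_json`, `check_sql`, and `check_code` are hypothetical, the JSON check is a simplified key/type test rather than full JSON Schema validation, and SQLite is assumed as the SQL engine.

```python
import json
import sqlite3
import subprocess
import sys
import tempfile
from pathlib import Path


def check_json(output: str, required_keys: dict) -> bool:
    """Pass if the output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(k in data and isinstance(data[k], t) for k, t in required_keys.items())


def check_sql(query: str, setup_sql: str, expected_rows: list) -> bool:
    """Pass if the query runs on an in-memory SQLite database and returns the expected rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # hypothetical fixture: schema plus seed rows
        return conn.execute(query).fetchall() == expected_rows
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def check_code(candidate: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Pass if the candidate followed by its asserts exits with status 0 in a subprocess."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)], capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0
```

Each check returns a plain boolean, so a run's pass rate per task type is just the fraction of `True` results. Running generated code in a subprocess with a timeout keeps a hanging or crashing candidate from taking down the evaluator, though it is not a security sandbox.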

Benchmark Scope (v1)

Canonical split: eval/reliability_v1_60.jsonl

  • 20 JSON tasks
  • 20 SQL tasks
  • 20 code tasks

Spec: eval/RELIABILITY_V1_SPEC.md

Baseline Result (Example)

From examples/leaderboard_60_baseline.md on qwen2.5-coder:0.5b:

model               policy reliability  json   sql    code   policy latency (s)
qwen2.5-coder:0.5b  0.867               1.000  1.000  0.600  2.428
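As a sanity check on the row above, the overall figure is consistent with a simple pass fraction over all 60 tasks. The sketch below assumes that definition (the README does not state it explicitly) and uses the 20/20/20 split from the benchmark scope.

```python
# Assumption: "policy reliability" is passed tasks / total tasks across all task types.
TASKS_PER_TYPE = 20
pass_rates = {"json": 1.000, "sql": 1.000, "code": 0.600}

# Convert each per-type pass rate back into a count of passed tasks.
passed = {name: round(rate * TASKS_PER_TYPE) for name, rate in pass_rates.items()}
overall = sum(passed.values()) / (TASKS_PER_TYPE * len(pass_rates))

print(passed)            # {'json': 20, 'sql': 20, 'code': 12}
print(f"{overall:.3f}")  # 0.867
```

Under this reading, 52 of 60 tasks pass, and all 8 failures are code tasks.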

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Quick Start

reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --max-tokens 96 \
  --policy-json contract_first \
  --policy-sql baseline_first \
  --policy-code baseline_first

Outputs are saved to eval/reliability_runs/<timestamp>/.
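To pick up the most recent run from a script, something like the following could work. It is a sketch that relies only on the directory layout and artifact names mentioned in this README; the helper names `latest_run` and `missing_artifacts` are hypothetical, and the newest run is chosen by modification time rather than by parsing the timestamp, whose format is not documented here.

```python
from pathlib import Path

# Artifact names as listed in this README.
EXPECTED_ARTIFACTS = ("summary.json", "leaderboard.md", "samples.jsonl", "wins.jsonl")


def latest_run(root="eval/reliability_runs"):
    """Return the most recently modified run directory, or None if there is none."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    runs = [p for p in root_path.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def missing_artifacts(run_dir):
    """List the expected artifact files that are absent from a run directory."""
    return [name for name in EXPECTED_ARTIFACTS if not (Path(run_dir) / name).is_file()]


if __name__ == "__main__":
    run = latest_run()
    if run is None:
        print("no runs found")
    else:
        print(run, "missing:", missing_artifacts(run))
```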

Adaptive Inference Example

reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --max-tokens 96 \
  --max-tokens-json 256 \
  --best-of-k 1 \
  --policy-json contract_first \
  --policy-sql baseline_first \
  --policy-code baseline_first
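The idea behind per-task budgets and best-of-k can be illustrated with a short sketch. It is not the project's implementation: `generate` and `verify` are hypothetical callables supplied by the caller, and returning the first verified candidate is one possible selection rule among several. The budget values mirror the command above (`--max-tokens 96`, `--max-tokens-json 256`).

```python
from typing import Callable, Optional

DEFAULT_MAX_TOKENS = 96          # mirrors --max-tokens
PER_TASK_BUDGETS = {"json": 256}  # mirrors --max-tokens-json


def token_budget(task_type: str) -> int:
    """Use the per-task budget when one is set, otherwise the default."""
    return PER_TASK_BUDGETS.get(task_type, DEFAULT_MAX_TOKENS)


def best_of_k(
    prompt: str,
    generate: Callable[[str, int], str],  # hypothetical: (prompt, max_tokens) -> output
    verify: Callable[[str], bool],        # hypothetical: executable check for the task
    k: int,
    max_tokens: int,
) -> Optional[str]:
    """Sample up to k candidates and return the first one that passes verification."""
    for _ in range(k):
        candidate = generate(prompt, max_tokens)
        if verify(candidate):
            return candidate
    return None  # no candidate verified within the budget of k samples
```

With `k=1` this reduces to a single verified attempt, which matches the `--best-of-k 1` setting in the command. Stopping at the first verified candidate keeps latency low on easy tasks while spending extra samples only where the first output fails.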

Verified Memory Example

# first run
RUN_A=$(reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 | sed -n 's/^- outdir: //p')

# second run with memory
reliabilityloop reliability \
  --backend ollama \
  --model qwen2.5-coder:0.5b \
  --prompts-file eval/reliability_v1_60.jsonl \
  --limit 60 \
  --memory-file "$RUN_A/wins.jsonl" \
  --memory-top-k 2
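Conceptually, memory reuse means looking up previously verified outputs that resemble the current prompt. The sketch below shows one way such a lookup could work. It rests on loud assumptions: the record fields `prompt` and `output` are hypothetical (the `wins.jsonl` schema is not documented in this README), and word-overlap similarity is a stand-in for whatever retrieval the tool actually uses.

```python
import json
from pathlib import Path


def load_wins(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_k_memories(prompt: str, records: list, k: int = 2) -> list:
    """Return the k records whose (assumed) 'prompt' field is most similar to the prompt."""
    ranked = sorted(records, key=lambda r: jaccard(prompt, r["prompt"]), reverse=True)
    return ranked[:k]
```

The default `k=2` mirrors `--memory-top-k 2` in the command above. Retrieved records could then be prepended to the prompt as worked examples.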

Hugging Face Dataset

Release Docs

  • RELEASE.md
  • CHANGELOG.md

License

MIT
