Detect hallucinated references in academic PDFs. Extracts citations from papers, verifies them against multiple academic databases, and produces a ranked report of references most likely to be fabricated.
This tool is designed to work with papers that use acl.sty (the standard ACL/EMNLP/NAACL LaTeX style). The reference extraction pipeline is tuned for the ACL natbib bibliography format and two-column layout. Papers using other styles may produce incomplete or incorrect results.
- Extracts references from PDF locally (no data leaves your machine)
- Converts to BibTeX with parsed authors, titles, years, and venues
- Verifies each reference against Semantic Scholar, CrossRef, DBLP, and OpenAlex
- Scores hallucination likelihood based on title match, author accuracy, year correctness, and cross-database consensus
- Reports flagged references with side-by-side comparison showing exactly what differs
# Install
pip install -e .
# Extract references from a PDF (no network needed)
halref extract paper.pdf -d output/
# Check a directory of PDFs for hallucinated references
halref check submissions/ -d results/
# Check with all references shown (not just flagged ones)
halref check paper.pdf -d results/ --show-okRequires Python 3.11+.
# Core install
pip install -e .
# With optional backends
pip install -e ".[docling,marker,llm,acl]"Core (always installed):
pdfminer.six— primary PDF text extraction (best two-column handling)pypdf— secondary PDF extractorpdfplumber— fallback PDF extractorrapidfuzz— fuzzy string matching for title/author comparisonnameparser— author name parsingbibtexparser— BibTeX reading/writingaiohttp+aiolimiter— async API clients with rate limitingtyper+rich— CLI and terminal outputpydantic— data models and configuration
Optional:
docling— IBM's layout-aware document convertermarker-pdf— ML-based PDF to markdownopenai— for local LLM verification via vLLM/Ollamaacl-anthology— ACL Anthology local search
# Single file
halref check paper.pdf
# Directory of PDFs
halref check submissions/
# Multiple files
halref check paper1.pdf paper2.pdf paper3.pdf
# Custom output directory and threshold
halref check submissions/ -d results/ --threshold 0.3
# Show all references, not just flagged
halref check paper.pdf --show-ok
# With local LLM for ambiguous cases
halref check paper.pdf --llm --llm-base-url http://localhost:11434/v1 --llm-model llama3halref extract paper.pdf -d output/
halref extract submissions/ -d bibs/
halref extract paper.pdf --ref-pages 9-13All output goes to --outdir (default: halref_output/):
halref_output/
├── bib/
│ ├── paper1.bib # Extracted references per PDF
│ └── paper2.bib
├── report.json # Full verification results
└── report_annotated.bib # BibTeX with hallucination scores as comments
The terminal report shows flagged references with:
- Side-by-side comparison of extracted vs. matched reference
- Color-coded differences (red = year/author mismatch, yellow = partial match)
- Full author lists for both versions
- Title similarity percentage
Create a config.toml or set environment variables:
# API keys (recommended for faster rate limits)
export SEMANTIC_SCHOLAR_API_KEY="your-key" # https://www.semanticscholar.org/product/api#api-key
export OPENALEX_API_KEY="your-key" # https://openalex.org/users/me
export CROSSREF_MAILTO="you@example.com" # No signup, just your emailOr use a config file:
[extraction]
text_extractors = ["pdfminer"] # pdfminer, pypdf, pdfplumber, docling, marker
field_parsers = ["regex", "heuristic"] # regex, heuristic, llm, api
ref_pages = "" # e.g., "9-13" to limit extraction
[apis.semantic_scholar]
api_key = ""
enabled = true
[apis.crossref]
mailto = ""
enabled = true
[apis.dblp]
enabled = true
[apis.openalex]
api_key = ""
enabled = true
[matching]
title_threshold = 0.85
author_threshold = 0.6
[matching.weights]
title = 0.30
authors = 0.25
author_order = 0.15
year = 0.15
consensus = 0.15
[llm]
enabled = false
base_url = "http://localhost:8000/v1"
model = ""
api_key = ""A .env file in the project root is also loaded automatically.
References are extracted locally — no PDF data is sent to any external service.
-
Text extraction: pdfminer.six handles two-column ACL layouts correctly. Scans backwards through the PDF to find the "References" heading, extracts through the end while stopping at Appendix/Limitations sections. Line numbers from review-mode PDFs are stripped automatically.
-
Reference splitting: Multiple strategies (blank-line separation, author-year pattern detection, numbered
[N]markers) with quality-based selection. A post-processing pass merges fragments split across page boundaries. -
Field parsing: Regex parser tuned for ACL natbib format (
Author, Author, and Author. Year. Title. In Venue.) with a heuristic fallback. Handles hyphenated words, accented names, and[N]prefixes. -
Validation: Entries with single-word titles, author-list-as-title patterns, or missing critical fields are filtered from both the BibTeX output and the report.
-
Batch deduplication: When checking multiple PDFs, references are deduplicated across papers. Shared references (appearing in 3+ papers) are searched with lower priority since they're likely real.
-
Waterfall API search: APIs are queried in priority order (Semantic Scholar → CrossRef → DBLP → OpenAlex). Search stops early when a confident match is found (title similarity > 90%), avoiding unnecessary API calls.
-
Agentic retry strategies: If direct title search fails, the system tries: removing subtitles, searching by first author + year, partial title search with distinctive terms, and venue-constrained search.
-
Repair: Truncated references (from column-break extraction artifacts) are repaired by querying APIs for the complete metadata. All repaired data is verified against the original PDF text to prevent introducing hallucinated metadata.
-
Scoring: Each reference gets a hallucination score (0-1) based on weighted signals:
- Title mismatch (30%)
- Author mismatch (25%)
- Author order wrong (15%)
- Year mismatch (15%)
- Low cross-API consensus (15%)
The tool flags these specific patterns:
| Signal | Description |
|---|---|
| Not found | No database contains a paper with a similar title |
| Wrong first author | The first author differs from the matched paper |
| Author order swapped | Same authors but in a different order |
| Year mismatch | Publication year differs (e.g., paper says 2019, actual is 2017) |
| Different title | Title is similar but not the same (possible fabrication) |
| Low consensus | Found in only 1 database instead of multiple |
| API | Without key | With key | Key signup |
|---|---|---|---|
| Semantic Scholar | ~0.3 RPS (shared pool) | 1 RPS (dedicated) | Free |
| CrossRef | ~1 RPS | ~5 RPS (polite pool) | Just set your email |
| DBLP | ~2 RPS | N/A | No key needed |
| OpenAlex | ~1 RPS | ~5 RPS | Free |
The tool respects rate limits automatically with per-API throttling and exponential backoff on 429 responses.
# Generate test PDFs (requires pdflatex)
python tests/fixtures/generate_test_pdf.py
# Run extraction tests (synthetic papers with known references)
python tests/test_extraction.py
# Download 10 real ACL papers from arxiv for ground truth testing
python tests/fixtures/download_acl_papers.py
# Run ground truth extraction tests
python tests/test_bib_extraction.py| Metric | Value |
|---|---|
| Precision | 73-99% (median ~91%) |
| Recall | 37-90% (median ~73%) |
| Year accuracy | 96-100% |
| False positives | 0-3 per paper |
import asyncio
from pathlib import Path
from halref.config import Config
from halref.extract.ensemble import extract_references
from halref.pipeline import run_check
# Extract references (offline)
config = Config.default()
refs = extract_references(Path("paper.pdf"), config)
for ref in refs:
print(f"{ref.authors[0]} ({ref.year}). {ref.title}")
# Full verification
report = asyncio.run(run_check([Path("paper.pdf")], config))
for result in report.reports[0].ranked():
if result.hallucination_score > 0.5:
print(f"FLAGGED: {result.reference.title}")
print(f" Score: {result.hallucination_score:.2f}")
print(f" Signals: {result.signal_summary()}")All PDF processing happens locally. No PDF content is ever sent to external services. Only reference metadata (titles, author names, years) is sent to academic search APIs during the check command. The extract command is fully offline.
MIT