halref — Hallucinated Reference Finder

Detect hallucinated references in academic PDFs. Extracts citations from papers, verifies them against multiple academic databases, and produces a ranked report of references most likely to be fabricated.

This tool is designed to work with papers that use acl.sty (the standard ACL/EMNLP/NAACL LaTeX style). The reference extraction pipeline is tuned for the ACL natbib bibliography format and two-column layout. Papers using other styles may produce incomplete or incorrect results.

What it does

Extracts references from PDF locally (no data leaves your machine)
Converts to BibTeX with parsed authors, titles, years, and venues
Verifies each reference against Semantic Scholar, CrossRef, DBLP, and OpenAlex
Scores hallucination likelihood based on title match, author accuracy, year correctness, and cross-database consensus
Reports flagged references with side-by-side comparison showing exactly what differs

Quick start

# Install
pip install -e .

# Extract references from a PDF (no network needed)
halref extract paper.pdf -d output/

# Check a directory of PDFs for hallucinated references
halref check submissions/ -d results/

# Check with all references shown (not just flagged ones)
halref check paper.pdf -d results/ --show-ok

Installation

Requires Python 3.11+.

# Core install
pip install -e .

# With optional backends
pip install -e ".[docling,marker,llm,acl]"

Dependencies

Core (always installed):

pdfminer.six — primary PDF text extraction (best two-column handling)
pypdf — secondary PDF extractor
pdfplumber — fallback PDF extractor
rapidfuzz — fuzzy string matching for title/author comparison
nameparser — author name parsing
bibtexparser — BibTeX reading/writing
aiohttp + aiolimiter — async API clients with rate limiting
typer + rich — CLI and terminal output
pydantic — data models and configuration

Optional:

docling — IBM's layout-aware document converter
marker-pdf — ML-based PDF to markdown
openai — for local LLM verification via vLLM/Ollama
acl-anthology — ACL Anthology local search

Usage

Check references

# Single file
halref check paper.pdf

# Directory of PDFs
halref check submissions/

# Multiple files
halref check paper1.pdf paper2.pdf paper3.pdf

# Custom output directory and threshold
halref check submissions/ -d results/ --threshold 0.3

# Show all references, not just flagged
halref check paper.pdf --show-ok

# With local LLM for ambiguous cases
halref check paper.pdf --llm --llm-base-url http://localhost:11434/v1 --llm-model llama3

Extract references only (no verification)

halref extract paper.pdf -d output/
halref extract submissions/ -d bibs/
halref extract paper.pdf --ref-pages 9-13

Output

All output goes to --outdir (default: halref_output/):

halref_output/
├── bib/
│   ├── paper1.bib          # Extracted references per PDF
│   └── paper2.bib
├── report.json              # Full verification results
└── report_annotated.bib     # BibTeX with hallucination scores as comments

The terminal report shows flagged references with:

Side-by-side comparison of extracted vs. matched reference
Color-coded differences (red = year/author mismatch, yellow = partial match)
Full author lists for both versions
Title similarity percentage

Configuration

Create a config.toml or set environment variables:

# API keys (recommended for faster rate limits)
export SEMANTIC_SCHOLAR_API_KEY="your-key"   # https://www.semanticscholar.org/product/api#api-key
export OPENALEX_API_KEY="your-key"           # https://openalex.org/users/me
export CROSSREF_MAILTO="you@example.com"     # No signup, just your email

Or use a config file:

[extraction]
text_extractors = ["pdfminer"]         # pdfminer, pypdf, pdfplumber, docling, marker
field_parsers = ["regex", "heuristic"] # regex, heuristic, llm, api
ref_pages = ""                         # e.g., "9-13" to limit extraction

[apis.semantic_scholar]
api_key = ""
enabled = true

[apis.crossref]
mailto = ""
enabled = true

[apis.dblp]
enabled = true

[apis.openalex]
api_key = ""
enabled = true

[matching]
title_threshold = 0.85
author_threshold = 0.6

[matching.weights]
title = 0.30
authors = 0.25
author_order = 0.15
year = 0.15
consensus = 0.15

[llm]
enabled = false
base_url = "http://localhost:8000/v1"
model = ""
api_key = ""

A .env file in the project root is also loaded automatically.

How it works

Extraction pipeline

References are extracted locally — no PDF data is sent to any external service.

Text extraction: pdfminer.six handles two-column ACL layouts correctly. Scans backwards through the PDF to find the "References" heading, extracts through the end while stopping at Appendix/Limitations sections. Line numbers from review-mode PDFs are stripped automatically.
Reference splitting: Multiple strategies (blank-line separation, author-year pattern detection, numbered [N] markers) with quality-based selection. A post-processing pass merges fragments split across page boundaries.
Field parsing: Regex parser tuned for ACL natbib format (Author, Author, and Author. Year. Title. In Venue.) with a heuristic fallback. Handles hyphenated words, accented names, and [N] prefixes.
Validation: Entries with single-word titles, author-list-as-title patterns, or missing critical fields are filtered from both the BibTeX output and the report.

Verification pipeline

Batch deduplication: When checking multiple PDFs, references are deduplicated across papers. Shared references (appearing in 3+ papers) are searched with lower priority since they're likely real.
Waterfall API search: APIs are queried in priority order (Semantic Scholar → CrossRef → DBLP → OpenAlex). Search stops early when a confident match is found (title similarity > 90%), avoiding unnecessary API calls.
Agentic retry strategies: If direct title search fails, the system tries: removing subtitles, searching by first author + year, partial title search with distinctive terms, and venue-constrained search.
Repair: Truncated references (from column-break extraction artifacts) are repaired by querying APIs for the complete metadata. All repaired data is verified against the original PDF text to prevent introducing hallucinated metadata.
Scoring: Each reference gets a hallucination score (0-1) based on weighted signals:
- Title mismatch (30%)
- Author mismatch (25%)
- Author order wrong (15%)
- Year mismatch (15%)
- Low cross-API consensus (15%)

Hallucination signals

The tool flags these specific patterns:

Signal	Description
Not found	No database contains a paper with a similar title
Wrong first author	The first author differs from the matched paper
Author order swapped	Same authors but in a different order
Year mismatch	Publication year differs (e.g., paper says 2019, actual is 2017)
Different title	Title is similar but not the same (possible fabrication)
Low consensus	Found in only 1 database instead of multiple

API rate limits

API	Without key	With key	Key signup
Semantic Scholar	~0.3 RPS (shared pool)	1 RPS (dedicated)	Free
CrossRef	~1 RPS	~5 RPS (polite pool)	Just set your email
DBLP	~2 RPS	N/A	No key needed
OpenAlex	~1 RPS	~5 RPS	Free

The tool respects rate limits automatically with per-API throttling and exponential backoff on 429 responses.

Testing

# Generate test PDFs (requires pdflatex)
python tests/fixtures/generate_test_pdf.py

# Run extraction tests (synthetic papers with known references)
python tests/test_extraction.py

# Download 10 real ACL papers from arxiv for ground truth testing
python tests/fixtures/download_acl_papers.py

# Run ground truth extraction tests
python tests/test_bib_extraction.py

Test results across 10 real ACL papers

Metric	Value
Precision	73-99% (median ~91%)
Recall	37-90% (median ~73%)
Year accuracy	96-100%
False positives	0-3 per paper

Library usage

import asyncio
from pathlib import Path
from halref.config import Config
from halref.extract.ensemble import extract_references
from halref.pipeline import run_check

# Extract references (offline)
config = Config.default()
refs = extract_references(Path("paper.pdf"), config)
for ref in refs:
    print(f"{ref.authors[0]} ({ref.year}). {ref.title}")

# Full verification
report = asyncio.run(run_check([Path("paper.pdf")], config))
for result in report.reports[0].ranked():
    if result.hallucination_score > 0.5:
        print(f"FLAGGED: {result.reference.title}")
        print(f"  Score: {result.hallucination_score:.2f}")
        print(f"  Signals: {result.signal_summary()}")

Privacy

All PDF processing happens locally. No PDF content is ever sent to external services. Only reference metadata (titles, author names, years) is sent to academic search APIs during the check command. The extract command is fully offline.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
acl-style		acl-style
src/halref		src/halref
tests		tests
.gitignore		.gitignore
README.md		README.md
config.example.toml		config.example.toml
pyproject.toml		pyproject.toml
requirements-extras.txt		requirements-extras.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

halref — Hallucinated Reference Finder

What it does

Quick start

Installation

Dependencies

Usage

Check references

Extract references only (no verification)

Output

Configuration

How it works

Extraction pipeline

Verification pipeline

Hallucination signals

API rate limits

Testing

Test results across 10 real ACL papers

Library usage

Privacy

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

halref — Hallucinated Reference Finder

What it does

Quick start

Installation

Dependencies

Usage

Check references

Extract references only (no verification)

Output

Configuration

How it works

Extraction pipeline

Verification pipeline

Hallucination signals

API rate limits

Testing

Test results across 10 real ACL papers

Library usage

Privacy

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages