corpusgen

Language-agnostic framework for generating and evaluating speech corpora with maximal phoneme coverage.

corpusgen helps you build phonetically-balanced text corpora for speech synthesis (TTS), speech recognition (ASR), and clinical speech assessment — in any language.

Features

  • Evaluate any text corpus for phoneme, diphone, or triphone coverage

  • PHOIBLE integration — phoneme inventories for 2,186 languages (3,020 inventories)

  • Grapheme-to-phoneme via espeak-ng for 100+ languages

  • Espeak ↔ PHOIBLE mapping — seamless bridge between G2P and phonological databases

  • Distribution quality metrics — Shannon entropy, normalized entropy, JSD (vs uniform or reference), Pearson correlation, coefficient of variation, PCD composite score

  • Coverage trajectory tracking — step-by-step coverage saturation curves for any selection or generation result

  • Text quality metrics — sentence length stats, vocabulary diversity (TTR, hapax ratio), Flesch readability scores

  • Error rate metrics — WER, CER, PER, SER with per-sentence breakdowns and corpus-level micro-averaging

  • Corpus-level perplexity — batched LM perplexity via GPT-2 (or any causal LM), both token-weighted corpus perplexity and sentence-weighted mean, with model sharing support

  • Structured reports — three verbosity levels, JSON export, JSON-LD-EX compatibility

  • 40-language test suite — validated across 12 language families

  • 6 selection algorithms for corpus optimization:

    • Greedy Set Cover — ln(n)+1 approximation, the standard workhorse
    • CELF — lazy evaluation speedup, identical results up to 700× faster
    • Stochastic Greedy — (1-1/e-ε) approximation, scales to massive corpora
    • ILP — exact optimal solutions via Integer Linear Programming (ground truth)
    • Distribution-Aware — KL-divergence minimization for frequency matching
    • NSGA-II — multi-objective Pareto optimization (coverage × cost × distribution)
  • Phoneme weighting — uniform, frequency-inverse, and linguistic class strategies

  • Phon-CTG generation framework — orchestrated corpus generation with pluggable backends:

    • Repository backend — select from sentence pools (pre-phonemized, raw text, or HuggingFace datasets)
    • LLM API backend — generate targeted sentences via OpenAI/Anthropic/Ollama (BYO API key)
    • Local model backend — HuggingFace transformers with CUDA auto-detect and 4-bit/8-bit quantization
  • Phon-DATG — inference-time logit steering for phonetically-targeted local generation

  • Phon-RL — PPO-based policy fine-tuning with composite phonetic reward (custom implementation, no trl dependency)

  • Built-in scorers — n-gram phonotactic naturalness + LM perplexity fluency scoring

  • CLI — corpusgen evaluate, corpusgen select, corpusgen inventory, corpusgen generate from the command line
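
The Greedy Set Cover strategy listed above can be sketched in a few lines of plain Python. This is an illustrative toy, not corpusgen's implementation (which additionally handles units, weights, and stopping criteria); it just shows the core idea of repeatedly picking the sentence with the largest marginal coverage gain:

```python
def greedy_set_cover(phoneme_sets, targets):
    """Greedily pick indices of sets that maximize marginal coverage."""
    remaining = set(targets)
    selected = []
    while remaining:
        best_i, best_gain = None, 0
        for i, s in enumerate(phoneme_sets):
            if i in selected:
                continue
            gain = len(s & remaining)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no candidate adds new coverage
            break
        selected.append(best_i)
        remaining -= phoneme_sets[best_i]
    return selected

pool = [{"p", "t", "k"}, {"p", "b"}, {"s", "z", "t"}, {"m", "n"}]
print(greedy_set_cover(pool, {"p", "t", "k", "s", "z", "m", "n"}))  # → [0, 2, 3]
```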

Prerequisites

espeak-ng (required)

corpusgen uses espeak-ng for grapheme-to-phoneme conversion. Install it before using corpusgen.

Windows
  1. Download the latest .msi installer from espeak-ng releases
  2. Run the installer (default path: C:\Program Files\eSpeak NG\)
  3. Set the environment variable so Python can find the shared library:
[Environment]::SetEnvironmentVariable("PHONEMIZER_ESPEAK_LIBRARY", "C:\Program Files\eSpeak NG\libespeak-ng.dll", "User")
  4. Restart your terminal and verify:
espeak-ng --version
macOS
brew install espeak-ng
Linux (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y espeak-ng
Docker / CI
RUN apt-get update && apt-get install -y espeak-ng && rm -rf /var/lib/apt/lists/*

PHOIBLE data (recommended)

To use PHOIBLE phoneme inventories (2,186 languages), download the data on first use:

from corpusgen.inventory import PhoibleDataset
PhoibleDataset().download()  # cached at ~/.corpusgen/phoible.csv (~24 MB)

This only needs to be done once.

Installation

From PyPI

pip install corpusgen

Development setup

git clone https://github.com/jemsbhai/corpusgen.git
cd corpusgen
poetry install
poetry run pytest

With local model support (GPU recommended)

For Phon-RL training and Phon-DATG logit steering with local models:

# 1. Install corpusgen with local model dependencies
poetry install --with local

# 2. IMPORTANT: Replace CPU torch with CUDA torch for GPU acceleration.
#    The default Poetry install pulls CPU-only torch from PyPI.
#    For NVIDIA GPUs (CUDA 12.1):
pip install torch --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# Verify GPU is available:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

Note: Check pytorch.org/get-started for the correct CUDA version matching your driver. Common options: cu118, cu121, cu124.

Quick Start

Evaluate a corpus for phoneme coverage

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps over the lazy dog.",
     "She sells seashells by the seashore.",
     "Pack my box with five dozen liquor jugs."],
    language="en-us",
    target_phonemes="phoible",
)

print(report.render())
print(report.coverage)           # 0.65
print(report.missing_phonemes)   # {'ʒ', 'ð', 'θ', ...}

Select optimal sentences from a candidate pool

import corpusgen

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
    "How much wood would a woodchuck chuck?",
    "To be or not to be, that is the question.",
]

result = corpusgen.select_sentences(
    candidates,
    language="en-us",
    algorithm="greedy",  # or "celf", "stochastic", "ilp", "distribution", "nsga2"
)

print(f"Selected {result.num_selected} of {len(candidates)} sentences")
print(f"Coverage: {result.coverage:.1%}")

Generate a corpus from a sentence pool

The fastest way is the CLI:

# Select best sentences from a pool for maximal phoneme coverage
corpusgen generate -b repository -l en-us --file pool.txt --max-sentences 50

# With multi-objective scoring (coverage + phonotactic naturalness)
corpusgen generate -b repository -l en-us --file pool.txt \
  --coverage-weight 0.7 --phonotactic-weight 0.3 --phonotactic-scorer ngram

Or use the Python API for full control:

from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_ctg.scorer import PhoneticScorer
from corpusgen.generate.phon_ctg.loop import GenerationLoop, StoppingCriteria
from corpusgen.generate.backends.repository import RepositoryBackend
from corpusgen.g2p.manager import G2PManager

# 1. Phonemize a sentence pool
g2p = G2PManager()
sentences = ["The cat sat on the mat.", "Big dogs bark loudly.", ...]
results = g2p.phonemize_batch(sentences, language="en-us")
pool = [
    {"text": s, "phonemes": r.phonemes}
    for s, r in zip(sentences, results) if r.phonemes
]

# 2. Set up targets, scorer, and backend
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k", "g"],
    unit="phoneme",
)
scorer = PhoneticScorer(targets=targets, coverage_weight=1.0)
backend = RepositoryBackend(pool=pool)

# 3. Run the generation loop
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(
        target_coverage=0.9,
        max_sentences=20,
    ),
)
result = loop.run()

print(f"Generated {result.num_generated} sentences, coverage: {result.coverage:.1%}")

Generate with an LLM API

Via CLI:

# Requires: poetry install --with llm
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --max-sentences 20

Or Python API:

from corpusgen.generate.backends.llm_api import LLMBackend

# Requires: poetry install --with llm
# Set your API key: export OPENAI_API_KEY=...
backend = LLMBackend(
    model="gpt-4o-mini",
    language="en-us",
)

# Use with the same GenerationLoop as above
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(target_coverage=0.9),
)
result = loop.run()

Fine-tune a model with phonetic reward (Phon-RL)

from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_rl.reward import PhoneticReward
from corpusgen.generate.phon_rl.trainer import PhonRLTrainer, TrainingConfig

# Requires: poetry install --with local

# 1. Define targets and reward
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k"],
    unit="phoneme",
)
reward = PhoneticReward(targets=targets, coverage_weight=1.0)

# 2. Configure PPO training
config = TrainingConfig(
    model_name="gpt2",
    num_steps=100,
    learning_rate=1e-5,
    kl_coeff=0.1,
    use_peft=True,     # LoRA for parameter-efficient training
    peft_r=8,
    peft_alpha=16,
    device=None,        # auto-detect GPU
)

# 3. Train with dynamic prompts that adapt to coverage gaps
def make_prompt(targets):
    missing = targets.next_targets(5)
    return f"Write a sentence using these sounds: {', '.join(missing)}"

trainer = PhonRLTrainer(reward=reward, config=config)
result = trainer.train(prompt_fn=make_prompt)

print(f"Final coverage: {result.final_coverage:.1%}")
trainer.save_checkpoint("./phon_rl_checkpoint")

Use PHOIBLE inventories directly

from corpusgen import get_inventory

inv = get_inventory("en-us")
print(inv.language_name)          # 'English'
print(inv.consonants)             # ['p', 'b', 't', 'd', 'k', ...]
print(inv.vowels)                 # ['iː', 'ɪ', 'ɛ', 'æ', ...]

# Query by distinctive features
nasals = inv.segments_with_feature("nasal", "+")
print([s.phoneme for s in nasals])  # ['m', 'n', 'ŋ']

Evaluate with diphone or triphone coverage

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps."],
    language="en-us",
    target_phonemes="phoible",
    unit="diphone",
)
print(f"Diphone coverage: {report.coverage:.1%}")
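
For intuition, diphones and triphones are just sliding n-grams over the phoneme sequence. A minimal sketch (corpusgen derives these internally from G2P output):

```python
def phoneme_ngrams(phonemes, n):
    """Sliding n-grams over a phoneme sequence (diphones: n=2, triphones: n=3)."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

phonemes = ["ð", "ə", "k", "æ", "t"]  # "the cat"
print(phoneme_ngrams(phonemes, 2))  # 4 diphones
print(phoneme_ngrams(phonemes, 3))  # 3 triphones
```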

Export reports

import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox."],
    language="en-us",
    target_phonemes="phoible",
)

# JSON
print(report.to_json(indent=2))

# JSON-LD (linked data)
doc = report.to_jsonld_ex()

# Human-readable at different verbosity levels
from corpusgen.evaluate.report import Verbosity
print(report.render(verbosity=Verbosity.MINIMAL))
print(report.render(verbosity=Verbosity.NORMAL))
print(report.render(verbosity=Verbosity.VERBOSE))

Analyze distribution quality

import corpusgen

report = corpusgen.evaluate(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    language="en-us",
    target_phonemes="phoible",
)

# Distribution metrics are auto-computed
dm = report.distribution
print(f"Normalized entropy: {dm.normalized_entropy:.4f}")  # 1.0 = perfectly uniform
print(f"JSD vs uniform: {dm.jsd_uniform:.6f}")             # 0.0 = perfectly uniform
print(f"PCD (uniform): {dm.pcd_uniform:.4f}")              # coverage × (1 - JSD)

# Compare against a natural language reference distribution
from corpusgen.evaluate.distribution import compute_distribution_metrics

reference = {"p": 0.04, "t": 0.07, "k": 0.03, "ə": 0.12}  # example frequencies
dm_ref = compute_distribution_metrics(
    report.phoneme_counts, report.target_phonemes, reference_distribution=reference
)
print(f"JSD vs reference: {dm_ref.jsd_reference:.6f}")
print(f"Pearson correlation: {dm_ref.pearson_correlation}")
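
For intuition, the uniform-reference metrics can be reproduced with a few lines of plain Python. This is an illustrative sketch of what they measure, not corpusgen's implementation; the PCD here follows the coverage × (1 − JSD) formula noted in the comment above:

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence, base 2 (bounded in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

counts = {"p": 4, "t": 7, "k": 3, "s": 6}   # toy phoneme counts
total = sum(counts.values())
p = [c / total for c in counts.values()]
uniform = [1 / len(p)] * len(p)

coverage = 1.0                              # all 4 toy targets observed
pcd = coverage * (1 - jsd(p, uniform))
print(f"entropy={shannon_entropy(p):.4f} bits, JSD={jsd(p, uniform):.6f}, PCD={pcd:.4f}")
```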

Plot coverage saturation curves

from corpusgen.evaluate.trajectory import compute_coverage_trajectory

# From a SelectionResult (`result`) and the phoneme lists for all
# candidates (`candidate_phonemes`) that were passed to the selector
traj = compute_coverage_trajectory(
    [candidate_phonemes[i] for i in result.selected_indices],
    target_units=result.covered_units | result.missing_units,
    unit=result.unit,
)

# Easy plotting
import matplotlib.pyplot as plt
plt.plot(range(len(traj.coverages)), traj.coverages)
plt.xlabel("Sentences")
plt.ylabel("Coverage")
plt.title("Coverage Saturation Curve")
plt.show()

# Access marginal gains per sentence
print(traj.gains)  # [5, 3, 2, 1, 1, 0, ...]
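
The trajectory itself is simple to reason about: cumulative coverage after each sentence, plus the marginal gain each one contributed. A plain-Python sketch (illustrative, not corpusgen's implementation):

```python
def coverage_trajectory(phoneme_sets, targets):
    """Cumulative coverage and per-sentence marginal gains."""
    targets = set(targets)
    covered, coverages, gains = set(), [], []
    for s in phoneme_sets:
        new = (s & targets) - covered
        gains.append(len(new))
        covered |= new
        coverages.append(len(covered) / len(targets))
    return coverages, gains

sets = [{"p", "t", "k"}, {"t", "s"}, {"s"}]
print(coverage_trajectory(sets, {"p", "t", "k", "s", "m"}))  # → ([0.6, 0.8, 0.8], [3, 1, 0])
```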

Evaluate text quality

import corpusgen

report = corpusgen.evaluate(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    language="en-us",
)

# Text quality metrics are auto-computed
tq = report.text_quality
print(f"Type-Token Ratio: {tq.type_token_ratio:.3f}")
print(f"Flesch Reading Ease: {tq.flesch_reading_ease:.1f}")
print(f"Avg sentence length: {tq.sentence_length_words_mean:.1f} words")
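
Two of the vocabulary-diversity metrics are easy to compute by hand, which helps when sanity-checking a report. An illustrative sketch (not corpusgen's tokenization):

```python
from collections import Counter

tokens = "the cat sat on the mat".lower().split()
counts = Counter(tokens)

ttr = len(counts) / len(tokens)                                   # unique types / total tokens
hapax = sum(1 for c in counts.values() if c == 1) / len(counts)   # once-only types / types
print(f"TTR={ttr:.3f}, hapax ratio={hapax:.3f}")                  # TTR=0.833, hapax ratio=0.800
```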

Measure corpus perplexity

from corpusgen.evaluate.perplexity import compute_corpus_perplexity

# Simple — loads GPT-2 automatically (requires: poetry install --with local)
metrics = compute_corpus_perplexity(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    model_name="gpt2",
)

print(f"Corpus perplexity: {metrics.corpus_perplexity:.2f}")  # token-weighted (standard LM metric)
print(f"Mean perplexity:   {metrics.mean_perplexity:.2f}")    # sentence-weighted
print(f"Median:            {metrics.median_perplexity:.2f}")
print(f"Total tokens:      {metrics.num_tokens}")

# Per-sentence breakdown
for i, ppl in enumerate(metrics.per_sentence):
    print(f"  Sentence {i}: PPL = {ppl:.2f}")

# Shared model — avoids loading the same model twice when you are
# also using PerplexityFluencyScorer during generation:
from corpusgen.generate.scorers.fluency import PerplexityFluencyScorer

scorer = PerplexityFluencyScorer(model_name="gpt2", device="cuda")
scorer("warm-up call to trigger lazy load")

metrics = compute_corpus_perplexity(
    sentences,
    model=scorer._model,
    tokenizer=scorer._tokenizer,
)
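
The difference between the two aggregations is easiest to see with toy numbers: corpus perplexity exponentiates the token-weighted mean negative log-likelihood, while mean perplexity averages per-sentence values, so a short sentence counts as much as a long one (illustrative numbers, not real model output):

```python
import math

sent_nll = [10.0, 30.0]   # toy total negative log-likelihood per sentence
sent_tokens = [5, 20]

# Token-weighted: one big average over all tokens in the corpus.
corpus_ppl = math.exp(sum(sent_nll) / sum(sent_tokens))
# Sentence-weighted: average of per-sentence perplexities.
mean_ppl = sum(math.exp(n / t) for n, t in zip(sent_nll, sent_tokens)) / len(sent_nll)

print(f"corpus={corpus_ppl:.2f}, mean={mean_ppl:.2f}")  # corpus=4.95, mean=5.94
```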

Compare transcriptions with error rates

from corpusgen.evaluate.error_rates import compute_error_rates

result = compute_error_rates(
    references=["the cat sat on the mat", "big dogs dig deep holes"],
    hypotheses=["the cat sat on a mat", "big dog dig deep hole"],
)

print(f"WER: {result.wer:.2%}")   # corpus-level, micro-averaged
print(f"CER: {result.cer:.2%}")
print(f"SER: {result.ser:.2%}")

# With phoneme-level comparison
result = compute_error_rates(
    references=["the cat"],
    hypotheses=["a cat"],
    reference_phonemes=[["\u00f0", "\u0259", "k", "\u00e6", "t"]],
    hypothesis_phonemes=[["\u0259", "k", "\u00e6", "t"]],
)
print(f"PER: {result.per:.2%}")

# Per-sentence breakdown
for d in result.details:
    print(f"  [{d.index}] WER={d.wer:.2%} CER={d.cer:.2%}")
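
"Micro-averaged" means edit operations are summed over all sentences and divided by the total number of reference words, rather than averaging the per-sentence WERs. A minimal sketch with a standard Levenshtein distance (illustrative, not corpusgen's implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

refs = ["the cat sat on the mat", "big dogs dig deep holes"]
hyps = ["the cat sat on a mat", "big dog dig deep hole"]

errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
words = sum(len(r.split()) for r in refs)
print(f"micro-averaged WER: {errors / words:.2%}")  # 3 errors / 11 words ≈ 27.27%
```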

CLI Usage

# Show PHOIBLE phoneme inventory for a language
corpusgen inventory --language en-us
corpusgen inventory --language fr-fr --format json
corpusgen inventory --language en-us --source upsid

# Evaluate a corpus for phoneme coverage
corpusgen evaluate "The cat sat on the mat." --language en-us
corpusgen evaluate --file corpus.txt --language en-us --target phoible
corpusgen evaluate --file corpus.txt -l en-us --unit diphone --format json
corpusgen evaluate --file corpus.txt -l en-us --verbosity verbose

# Select optimal sentences from a candidate pool
corpusgen select --file candidates.txt --language en-us
corpusgen select -f pool.txt -l en-us --algorithm celf --max-sentences 50
corpusgen select -f pool.txt -l en-us --target phoible --target-coverage 0.95
corpusgen select -f pool.txt -l en-us --output selected.txt --format json

# Generate sentences targeting phoneme coverage
# --- Repository backend (sentence pool) ---
corpusgen generate -b repository -l en-us --file pool.txt --max-sentences 50
corpusgen generate -b repository -l en-us --file pool.txt --unit diphone --format json
corpusgen generate -b repository -l en-us --file pool.txt --phonemes "ʃ,ʒ,θ" --weights "ʃ:2.0,θ:1.5"
corpusgen generate -b repository -l en-us --file pool.txt --output generated.txt

# --- Repository backend with HuggingFace dataset ---
corpusgen generate -b repository -l en-us --dataset wikitext --split train --max-samples 1000

# --- LLM API backend (requires API key) ---
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --max-sentences 20
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --api-key sk-... --llm-temperature 0.9

# --- Local model backend (requires torch) ---
corpusgen generate -b local -l en-us --model gpt2 --device cuda --max-sentences 30
corpusgen generate -b local -l en-us --model gpt2 --quantization 4bit --local-temperature 0.7

# --- With built-in scorers (multi-objective candidate ranking) ---
corpusgen generate -b repository -l en-us --file pool.txt \
  --coverage-weight 0.6 \
  --phonotactic-weight 0.3 --phonotactic-scorer ngram \
  --fluency-weight 0.1 --fluency-scorer perplexity --fluency-model gpt2

# --- With corpus-trained phonotactic model ---
corpusgen generate -b repository -l en-us --file pool.txt \
  --phonotactic-weight 0.3 --phonotactic-scorer ngram \
  --phonotactic-corpus reference.txt --phonotactic-n 3

# --- With guidance strategies (local backend only) ---
corpusgen generate -b local -l en-us --model gpt2 --guidance datg --datg-boost 5.0
corpusgen generate -b local -l en-us --model gpt2 --guidance rl --rl-adapter-path ./checkpoint
corpusgen generate -b local -l en-us --model gpt2 --guidance datg --guidance-config datg.json

# --- Custom prompt templates ---
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini \
  --prompt-template "Write {k} English sentences containing: {target_units}"
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini \
  --prompt-template prompt.txt

Architecture

corpusgen/
├── cli/                  # Command-line interface
│   ├── evaluate.py       # corpusgen evaluate
│   ├── generate.py       # corpusgen generate
│   ├── inventory.py      # corpusgen inventory
│   └── select.py         # corpusgen select
├── g2p/                  # Grapheme-to-phoneme conversion
│   ├── manager.py        # G2PManager — multi-backend G2P (espeak-ng)
│   └── result.py         # G2PResult — phonemes, diphones, triphones
├── coverage/
│   └── tracker.py        # CoverageTracker — phoneme/diphone/triphone tracking
├── evaluate/
│   ├── evaluate.py       # evaluate() — top-level API
│   ├── report.py         # EvaluationReport, Verbosity
│   ├── distribution.py   # DistributionMetrics — JSD, entropy, PCD, Pearson
│   ├── trajectory.py     # CoverageTrajectory — step-by-step saturation curves
│   ├── text_quality.py   # TextQualityMetrics — TTR, readability, sentence stats
│   ├── error_rates.py    # WER, CER, PER, SER with edit distance
│   └── perplexity.py     # Corpus-level perplexity (batched, GPU-accelerated)
├── inventory/
│   ├── models.py         # Segment (38 features), Inventory
│   ├── phoible.py        # PhoibleDataset — PHOIBLE loader/cache/query
│   └── mapping.py        # EspeakMapping — espeak ↔ ISO 639-3
├── select/
│   ├── greedy.py         # GreedySelector
│   ├── celf.py           # CELFSelector (lazy evaluation)
│   ├── stochastic.py     # StochasticGreedySelector
│   ├── ilp.py            # ILPSelector (exact, optional: pulp)
│   ├── distribution.py   # DistributionAwareSelector (KL-divergence)
│   └── nsga2.py          # NSGA2Selector (Pareto, optional: pymoo)
├── weights/              # Phoneme weighting strategies
├── generate/
│   ├── phon_ctg/         # Orchestration framework
│   │   ├── targets.py    # PhoneticTargetInventory
│   │   ├── scorer.py     # PhoneticScorer (coverage + phonotactic + fluency)
│   │   ├── constraints.py # PhonotacticConstraint ABC + N-gram model
│   │   └── loop.py       # GenerationLoop + StoppingCriteria
│   ├── scorers/          # Built-in scoring functions
│   │   ├── phonotactic.py # NgramPhonotacticScorer (save/load, corpus-trained)
│   │   └── fluency.py    # PerplexityFluencyScorer (lazy LM, model sharing)
│   ├── phon_rl/          # RL-based guidance (PPO)
│   │   ├── reward.py     # PhoneticReward (composite, hierarchical)
│   │   ├── trainer.py    # PhonRLTrainer (custom PPO, no trl)
│   │   ├── policy.py     # PhonRLStrategy (GuidanceStrategy wrapper)
│   │   └── value_head.py # ValueHead (nn.Module for GAE)
│   ├── phon_datg/        # Inference-time logit steering
│   │   ├── attribute_words.py  # Vocabulary phonemization + index
│   │   ├── modulator.py  # Additive logit modulation
│   │   └── graph.py      # DATGStrategy (GuidanceStrategy)
│   ├── guidance.py       # GuidanceStrategy ABC
│   └── backends/         # Pluggable generation engines
│       ├── repository.py # Sentence pool selection + HuggingFace datasets
│       ├── llm_api.py    # Multi-provider LLM API (litellm)
│       └── local.py      # HuggingFace transformers + quantization

Language Support

corpusgen supports any language available in both espeak-ng and PHOIBLE:

  • G2P (espeak-ng): 100+ languages
  • Inventories (PHOIBLE): 2,186 languages, 3,020 inventories, 8 sources
  • Tested across: 40 languages, 12 language families, 10+ scripts

The espeak-to-PHOIBLE mapping covers 85+ languages with automatic macrolanguage resolution (e.g., ms → Standard Malay, sw → Swahili).

Reproducibility

For reproducible results across machines:

  1. Pin corpusgen version in your dependency file
  2. Pin espeak-ng version: Record espeak-ng --version in experiment logs
  3. Use poetry.lock: Pins all transitive dependencies
  4. Record PHOIBLE version: Note the download date of ~/.corpusgen/phoible.csv
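
A small helper can capture items 2 and 4 in one experiment-log entry. This is a sketch using only the standard library; the ~/.corpusgen/phoible.csv path comes from the download note above:

```python
import json
import shutil
import subprocess
from datetime import date
from pathlib import Path

def provenance_record():
    """Collect environment facts worth pinning alongside results."""
    espeak = shutil.which("espeak-ng")
    version = None
    if espeak:
        version = subprocess.run(
            [espeak, "--version"], capture_output=True, text=True
        ).stdout.strip()
    phoible = Path.home() / ".corpusgen" / "phoible.csv"
    return {
        "espeak_ng_version": version,   # None if espeak-ng is not installed
        "phoible_cached": phoible.exists(),
        "recorded_on": date.today().isoformat(),
    }

print(json.dumps(provenance_record(), indent=2))
```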

Citation

If you use corpusgen in your research, please cite:

@software{corpusgen2026,
  title={corpusgen: Language-Agnostic Speech Corpus Generation with Maximal Phoneme Coverage},
  author={Syed, Muntaser},
  year={2026},
  doi={10.5281/zenodo.18881479},
  url={https://github.com/jemsbhai/corpusgen}
}

License

Apache 2.0 — see LICENSE.
