Language-agnostic framework for generating and evaluating speech corpora with maximal phoneme coverage.
corpusgen helps you build phonetically-balanced text corpora for speech synthesis (TTS), speech recognition (ASR), and clinical speech assessment — in any language.
- Evaluate any text corpus for phoneme, diphone, or triphone coverage
- PHOIBLE integration — phoneme inventories for 2,186 languages (3,020 inventories)
- Grapheme-to-phoneme via espeak-ng for 100+ languages
- Espeak ↔ PHOIBLE mapping — seamless bridge between G2P and phonological databases
- Distribution quality metrics — Shannon entropy, normalized entropy, JSD (vs uniform or reference), Pearson correlation, coefficient of variation, PCD composite score
- Coverage trajectory tracking — step-by-step coverage saturation curves for any selection or generation result
- Text quality metrics — sentence length stats, vocabulary diversity (TTR, hapax ratio), Flesch readability scores
- Error rate metrics — WER, CER, PER, SER with per-sentence breakdowns and corpus-level micro-averaging
- Corpus-level perplexity — batched LM perplexity via GPT-2 (or any causal LM), both token-weighted corpus perplexity and sentence-weighted mean, with model sharing support
- Structured reports — three verbosity levels, JSON export, JSON-LD-EX compatibility
- 40-language test suite — validated across 12 language families
- 6 selection algorithms for corpus optimization:
  - Greedy Set Cover — ln(n)+1 approximation, the standard workhorse
  - CELF — lazy evaluation speedup, identical results up to 700× faster
  - Stochastic Greedy — (1-1/e-ε) approximation, scales to massive corpora
  - ILP — exact optimal solutions via Integer Linear Programming (ground truth)
  - Distribution-Aware — KL-divergence minimization for frequency matching
  - NSGA-II — multi-objective Pareto optimization (coverage × cost × distribution)
- Phoneme weighting — uniform, frequency-inverse, and linguistic class strategies
- Phon-CTG generation framework — orchestrated corpus generation with pluggable backends:
  - Repository backend — select from sentence pools (pre-phonemized, raw text, or HuggingFace datasets)
  - LLM API backend — generate targeted sentences via OpenAI/Anthropic/Ollama (BYO API key)
  - Local model backend — HuggingFace transformers with CUDA auto-detect and 4-bit/8-bit quantization
- Phon-DATG — inference-time logit steering for phonetically-targeted local generation
- Phon-RL — PPO-based policy fine-tuning with composite phonetic reward (custom implementation, no trl dependency)
- Built-in scorers — n-gram phonotactic naturalness + LM perplexity fluency scoring
- CLI — `corpusgen evaluate`, `corpusgen select`, `corpusgen inventory`, `corpusgen generate` from the command line
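The greedy workhorse is easy to picture. Here is a standalone sketch of the idea (illustrative only, not corpusgen's implementation; use `corpusgen.select_sentences` for real work): repeatedly take the sentence whose phoneme set covers the most still-missing targets, stopping once no candidate adds anything new.

```python
def greedy_cover(candidates, targets):
    """Greedily pick candidate keys whose phoneme sets cover the most
    still-missing target phonemes. Returns (chosen keys, uncovered targets)."""
    missing = set(targets)
    chosen = []
    while missing:
        # Best candidate = largest marginal gain against what is still missing
        best = max(candidates, key=lambda k: len(candidates[k] & missing))
        if not candidates[best] & missing:
            break  # no candidate covers anything new; stop early
        chosen.append(best)
        missing -= candidates[best]
    return chosen, missing

# Toy pool: sentence -> phoneme set (hand-made for illustration)
pool = {
    "pat a cab": {"p", "æ", "t", "k", "b"},
    "see the dog": {"s", "iː", "ð", "ə", "d", "ɒ", "ɡ"},
    "my fun": {"m", "aɪ", "f", "ʌ", "n"},
}
targets = {"p", "t", "k", "s", "d", "m", "n", "f"}
chosen, still_missing = greedy_cover(pool, targets)
print(chosen, still_missing)
```

CELF speeds up exactly this loop by caching each candidate's last marginal gain and re-evaluating lazily, which is why it returns identical selections.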
corpusgen uses espeak-ng for grapheme-to-phoneme conversion. Install it before using corpusgen.
**Windows**

1. Download the latest `.msi` installer from espeak-ng releases
2. Run the installer (default path: `C:\Program Files\eSpeak NG\`)
3. Set the environment variable so Python can find the shared library:

   ```powershell
   [Environment]::SetEnvironmentVariable("PHONEMIZER_ESPEAK_LIBRARY", "C:\Program Files\eSpeak NG\libespeak-ng.dll", "User")
   ```

4. Restart your terminal and verify:

   ```
   espeak-ng --version
   ```

**macOS**

```
brew install espeak-ng
```

**Linux (Ubuntu/Debian)**

```
sudo apt-get update && sudo apt-get install -y espeak-ng
```

**Docker / CI**

```dockerfile
RUN apt-get update && apt-get install -y espeak-ng && rm -rf /var/lib/apt/lists/*
```

To use PHOIBLE phoneme inventories (2,186 languages), download the data on first use:
```python
from corpusgen.inventory import PhoibleDataset

PhoibleDataset().download()  # cached at ~/.corpusgen/phoible.csv (~24 MB)
```

This only needs to be done once.
```
pip install corpusgen
```

To install from source:

```
git clone https://github.com/jemsbhai/corpusgen.git
cd corpusgen
poetry install
poetry run pytest
```

For Phon-RL training and Phon-DATG logit steering with local models:
```bash
# 1. Install corpusgen with local model dependencies
poetry install --with local

# 2. IMPORTANT: Replace CPU torch with CUDA torch for GPU acceleration.
#    The default Poetry install pulls CPU-only torch from PyPI.
#    For NVIDIA GPUs (CUDA 12.1):
pip install torch --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# Verify GPU is available:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
```

Note: Check pytorch.org/get-started for the correct CUDA version matching your driver. Common options: `cu118`, `cu121`, `cu124`.
```python
import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps over the lazy dog.",
     "She sells seashells by the seashore.",
     "Pack my box with five dozen liquor jugs."],
    language="en-us",
    target_phonemes="phoible",
)
print(report.render())
print(report.coverage)          # 0.65
print(report.missing_phonemes)  # {'ʒ', 'ð', 'θ', ...}
```

Select an optimal subset from a candidate pool:

```python
import corpusgen

candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
    "How much wood would a woodchuck chuck?",
    "To be or not to be, that is the question.",
]
result = corpusgen.select_sentences(
    candidates,
    language="en-us",
    algorithm="greedy",  # or "celf", "stochastic", "ilp", "distribution", "nsga2"
)
print(f"Selected {result.num_selected} of {len(candidates)} sentences")
print(f"Coverage: {result.coverage:.1%}")
```

The fastest way to generate a coverage-targeted corpus is the CLI:
```bash
# Select best sentences from a pool for maximal phoneme coverage
corpusgen generate -b repository -l en-us --file pool.txt --max-sentences 50

# With multi-objective scoring (coverage + phonotactic naturalness)
corpusgen generate -b repository -l en-us --file pool.txt \
    --coverage-weight 0.7 --phonotactic-weight 0.3 --phonotactic-scorer ngram
```

Or use the Python API for full control:
```python
from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_ctg.scorer import PhoneticScorer
from corpusgen.generate.phon_ctg.loop import GenerationLoop, StoppingCriteria
from corpusgen.generate.backends.repository import RepositoryBackend
from corpusgen.g2p.manager import G2PManager

# 1. Phonemize a sentence pool
g2p = G2PManager()
sentences = ["The cat sat on the mat.", "Big dogs bark loudly.", ...]
results = g2p.phonemize_batch(sentences, language="en-us")
pool = [
    {"text": s, "phonemes": r.phonemes}
    for s, r in zip(sentences, results) if r.phonemes
]

# 2. Set up targets, scorer, and backend
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k", "g"],
    unit="phoneme",
)
scorer = PhoneticScorer(targets=targets, coverage_weight=1.0)
backend = RepositoryBackend(pool=pool)

# 3. Run the generation loop
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(
        target_coverage=0.9,
        max_sentences=20,
    ),
)
result = loop.run()
print(f"Generated {result.num_generated} sentences, coverage: {result.coverage:.1%}")
```

Via CLI:
```bash
# Requires: poetry install --with llm
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --max-sentences 20
```

Or Python API:
```python
from corpusgen.generate.backends.llm_api import LLMBackend

# Requires: poetry install --with llm
# Set your API key: export OPENAI_API_KEY=...
backend = LLMBackend(
    model="gpt-4o-mini",
    language="en-us",
)

# Use with the same GenerationLoop as above
loop = GenerationLoop(
    backend=backend,
    targets=targets,
    scorer=scorer,
    stopping_criteria=StoppingCriteria(target_coverage=0.9),
)
result = loop.run()
```

Fine-tune a local model with Phon-RL (PPO):

```python
from corpusgen.generate.phon_ctg.targets import PhoneticTargetInventory
from corpusgen.generate.phon_rl.reward import PhoneticReward
from corpusgen.generate.phon_rl.trainer import PhonRLTrainer, TrainingConfig

# Requires: poetry install --with local

# 1. Define targets and reward
targets = PhoneticTargetInventory(
    target_phonemes=["p", "b", "t", "d", "k"],
    unit="phoneme",
)
reward = PhoneticReward(targets=targets, coverage_weight=1.0)

# 2. Configure PPO training
config = TrainingConfig(
    model_name="gpt2",
    num_steps=100,
    learning_rate=1e-5,
    kl_coeff=0.1,
    use_peft=True,  # LoRA for parameter-efficient training
    peft_r=8,
    peft_alpha=16,
    device=None,  # auto-detect GPU
)

# 3. Train with dynamic prompts that adapt to coverage gaps
def make_prompt(targets):
    missing = targets.next_targets(5)
    return f"Write a sentence using these sounds: {', '.join(missing)}"

trainer = PhonRLTrainer(reward=reward, config=config)
result = trainer.train(prompt_fn=make_prompt)
print(f"Final coverage: {result.final_coverage:.1%}")
trainer.save_checkpoint("./phon_rl_checkpoint")
```

Query a PHOIBLE phoneme inventory:

```python
from corpusgen import get_inventory

inv = get_inventory("en-us")
print(inv.language_name)  # 'English'
print(inv.consonants)     # ['p', 'b', 't', 'd', 'k', ...]
print(inv.vowels)         # ['iː', 'ɪ', 'ɛ', 'æ', ...]

# Query by distinctive features
nasals = inv.segments_with_feature("nasal", "+")
print([s.phoneme for s in nasals])  # ['m', 'n', 'ŋ']
```

Evaluate diphone (or triphone) coverage:

```python
import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox jumps."],
    language="en-us",
    target_phonemes="phoible",
    unit="diphone",
)
print(f"Diphone coverage: {report.coverage:.1%}")
```

Export structured reports:

```python
import corpusgen

report = corpusgen.evaluate(
    ["The quick brown fox."],
    language="en-us",
    target_phonemes="phoible",
)

# JSON
print(report.to_json(indent=2))

# JSON-LD (linked data)
doc = report.to_jsonld_ex()

# Human-readable at different verbosity levels
from corpusgen.evaluate.report import Verbosity
print(report.render(verbosity=Verbosity.MINIMAL))
print(report.render(verbosity=Verbosity.NORMAL))
print(report.render(verbosity=Verbosity.VERBOSE))
```

Inspect distribution quality metrics:

```python
import corpusgen

report = corpusgen.evaluate(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    language="en-us",
    target_phonemes="phoible",
)

# Distribution metrics are auto-computed
dm = report.distribution
print(f"Normalized entropy: {dm.normalized_entropy:.4f}")  # 1.0 = perfectly uniform
print(f"JSD vs uniform: {dm.jsd_uniform:.6f}")             # 0.0 = perfectly uniform
print(f"PCD (uniform): {dm.pcd_uniform:.4f}")              # coverage × (1 - JSD)

# Compare against a natural language reference distribution
from corpusgen.evaluate.distribution import compute_distribution_metrics

reference = {"p": 0.04, "t": 0.07, "k": 0.03, "ə": 0.12}  # example frequencies
dm_ref = compute_distribution_metrics(
    report.phoneme_counts, report.target_phonemes, reference_distribution=reference
)
print(f"JSD vs reference: {dm_ref.jsd_reference:.6f}")
print(f"Pearson correlation: {dm_ref.pearson_correlation}")
```

Track coverage trajectories:

```python
from corpusgen.evaluate.trajectory import compute_coverage_trajectory

# From a SelectionResult
traj = compute_coverage_trajectory(
    [candidate_phonemes[i] for i in result.selected_indices],
    target_units=result.covered_units | result.missing_units,
    unit=result.unit,
)

# Easy plotting
import matplotlib.pyplot as plt
plt.plot(range(len(traj.coverages)), traj.coverages)
plt.xlabel("Sentences")
plt.ylabel("Coverage")
plt.title("Coverage Saturation Curve")
plt.show()

# Access marginal gains per sentence
print(traj.gains)  # [5, 3, 2, 1, 1, 0, ...]
```

Text quality metrics:

```python
import corpusgen

report = corpusgen.evaluate(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    language="en-us",
)

# Text quality metrics are auto-computed
tq = report.text_quality
print(f"Type-Token Ratio: {tq.type_token_ratio:.3f}")
print(f"Flesch Reading Ease: {tq.flesch_reading_ease:.1f}")
print(f"Avg sentence length: {tq.sentence_length_words_mean:.1f} words")
```

Corpus-level perplexity:

```python
from corpusgen.evaluate.perplexity import compute_corpus_perplexity

# Simple — loads GPT-2 automatically (requires: poetry install --with local)
metrics = compute_corpus_perplexity(
    ["The cat sat on the mat.", "Big dogs dig deep holes."],
    model_name="gpt2",
)
print(f"Corpus perplexity: {metrics.corpus_perplexity:.2f}")  # token-weighted (standard LM metric)
print(f"Mean perplexity: {metrics.mean_perplexity:.2f}")      # sentence-weighted
print(f"Median: {metrics.median_perplexity:.2f}")
print(f"Total tokens: {metrics.num_tokens}")

# Per-sentence breakdown
for i, ppl in enumerate(metrics.per_sentence):
    print(f"  Sentence {i}: PPL = {ppl:.2f}")

# Shared model — avoids loading the same model twice when you are
# also using PerplexityFluencyScorer during generation:
from corpusgen.generate.scorers.fluency import PerplexityFluencyScorer

scorer = PerplexityFluencyScorer(model_name="gpt2", device="cuda")
scorer("warm-up call to trigger lazy load")
metrics = compute_corpus_perplexity(
    sentences,
    model=scorer._model,
    tokenizer=scorer._tokenizer,
)
```

Error rate metrics:

```python
from corpusgen.evaluate.error_rates import compute_error_rates

result = compute_error_rates(
    references=["the cat sat on the mat", "big dogs dig deep holes"],
    hypotheses=["the cat sat on a mat", "big dog dig deep hole"],
)
print(f"WER: {result.wer:.2%}")  # corpus-level, micro-averaged
print(f"CER: {result.cer:.2%}")
print(f"SER: {result.ser:.2%}")

# With phoneme-level comparison
result = compute_error_rates(
    references=["the cat"],
    hypotheses=["a cat"],
    reference_phonemes=[["ð", "ə", "k", "æ", "t"]],
    hypothesis_phonemes=[["ə", "k", "æ", "t"]],
)
print(f"PER: {result.per:.2%}")

# Per-sentence breakdown
for d in result.details:
    print(f"  [{d.index}] WER={d.wer:.2%} CER={d.cer:.2%}")
```

CLI reference:

```bash
# Show PHOIBLE phoneme inventory for a language
corpusgen inventory --language en-us
corpusgen inventory --language fr-fr --format json
corpusgen inventory --language en-us --source upsid

# Evaluate a corpus for phoneme coverage
corpusgen evaluate "The cat sat on the mat." --language en-us
corpusgen evaluate --file corpus.txt --language en-us --target phoible
corpusgen evaluate --file corpus.txt -l en-us --unit diphone --format json
corpusgen evaluate --file corpus.txt -l en-us --verbosity verbose

# Select optimal sentences from a candidate pool
corpusgen select --file candidates.txt --language en-us
corpusgen select -f pool.txt -l en-us --algorithm celf --max-sentences 50
corpusgen select -f pool.txt -l en-us --target phoible --target-coverage 0.95
corpusgen select -f pool.txt -l en-us --output selected.txt --format json

# Generate sentences targeting phoneme coverage
# --- Repository backend (sentence pool) ---
corpusgen generate -b repository -l en-us --file pool.txt --max-sentences 50
corpusgen generate -b repository -l en-us --file pool.txt --unit diphone --format json
corpusgen generate -b repository -l en-us --file pool.txt --phonemes "ʃ,ʒ,θ" --weights "ʃ:2.0,θ:1.5"
corpusgen generate -b repository -l en-us --file pool.txt --output generated.txt

# --- Repository backend with HuggingFace dataset ---
corpusgen generate -b repository -l en-us --dataset wikitext --split train --max-samples 1000

# --- LLM API backend (requires API key) ---
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --max-sentences 20
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini --api-key sk-... --llm-temperature 0.9

# --- Local model backend (requires torch) ---
corpusgen generate -b local -l en-us --model gpt2 --device cuda --max-sentences 30
corpusgen generate -b local -l en-us --model gpt2 --quantization 4bit --local-temperature 0.7

# --- With built-in scorers (multi-objective candidate ranking) ---
corpusgen generate -b repository -l en-us --file pool.txt \
    --coverage-weight 0.6 \
    --phonotactic-weight 0.3 --phonotactic-scorer ngram \
    --fluency-weight 0.1 --fluency-scorer perplexity --fluency-model gpt2

# --- With corpus-trained phonotactic model ---
corpusgen generate -b repository -l en-us --file pool.txt \
    --phonotactic-weight 0.3 --phonotactic-scorer ngram \
    --phonotactic-corpus reference.txt --phonotactic-n 3

# --- With guidance strategies (local backend only) ---
corpusgen generate -b local -l en-us --model gpt2 --guidance datg --datg-boost 5.0
corpusgen generate -b local -l en-us --model gpt2 --guidance rl --rl-adapter-path ./checkpoint
corpusgen generate -b local -l en-us --model gpt2 --guidance datg --guidance-config datg.json

# --- Custom prompt templates ---
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini \
    --prompt-template "Write {k} English sentences containing: {target_units}"
corpusgen generate -b llm_api -l en-us --model openai/gpt-4o-mini \
    --prompt-template prompt.txt
```

Package layout:

```
corpusgen/
├── cli/                     # Command-line interface
│   ├── evaluate.py          # corpusgen evaluate
│   ├── generate.py          # corpusgen generate
│   ├── inventory.py         # corpusgen inventory
│   └── select.py            # corpusgen select
├── g2p/                     # Grapheme-to-phoneme conversion
│   ├── manager.py           # G2PManager — multi-backend G2P (espeak-ng)
│   └── result.py            # G2PResult — phonemes, diphones, triphones
├── coverage/
│   └── tracker.py           # CoverageTracker — phoneme/diphone/triphone tracking
├── evaluate/
│   ├── evaluate.py          # evaluate() — top-level API
│   ├── report.py            # EvaluationReport, Verbosity
│   ├── distribution.py      # DistributionMetrics — JSD, entropy, PCD, Pearson
│   ├── trajectory.py        # CoverageTrajectory — step-by-step saturation curves
│   ├── text_quality.py      # TextQualityMetrics — TTR, readability, sentence stats
│   ├── error_rates.py       # WER, CER, PER, SER with edit distance
│   └── perplexity.py        # Corpus-level perplexity (batched, GPU-accelerated)
├── inventory/
│   ├── models.py            # Segment (38 features), Inventory
│   ├── phoible.py           # PhoibleDataset — PHOIBLE loader/cache/query
│   └── mapping.py           # EspeakMapping — espeak ↔ ISO 639-3
├── select/
│   ├── greedy.py            # GreedySelector
│   ├── celf.py              # CELFSelector (lazy evaluation)
│   ├── stochastic.py        # StochasticGreedySelector
│   ├── ilp.py               # ILPSelector (exact, optional: pulp)
│   ├── distribution.py      # DistributionAwareSelector (KL-divergence)
│   └── nsga2.py             # NSGA2Selector (Pareto, optional: pymoo)
├── weights/                 # Phoneme weighting strategies
├── generate/
│   ├── phon_ctg/            # Orchestration framework
│   │   ├── targets.py       # PhoneticTargetInventory
│   │   ├── scorer.py        # PhoneticScorer (coverage + phonotactic + fluency)
│   │   ├── constraints.py   # PhonotacticConstraint ABC + N-gram model
│   │   └── loop.py          # GenerationLoop + StoppingCriteria
│   ├── scorers/             # Built-in scoring functions
│   │   ├── phonotactic.py   # NgramPhonotacticScorer (save/load, corpus-trained)
│   │   └── fluency.py       # PerplexityFluencyScorer (lazy LM, model sharing)
│   ├── phon_rl/             # RL-based guidance (PPO)
│   │   ├── reward.py        # PhoneticReward (composite, hierarchical)
│   │   ├── trainer.py       # PhonRLTrainer (custom PPO, no trl)
│   │   ├── policy.py        # PhonRLStrategy (GuidanceStrategy wrapper)
│   │   └── value_head.py    # ValueHead (nn.Module for GAE)
│   ├── phon_datg/           # Inference-time logit steering
│   │   ├── attribute_words.py  # Vocabulary phonemization + index
│   │   ├── modulator.py     # Additive logit modulation
│   │   └── graph.py         # DATGStrategy (GuidanceStrategy)
│   ├── guidance.py          # GuidanceStrategy ABC
│   └── backends/            # Pluggable generation engines
│       ├── repository.py    # Sentence pool selection + HuggingFace datasets
│       ├── llm_api.py       # Multi-provider LLM API (litellm)
│       └── local.py         # HuggingFace transformers + quantization
```
corpusgen supports any language available in both espeak-ng and PHOIBLE:
- G2P (espeak-ng): 100+ languages
- Inventories (PHOIBLE): 2,186 languages, 3,020 inventories, 8 sources
- Tested across: 40 languages, 12 language families, 10+ scripts
The espeak-to-PHOIBLE mapping covers 85+ languages with automatic macrolanguage resolution (e.g., ms → Standard Malay, sw → Swahili).
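The macrolanguage resolution above can be pictured as a small lookup from an espeak code to an ISO 639-3 individual-language code. This is a toy sketch, not corpusgen's actual `EspeakMapping` table; only the `ms` and `sw` entries mirror the examples in the text.

```python
# Toy sketch of espeak → ISO 639-3 macrolanguage resolution.
# Only "ms" and "sw" mirror the examples above; the table is otherwise illustrative.
MACRO_TO_INDIVIDUAL = {
    "ms": "zsm",  # Malay (macrolanguage) → Standard Malay
    "sw": "swh",  # Swahili (macrolanguage) → Swahili (individual language)
}

def resolve(espeak_code: str) -> str:
    """Resolve a macrolanguage code to an individual language, else pass through."""
    return MACRO_TO_INDIVIDUAL.get(espeak_code, espeak_code)

print(resolve("ms"))  # zsm
print(resolve("fr"))  # fr (no macrolanguage entry; passes through)
```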
For reproducible results across machines:

- Pin the corpusgen version in your dependency file
- Pin the espeak-ng version: record `espeak-ng --version` in experiment logs
- Use `poetry.lock`: pins all transitive dependencies
- Record the PHOIBLE version: note the download date of `~/.corpusgen/phoible.csv`
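One way to capture this in practice is a small log written next to your experiment outputs. A sketch: the log filename is arbitrary, and the fallbacks keep it working even when a tool is absent.

```shell
# Record tool versions alongside experiment outputs (log path is illustrative).
{
  echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "espeak-ng: $(espeak-ng --version 2>/dev/null || echo 'not installed')"
  echo "corpusgen: $(pip show corpusgen 2>/dev/null | grep ^Version || echo 'not installed')"
  ls -l ~/.corpusgen/phoible.csv 2>/dev/null || echo "phoible.csv: not downloaded"
} >> experiment_env.log
```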
If you use corpusgen in your research, please cite:
```bibtex
@software{corpusgen2026,
  title={corpusgen: Language-Agnostic Speech Corpus Generation with Maximal Phoneme Coverage},
  author={Syed, Muntaser},
  year={2026},
  doi={10.5281/zenodo.18881479},
  url={https://github.com/jemsbhai/corpusgen}
}
```

Apache 2.0 — see LICENSE.