CLI-oriented pipeline that:

- converts WAV/MP3 to mono 44.1 kHz and peak-normalizes to -1 dBFS;
- VAD-chunks into 5–30 s segments (overlap presets 150/200/250 ms; default 200);
- runs Russian faster-whisper with punctuation and ITN (numbers→words; dates; times; distances; units: °C, %, km/h, km, m, cm, mm, kWh, mg/l, rpm, psi, l/100km, bar, dB, GHz, currency, kPa, lux, volts, amps);
- sentence-aligns and exports LJSpeech/F5-TTS artifacts under `./dataset`, with pipe-delimited `metadata.csv` lines `wavs/utt000001.wav|<text>` plus `dataset/audio|text`.

QC warns on (does not drop) clips outside 5–30 s.
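The peak-normalization step above can be sketched in a few lines. This is a minimal illustration on plain Python lists (the actual pipeline operates on torchaudio/librosa buffers); `normalize_peak` is an illustrative name, only the -1 dBFS target comes from the spec:

```python
def normalize_peak(samples, target_dbfs=-1.0):
    """Scale samples so the absolute peak sits at target_dbfs (0 dBFS == 1.0)."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_linear = 10 ** (target_dbfs / 20.0)  # -1 dBFS ~= 0.8913 linear
    gain = target_linear / peak
    return [s * gain for s in samples]

out = normalize_peak([0.1, -0.4, 0.25])
# the loudest sample (-0.4) is scaled up to 10**(-1/20) ~= 0.8913 in magnitude
```

The same gain is applied to every sample, so relative dynamics are preserved; only the absolute level changes.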
```bash
conda env create -f env/conda.yml
conda activate audio-dataset
```

ffmpeg is included in the conda environment. If you need a system install:
```bash
sudo apt update
sudo apt install -y ffmpeg
ffmpeg -version
```

TorchAudio 2.9+ removed the backend setters/getters (`get_audio_backend`, `set_audio_backend`, `list_audio_backends`); it now dispatches automatically to the available codecs. We load via `soundfile` first, then fall back to plain `torchaudio.load`.

If you hit a TorchCodec/`libtorchcodec` load error, ensure FFmpeg is present (e.g., `conda install -c conda-forge ffmpeg`) and matches the torch/torchcodec compatibility table.
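The soundfile-first loading order can be sketched as a small fallback helper. `load_audio` and the stand-in loaders below are illustrative (not the project's actual API); in the real pipeline the loaders would be `soundfile.read` and `torchaudio.load`, injected here so the logic is self-contained:

```python
def load_audio(path, loaders):
    """Try each (name, loader) pair in order; return the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:  # e.g. unsupported codec, TorchCodec load error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))

# Stand-in loaders: the first fails, the second succeeds.
def sf_read(path):
    raise ValueError("format not supported")

def ta_load(path):
    return ([0.0, 0.1], 44100)

backend, (wav, sr) = load_audio("clip.wav", [("soundfile", sf_read), ("torchaudio", ta_load)])
# backend == "torchaudio", sr == 44100
```

Collecting the per-loader errors means a total failure reports every backend's reason, not just the last one.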
```powershell
scoop install ffmpeg
ffmpeg -version
```

```powershell
choco install ffmpeg -y
ffmpeg -version
```

- OS: Linux (tested on Ubuntu 24.x); Windows may work with ffmpeg/torch installed.
- Python: 3.11 (matches `env/conda.yml`).
- GPU: NVIDIA RTX-class with CUDA 12.4 runtime; `large-v3` prefers ≥16 GB VRAM (an RTX 5090 is sufficient). For smaller GPUs use a smaller faster-whisper model.
- CPU/RAM: modern CPU; ≥16 GB RAM recommended for faster-whisper and resampling.
- ffmpeg available on PATH (provided by the conda env).
- Stable install with bundled cuDNN (CUDA 12.8 wheels):

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

- If you hit missing cuDNN symbols (e.g., `libcudnn_ops.so`), install cuDNN and export its path:

```bash
pip install nvidia-cudnn-cu12
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH
```

- RTX 5090 (sm_120) note: some stable cu128 builds may not ship sm_120 kernels; if so, the CLI will fall back to CPU for ASR/resampling. To try the GPU, install a cu128 nightly that lists sm_120 support; otherwise run with `--asr-device cpu --compute-type int8`.
- Vision is not required for this project; omit `torchvision` to avoid version-pinning issues.
- After upgrading the PyTorch stack, update ctranslate2 (the faster-whisper backend) so it picks up the new SM:

```bash
pip install --upgrade --pre ctranslate2
```

- If the GPU remains unsupported, the CLI will log it and automatically fall back to CPU for resampling and ASR.
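The fallback policy described above (use CUDA only when it is available and the installed wheel ships the GPU's SM) can be sketched as a pure decision function. `pick_asr_device` and its arguments are illustrative, not the CLI's internals; in real code the inputs would come from `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `torch.cuda.get_arch_list()`:

```python
def pick_asr_device(flag, cuda_available, device_sm, supported_sms):
    """Resolve --asr-device auto|cuda|cpu to a concrete device.

    device_sm is the GPU compute capability, e.g. (12, 0) for sm_120;
    supported_sms lists the architectures the installed torch build ships.
    Returns (device, reason) where reason explains any CPU fallback.
    """
    if flag == "cpu":
        return "cpu", None
    if not cuda_available:
        return "cpu", "CUDA not available"
    sm = f"sm_{device_sm[0]}{device_sm[1]}"
    if sm not in supported_sms:
        return "cpu", f"torch build lacks {sm}; falling back to CPU"
    return "cuda", None

device, reason = pick_asr_device("auto", True, (12, 0), ["sm_80", "sm_90"])
# → ("cpu", "torch build lacks sm_120; falling back to CPU")
```

Returning the reason alongside the device is what lets the CLI log *why* it fell back rather than failing silently.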
- Build the dataset end-to-end:

```bash
python -m scripts.cli build-dataset \
  --input /path/to/audio_dir \
  --output ./dataset \
  --model large-v3 \
  --resample-device gpu \
  --asr-device auto \
  --compute-type float16 \
  --overlap 200 \
  --min-sec 5 \
  --max-sec 30
```

  `--resample-device` is optional (default cpu); `--asr-device` accepts auto|cuda|cpu; on CPU use `--compute-type int8` for stability. (Inline `#` comments after a `\` line continuation would break the command, so the notes live here instead.)

- CPU-only fallback (avoids missing cuDNN):
```bash
python -m scripts.cli build-dataset \
  --input /path/to/audio_dir \
  --output ./dataset \
  --model large-v3 \
  --resample-device cpu \
  --asr-device cpu \
  --compute-type int8 \
  --overlap 200 \
  --min-sec 5 \
  --max-sec 30
```

- Convert only (audio prep):
```bash
python -m scripts.cli convert --input /path/to/audio_dir --output ./dataset
```

- Transcribe only:

```bash
python -m scripts.cli transcribe --input ./dataset/wavs --output ./dataset
```

- QC only:

```bash
python -m scripts.cli qc --input ./dataset
```

Outputs:

- `./dataset/wavs/*.wav` — mono 44.1 kHz, peak -1 dBFS, VAD-chunked.
- `./dataset/text/*.txt` — per-utterance text (always emitted, even if alignment returns no spans).
- `./dataset/metadata.csv` — pipe-delimited lines `wavs/utt000001.wav|<text>` (always written when chunks are processed).
Example lines:
```
wavs/utt000001.wav|привет мир
wavs/utt000002.wav|сегодня минус пять градусов цельсия
```
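Reading the manifest back is a one-pipe split per line. A minimal sketch, where `read_manifest` is an illustrative helper rather than part of the CLI:

```python
def read_manifest(lines):
    """Parse metadata.csv lines of the form 'wavs/<id>.wav|<text>'.

    Split on the first pipe only, so a pipe inside the text survives.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path, text = line.split("|", 1)
        rows.append((path, text))
    return rows

rows = read_manifest([
    "wavs/utt000001.wav|привет мир",
    "wavs/utt000002.wav|сегодня минус пять градусов цельсия",
])
# rows[0] == ("wavs/utt000001.wav", "привет мир")
```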
- SR 44.1 kHz mono; peak -1 dBFS.
- VAD chunk 5–30 s; overlap presets 150/200/250 ms (default 200).
- ASR: faster-whisper RU `large-v3`, compute_type float16, beam_size 5.
- ASR device flag: `--asr-device auto|cuda|cpu` (default auto). If CPU is used, prefer `--compute-type int8`.
- ITN units: °C, %, km/h, km, m, cm, mm, kWh, mg/l, rpm, psi, l/100km, bar, dB, GHz, currency, kPa, lux, volts, amps.
- QC: warn-only for clips outside 5–30 s; true-peak guard -0.5 dB (check only).
- Manifest: pipe, relative paths
wavs/<id>.wav|<text>, IDs likeutt000001(zero_pad=6).
- Cache faster-whisper models under
~/.cache/faster-whisperunless overridden. - Punctuation preserved from ASR; ITN converts numbers/units to spoken Russian forms.
- Uses only free/local tooling (ffmpeg, silero-vad, faster-whisper, torchaudio/librosa/pydub, soundfile, num2words, razdel).
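The ID scheme and manifest line shape from the spec above can be sketched as follows; `make_manifest_line` is an illustrative name, while the `utt` prefix, zero_pad=6, and pipe layout come from the spec:

```python
def make_manifest_line(index, text, prefix="utt", zero_pad=6):
    """Build one metadata.csv line: relative wav path, pipe, transcript."""
    utt_id = f"{prefix}{index:0{zero_pad}d}"  # e.g. 1 -> utt000001
    return f"wavs/{utt_id}.wav|{text}"

line = make_manifest_line(1, "привет мир")
# → "wavs/utt000001.wav|привет мир"
```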
- `build-dataset` now prints flushed progress (config summary, ASR model load, per-file and per-chunk steps) so you can see it is running.
- Fails fast if the input path has no supported audio files.
- QC peak guard now uses proper dB math, reducing false peak warnings.
- Optional GPU resampling: `--resample-device gpu` (falls back to CPU if CUDA is unavailable); the default remains CPU.
- ASR device selection checks CUDA arch support; if the installed torch build lacks your GPU's SM, it falls back to CPU and logs why.
- Added `--asr-device` flag (auto|cuda|cpu). When running ASR on CPU, use `--compute-type int8` to avoid float16 crashes. Documented a CPU-only run example.
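The "proper dB math" behind the peak guard amounts to converting the linear peak amplitude to dBFS before comparing against the -0.5 dB threshold. A hedged sketch of the warn-only QC checks, with illustrative function names:

```python
import math

def peak_dbfs(samples):
    """Linear peak (0 dBFS == 1.0) -> dBFS; silence maps to -inf."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def qc_warnings(samples, duration_sec, peak_guard_db=-0.5, min_sec=5.0, max_sec=30.0):
    """Warn-only checks: nothing is dropped, warnings are just reported."""
    warnings = []
    level = peak_dbfs(samples)
    if level > peak_guard_db:
        warnings.append(f"peak {level:.2f} dBFS exceeds {peak_guard_db} dB guard")
    if not (min_sec <= duration_sec <= max_sec):
        warnings.append(f"duration {duration_sec:.1f}s outside {min_sec}-{max_sec}s")
    return warnings

# A clip normalized to -1 dBFS with a 10 s duration passes both checks.
assert qc_warnings([10 ** (-1 / 20)], 10.0) == []
```

Comparing in dB rather than raw linear amplitude is what removed the false peak warnings: a -1 dBFS peak (~0.891 linear) is clearly below the -0.5 dB guard once both are on the same scale.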