RU Audio → Dataset Converter (Scaffold)

CLI-oriented pipeline that turns WAV/MP3 recordings into a Russian TTS dataset:

  • Convert to mono 44.1 kHz and peak-normalize to -1 dBFS.
  • VAD-chunk into 5–30 s segments (overlap presets 150/200/250 ms; default 200; sketched below).
  • Transcribe with Russian faster-whisper, preserving punctuation, and apply ITN (numbers→words; dates; times; distances; units: °C, %, km/h, km, m, cm, mm, kWh, mg/l, rpm, psi, l/100km, bar, dB, GHz, currency, kPa, lux, volts, amps).
  • Sentence-align and export LJSpeech/F5-TTS artifacts under ./dataset: a pipe-delimited metadata.csv with lines like wavs/utt000001.wav|<text>, plus per-utterance audio and text files.
  • QC warns about (but does not drop) clips outside 5–30 s.
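The VAD step relies on silero-vad. A minimal sketch of obtaining speech spans with the silero-vad package (the 5–30 s chunk merging with overlap is the pipeline's own logic and is not shown):

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("input.wav", sampling_rate=16000)  # silero VAD operates at 16 kHz
spans = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
# spans is a list of {'start': ..., 'end': ...} dicts in seconds; the pipeline
# merges them into 5-30 s chunks with the configured overlap (default 200 ms)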

Environment (Conda)

conda env create -f env/conda.yml
conda activate audio-dataset

ffmpeg is included in the conda environment. If you need a system install:

Ubuntu

sudo apt update
sudo apt install -y ffmpeg
ffmpeg -version

TorchAudio 2.9+: the backend setters/getters (get_audio_backend, set_audio_backend, list_audio_backends) were removed; TorchAudio now dispatches automatically to the available codecs. This project loads via soundfile first, then falls back to plain torchaudio.load.
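A minimal sketch of that loading order (illustrative, not the project's exact code):

import soundfile as sf
import torch

def load_audio(path):
    # Try soundfile first; fall back to torchaudio's automatic dispatch.
    try:
        data, sr = sf.read(path, always_2d=True)     # (frames, channels) float64
        return torch.from_numpy(data.T).float(), sr  # -> (channels, frames)
    except Exception:
        import torchaudio
        return torchaudio.load(path)                 # (channels, frames), sample rate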

If you hit a TorchCodec/libtorchcodec load error, ensure FFmpeg is present (e.g., conda install -c conda-forge ffmpeg) and matches the torch/torchcodec compatibility table.

Windows (scoop)

scoop install ffmpeg
ffmpeg -version

Windows (chocolatey)

choco install ffmpeg -y
ffmpeg -version

Minimum requirements

  • OS: Linux (tested Ubuntu 24.x); Windows may work with ffmpeg/torch installed.
  • Python: 3.11 (matches env/conda.yml).
  • GPU: NVIDIA RTX-class with CUDA 12.4 runtime; large-v3 prefers ≥16 GB VRAM (RTX 5090 is sufficient). For smaller GPUs use a smaller faster-whisper model.
  • CPU/RAM: modern CPU; ≥16 GB RAM recommended for faster-whisper and resampling.
  • ffmpeg available on PATH (provided by conda env).
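A quick sanity check for these prerequisites (standalone sketch, not part of the CLI):

import shutil
import torch

assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")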

RTX 50xx (e.g., RTX 5090) CUDA builds

  • Stable install with bundled cuDNN (CUDA 12.8 wheels; torchvision omitted, see below):
    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
  • If you hit missing cuDNN symbols (e.g., libcudnn_ops.so), install cuDNN and export its path:
    pip install nvidia-cudnn-cu12
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH
  • RTX 5090 (sm_120) note: some stable cu128 builds may not ship sm_120; if so, the CLI falls back to CPU for ASR/resampling. To try GPU, install a cu128 nightly that lists sm_120 support; otherwise run with --asr-device cpu --compute-type int8.
  • Vision is not required for this project; omit torchvision to avoid version-pinning issues.
  • After upgrading the PyTorch stack, update ctranslate2 (the faster-whisper backend) so it picks up the new SM:
    pip install --upgrade --pre ctranslate2
  • If the GPU remains unsupported, the CLI logs it and automatically falls back to CPU for resampling and ASR (a sketch of this check follows).
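That support check can be approximated as follows (a sketch; the CLI's actual logic may differ):

import torch

def cuda_arch_supported(device=0):
    # True only if this torch build ships kernels for the GPU's SM
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device)
    return f"sm_{major}{minor}" in torch.cuda.get_arch_list()

# An RTX 5090 reports capability (12, 0), so this looks for "sm_120"
print("GPU usable for ASR/resampling:", cuda_arch_supported())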

Usage (planned CLI)

  • Build dataset end-to-end:
python -m scripts.cli build-dataset \
  --input /path/to/audio_dir \
  --output ./dataset \
  --model large-v3 \
  --resample-device gpu \
  --asr-device auto \
  --compute-type float16 \
  --overlap 200 \
  --min-sec 5 \
  --max-sec 30

(--resample-device is optional and defaults to cpu; --asr-device accepts auto|cuda|cpu; on CPU prefer --compute-type int8 for stability.)
  • CPU-only fallback (avoids missing cuDNN):
python -m scripts.cli build-dataset \
  --input /path/to/audio_dir \
  --output ./dataset \
  --model large-v3 \
  --resample-device cpu \
  --asr-device cpu \
  --compute-type int8 \
  --overlap 200 \
  --min-sec 5 \
  --max-sec 30
  • Convert only (audio prep):
python -m scripts.cli convert --input /path/to/audio_dir --output ./dataset
  • Transcribe only:
python -m scripts.cli transcribe --input ./dataset/wavs --output ./dataset
  • QC only:
python -m scripts.cli qc --input ./dataset

Outputs

  • ./dataset/wavs/*.wav — mono 44.1 kHz, peak -1 dBFS, VAD-chunked.
  • ./dataset/text/*.txt — per-utterance text (always emitted, even if alignment returns no spans).
  • ./dataset/metadata.csv — pipe-delimited lines wavs/utt000001.wav|<text> (always written when chunks are processed).

Example lines:

wavs/utt000001.wav|привет мир
wavs/utt000002.wav|сегодня минус пять градусов цельсия
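Since paths in metadata.csv are relative to the dataset root, consuming it is straightforward (a sketch):

from pathlib import Path

dataset = Path("./dataset")
for line in (dataset / "metadata.csv").read_text(encoding="utf-8").splitlines():
    wav_rel, text = line.split("|", 1)  # pipe-delimited, no header row
    print(dataset / wav_rel, "->", text)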

Defaults (configs/default.yaml)

  • SR 44.1 kHz mono; peak -1 dBFS.
  • VAD chunk 5–30 s; overlap presets 150/200/250 ms (default 200).
  • ASR: faster-whisper RU large-v3, compute_type float16, beam_size 5.
  • ASR device flag: --asr-device auto|cuda|cpu (default auto). If CPU is used, prefer --compute-type int8.
  • ITN units: °C, %, km/h, km, m, cm, mm, kWh, mg/l, rpm, psi, l/100km, bar, dB, GHz, currency, kPa, lux, volts, amps.
  • QC: warn-only for clips outside 5–30 s; true-peak guard -0.5 dB (check only; dB math sketched after this list).
  • Manifest: pipe, relative paths wavs/<id>.wav|<text>, IDs like utt000001 (zero_pad=6).
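The peak guard's dB math, sketched (assuming the standard 20·log10 dBFS definition; the actual QC may additionally oversample to estimate true peak):

import numpy as np

def peak_dbfs(samples):
    # samples: float array scaled to [-1, 1]; returns sample-peak level in dBFS
    peak = float(np.max(np.abs(samples)))
    return 20.0 * np.log10(peak) if peak > 0 else float("-inf")

# QC warns (check only) when peak_dbfs(x) exceeds -0.5 dB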

Notes

  • Cache faster-whisper models under ~/.cache/faster-whisper unless overridden (see the sketch after this list).
  • Punctuation preserved from ASR; ITN converts numbers/units to spoken Russian forms.
  • Uses only free/local tooling (ffmpeg, silero-vad, faster-whisper, torchaudio/librosa/pydub, soundfile, num2words, razdel).
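A minimal transcription sketch consistent with these defaults (the keyword arguments mirror the CLI flags above; the input path is illustrative):

import os
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="auto",                 # matches --asr-device auto
    compute_type="float16",        # use "int8" when running on CPU
    download_root=os.path.expanduser("~/.cache/faster-whisper"),
)
segments, info = model.transcribe("dataset/wavs/utt000001.wav", language="ru", beam_size=5)
for seg in segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")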

Recent updates

  • build-dataset now prints flushed progress (config summary, ASR model load, per-file and per-chunk steps) so you can see it is running.
  • Fails fast if the input path has no supported audio files.
  • QC peak guard now uses proper dB math, reducing false peak warnings.
  • Optional GPU resampling: --resample-device gpu (falls back to CPU if CUDA unavailable); default remains CPU.
  • ASR device selection checks CUDA arch support; if the installed torch build lacks your GPU's SM it will fall back to CPU and log why.
  • Added --asr-device flag (auto|cuda|cpu). When running ASR on CPU, use --compute-type int8 to avoid float16 crashes. Documented CPU-only run example.
