SpeeChain: A PyTorch-based Machine Speech Chain Toolkit for ASR and TTS

SpeeChain is an open-source PyTorch-based speech and language processing toolkit initiated by the AHC lab at Nara Institute of Science and Technology (NAIST). It is designed to simplify research pipelines for the machine speech chain, i.e., the joint modeling of automatic speech recognition (ASR) and text-to-speech synthesis (TTS).

SpeeChain is currently in beta. Contributions to this toolkit are warmly welcome at any time!

If you find our toolkit helpful for your research, we sincerely hope that you will give us a star⭐! Whenever you encounter problems using the toolkit, please don't hesitate to open an issue!

Table of Contents

  1. Machine Speech Chain
  2. Toolkit Characteristics
  3. Get a Quick Start

Machine Speech Chain

  • Offline TTS→ASR Chain

Toolkit Characteristics

Below are the most important features of SpeeChain. For more details, see the DeepWiki, which is generated by AI (Devin).

  • Data Processing:
    • On-the-fly Log-Mel Spectrogram Extraction
    • On-the-fly SpecAugment
    • On-the-fly Feature Normalization
  • Model Training:
    • Multi-GPU Model Distribution based on torch.nn.parallel.DistributedDataParallel
    • Real-time status reporting by online TensorBoard and offline Matplotlib
    • Real-time learning dynamics visualization (attention visualization, spectrogram visualization)
  • Data Loading:
    • On-the-fly mixture of multiple datasets in a single dataloader
    • On-the-fly data selection for each dataloader to filter out undesired data samples
    • Multi-dataloader batch generation to form training batches from multiple datasets
  • Optimization:
    • Model training with multiple optimizers, each responsible for a specific subset of the model parameters
    • Gradient accumulation to approximate large-batch gradients using the gradients from several small batches
    • Easy-to-set fine-tuning factor to scale down the learning rates without modifying the scheduler configuration
  • Model Evaluation:
    • Multi-level .md evaluation reports (overall-level, group-level, and sample-level) without any layout misalignment
    • Histogram visualization for the distribution of evaluation metrics
    • Top-N bad-case analysis for better model diagnosis
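
The gradient accumulation feature listed above can be illustrated without any framework. The sketch below is not SpeeChain's implementation: the one-parameter linear model, the data, and the function names mse_grad and accumulated_grad are all hypothetical. It only shows that, for equal-sized micro-batches, summing the micro-batch gradients scaled by 1/accum_steps reproduces the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Sum the micro-batch gradients, each scaled by 1 / accum_steps."""
    # This sketch assumes the batch splits into equal-sized micro-batches.
    assert len(xs) % accum_steps == 0, "batch size must be divisible by accum_steps"
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro, (i + 1) * micro
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


# Hypothetical toy data, roughly following y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)                          # one batch of 8 samples
accum = accumulated_grad(w, xs, ys, accum_steps=4)  # four micro-batches of 2
print("gradients match:", abs(full - accum) < 1e-9)
```

In PyTorch the same idea is usually written by dividing each micro-batch loss by the number of accumulation steps before calling backward(), and calling the optimizer's step() and zero_grad() only once per accumulation cycle.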


Get a Quick Start

The simplest recipe is Mini LibriSpeech. It takes about 2 hours to train a model on a single GPU.

We recommend that you install Anaconda on your machine before using our toolkit. After installing Anaconda, follow the steps below to deploy the toolkit:

  1. Find a path with enough disk space (e.g., at least 500GB if you want to use the LibriSpeech or LibriTTS datasets).
  2. Clone the toolkit with git clone https://github.com/bagustris/SpeeChain.git.
  3. Go to the toolkit's root directory with cd SpeeChain.
  4. Run source envir_preparation.sh to build the environment for the SpeeChain toolkit. After execution, a virtual environment named speechain will be created, and two environment variables, SPEECHAIN_ROOT and SPEECHAIN_PYTHON, will be initialized in your ~/.bashrc.
    Note: The script must be run from the SpeeChain root directory, and with the source command rather than as ./envir_preparation.sh.
  5. Run conda activate speechain in your terminal to verify that the Conda environment was installed. If the speechain environment is not activated successfully, run conda env create -f environment.yaml, then conda activate speechain, and then pip install -e ./ to install it manually.
  6. Run echo ${SPEECHAIN_ROOT} and echo ${SPEECHAIN_PYTHON} in your terminal to check the environment variables. If either one is empty, add it to your ~/.bashrc manually with export SPEECHAIN_ROOT=xxx or export SPEECHAIN_PYTHON=xxx, and then apply the change with source ~/.bashrc.
    1. SPEECHAIN_ROOT should be the absolute path of the SpeeChain folder you have just cloned (i.e., /xxx/SpeeChain, where /xxx/ is the parent directory);
    2. SPEECHAIN_PYTHON should be the absolute path of the Python interpreter in the speechain environment folder (i.e., /xxx/anaconda3/envs/speechain/bin/python3.X, where /xxx/ is where your anaconda3 is installed and X depends on environment.yaml).
  7. Read the handbook and start your journey in SpeeChain!
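
The environment-variable check in step 6 can also be scripted. The helper below is a minimal sketch and is not part of SpeeChain: the function name check_speechain_env and its status strings are hypothetical. It assumes only what the steps above state, namely that SPEECHAIN_ROOT should point to a directory and SPEECHAIN_PYTHON to an interpreter file.

```python
import os


def check_speechain_env(env):
    """Return a status string for each variable the setup steps expect."""
    # SPEECHAIN_ROOT should be a directory; SPEECHAIN_PYTHON should be a file.
    checks = {
        "SPEECHAIN_ROOT": os.path.isdir,
        "SPEECHAIN_PYTHON": os.path.isfile,
    }
    report = {}
    for name, path_ok in checks.items():
        value = env.get(name, "")
        if not value:
            report[name] = "unset"
        elif not path_ok(value):
            report[name] = "path not found"
        else:
            report[name] = "ok"
    return report


# Inspect the variables of the current process.
for name, status in check_speechain_env(os.environ).items():
    print(f"{name}: {status}")
```

Run it from a new terminal after source ~/.bashrc. A status of "unset" suggests the export line is missing from ~/.bashrc, while "path not found" suggests the exported path is wrong.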

Citation

This repository is based on the original implementation at heli-qi/speechain. If you use this toolkit, please cite the reference below:

Qi, H., Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2023). SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain. http://arxiv.org/abs/2301.02966
