SpeeChain is an open-source PyTorch-based speech and language processing toolkit initiated by the AHC lab at the Nara Institute of Science and Technology (NAIST). The toolkit is designed to simplify the research pipeline for the machine speech chain, i.e., the joint model of automatic speech recognition (ASR) and text-to-speech synthesis (TTS).
SpeeChain is currently in beta. Contributions to this toolkit are warmly welcomed anywhere, anytime!
If you find our toolkit helpful for your research, we sincerely hope that you can give us a star ⭐! Whenever you encounter a problem when using our toolkit, please don't hesitate to open an issue!
Below are the most important features of SpeeChain. You can also check the DeepWiki for AI-generated (by Devin) details about SpeeChain.
- Machine Speech Chain:
  - Offline TTS→ASR Chain
- Data Processing (a minimal sketch follows this list):
  - On-the-fly Log-Mel Spectrogram Extraction
  - On-the-fly SpecAugment
  - On-the-fly Feature Normalization
- Model Training:
  - Multi-GPU model distribution based on torch.nn.parallel.DistributedDataParallel
  - Real-time status reporting via online TensorBoard and offline Matplotlib
  - Real-time learning-dynamics visualization (attention visualization, spectrogram visualization)
- Data Loading (see the second sketch after this list):
  - On-the-fly mixing of multiple datasets in a single dataloader
  - On-the-fly data selection in each dataloader to filter out undesired data samples
  - Multi-dataloader batch generation to form training batches from multiple datasets
- Optimization (see the third sketch after this list):
  - Model training with multiple optimizers, each responsible for a specific part of the model parameters
  - Gradient accumulation to mimic large-batch gradients with gradients from several small batches
  - An easy-to-set fine-tuning factor to scale down learning rates without modifying the scheduler configuration
- Model Evaluation:
  - Multi-level .md evaluation reports (overall-level, group-level, and sample-level) without any layout misplacement
  - Histogram visualization of the distribution of evaluation metrics
  - Top-N bad-case analysis for better model diagnosis
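For illustration, here is a minimal sketch of the on-the-fly feature pipeline, written with plain torchaudio rather than SpeeChain's internal classes; the parameter values (16 kHz audio, 80 Mel bins, mask sizes) are assumptions, not SpeeChain defaults:

```python
# A minimal sketch of on-the-fly log-Mel extraction, SpecAugment, and
# feature normalization using torchaudio (NOT SpeeChain's internal API).
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

def extract(waveform: torch.Tensor, train: bool = True) -> torch.Tensor:
    """(channel, time) waveform -> (channel, n_mels, frames) features."""
    feats = torch.log(mel(waveform) + 1e-10)   # log-Mel spectrogram
    if train:                                  # SpecAugment only during training
        feats = time_mask(freq_mask(feats))
    mean = feats.mean(dim=-1, keepdim=True)    # per-utterance normalization
    std = feats.std(dim=-1, keepdim=True)
    return (feats - mean) / (std + 1e-10)
```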
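The data-loading behavior can be pictured with standard PyTorch utilities. The sketch below is a conceptual illustration, not SpeeChain's actual dataloader classes; `keep` is a hypothetical predicate and the cycling policy is an assumption:

```python
# A conceptual sketch of on-the-fly data selection and multi-dataloader
# batching with plain PyTorch (NOT SpeeChain's actual loader classes).
import itertools
from torch.utils.data import DataLoader, Dataset, Subset

def select(dataset: Dataset, keep) -> Subset:
    # data selection: keep only samples passing the predicate,
    # e.g. drop overly long utterances
    return Subset(dataset, [i for i in range(len(dataset)) if keep(dataset[i])])

def multi_loader_batches(main: DataLoader, side: DataLoader):
    # multi-dataloader batch generation: one sub-batch per dataset at every
    # training step; the smaller `side` loader is cycled so that the epoch
    # length follows the `main` loader
    for main_batch, side_batch in zip(main, itertools.cycle(side)):
        yield main_batch, side_batch
```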
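Likewise, the optimization features map onto standard PyTorch idioms. In the sketch below, the module layout, learning rates, `accum_steps`, and dummy data are illustrative assumptions, not SpeeChain's trainer code:

```python
# An illustrative sketch of multiple optimizers plus gradient accumulation.
import torch

model = torch.nn.ModuleDict({"enc": torch.nn.Linear(80, 256),
                             "dec": torch.nn.Linear(256, 32)})
optimizers = [
    torch.optim.Adam(model["enc"].parameters(), lr=1e-3),
    torch.optim.Adam(model["dec"].parameters(), lr=1e-4),  # e.g. scaled down for fine-tuning
]
accum_steps = 4  # four small batches mimic one 4x-larger batch
batches = [(torch.randn(8, 80), torch.randn(8, 32)) for _ in range(8)]  # dummy data

for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model["dec"](model["enc"](x)), y)
    (loss / accum_steps).backward()  # scale so accumulated grads match a big-batch gradient
    if (step + 1) % accum_steps == 0:
        for opt in optimizers:       # each optimizer updates its own parameter subset
            opt.step()
            opt.zero_grad()
```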
👆Back to the table of contents
The simplest recipe is Mini LibriSpeech. It takes about 2 hours to train a model on a single GPU.
We recommend you first install Anaconda on your machine before using our toolkit. After the installation of Anaconda, please follow the steps below to deploy our toolkit on your machine:
- Find a path with enough disk space (e.g., at least 500 GB if you want to use the LibriSpeech or LibriTTS datasets).
- Clone our toolkit: `git clone https://github.com/bagustris/SpeeChain.git`.
- Go to the root path of our toolkit: `cd SpeeChain`.
- Run `source envir_preparation.sh` to build the environment for the SpeeChain toolkit. After execution, a virtual environment named `speechain` will be created, and two environment variables, `SPEECHAIN_ROOT` and `SPEECHAIN_PYTHON`, will be initialized in your `~/.bashrc`.
  Note: it must be executed in the root path `SpeeChain` and by the command `source` rather than `./envir_preparation.sh`.
- Run `conda activate speechain` in your terminal to verify the installation of the Conda environment. If the environment `speechain` is not successfully activated, please run `conda env create -f environment.yaml`, `conda activate speechain`, and `pip install -e ./` to install it manually.
- Run `echo ${SPEECHAIN_ROOT}` and `echo ${SPEECHAIN_PYTHON}` in your terminal to check the environment variables. If either one is empty, please add it to your `~/.bashrc` manually by `export SPEECHAIN_ROOT=xxx` or `export SPEECHAIN_PYTHON=xxx` and then activate it by `source ~/.bashrc`. `SPEECHAIN_ROOT` should be the absolute path of the `SpeeChain` folder you have just cloned (i.e., `/xxx/SpeeChain`, where `/xxx/` is the parent directory); `SPEECHAIN_PYTHON` should be the absolute path of the Python interpreter in the `speechain` environment (i.e., `/xxx/anaconda3/envs/speechain/bin/python3.X`, where `/xxx/` is where your `anaconda3` is placed and `X` depends on `environment.yaml`).
- Read the handbook and start your journey in SpeeChain!
The original implementation of this repository can be found at heli-qi/speechain. If you use this toolkit, please cite the reference below:
Qi, H., Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2023). SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain. http://arxiv.org/abs/2301.02966