This is a repository for the paper, VoiceLDM: Text-to-Speech with Environmental Context, ICASSP 2024.
VoiceLDM extends text-to-audio models so that they are also capable of generating linguistically intelligible speech.
[2024/05 Update] I have now added the code for training VoiceLDM! Refer to Training for more details.
```
pip install git+https://github.com/glory20h/VoiceLDM.git
```

OR

```
git clone https://github.com/glory20h/VoiceLDM.git
cd VoiceLDM
pip install -e .
```

- Generate audio with a description prompt and a content prompt:

```
python generate.py --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
```

- Generate audio with an audio prompt and a content prompt:

```
python generate.py --audio_prompt "whispering.wav" --cont_prompt "Good morning! How are you feeling today?"
```

- Text-to-Speech example:

```
python generate.py --desc_prompt "clean speech" --cont_prompt "Good morning! How are you feeling today?" --desc_guidance_scale 1 --cont_guidance_scale 9
```

- Text-to-Audio example:

```
python generate.py --desc_prompt "trumpet" --cont_prompt "_" --desc_guidance_scale 9 --cont_guidance_scale 1
```

Generated audios will be saved in the default output folder, `./outputs`.
It's crucial to appropriately adjust the weights for dual classifier-free guidance. We find that this adjustment greatly influences the likelihood of obtaining satisfactory results. Here are some key tips:
- Some weight settings are more effective for different prompts. Experiment with the weights and find the combination that suits your specific use case.
- Starting with 7 for both `desc_guidance_scale` and `cont_guidance_scale` is a good starting point.
- If you feel that the generated audio doesn't align well with the provided content prompt, try decreasing the `desc_guidance_scale` and increasing the `cont_guidance_scale`.
- If you feel that the generated audio doesn't align well with the provided description prompt, try decreasing the `cont_guidance_scale` and increasing the `desc_guidance_scale`.
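Because the best pair of scales depends on the prompts, it can help to sweep a few combinations and compare the results by ear. Below is a minimal, hypothetical sketch that simply wraps the `generate.py` command line shown above; the scale pairs are only example values, not recommendations from the paper.

```python
# Hypothetical sweep over guidance-scale pairs, wrapping the generate.py CLI above.
import subprocess

DESC = "She is talking in a park."
CONT = "Good morning! How are you feeling today?"

# (desc_guidance_scale, cont_guidance_scale): 7/7 is the suggested starting
# point; the other pairs lean toward the description or the content prompt.
scale_pairs = [(7, 7), (9, 3), (3, 9)]

for desc_scale, cont_scale in scale_pairs:
    subprocess.run(
        [
            "python", "generate.py",
            "--desc_prompt", DESC,
            "--cont_prompt", CONT,
            "--desc_guidance_scale", str(desc_scale),
            "--cont_guidance_scale", str(cont_scale),
        ],
        check=True,
    )
    # Results land in the default output folder ./outputs; check or rename
    # them between runs, since the output naming scheme is not specified here.
```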
View the full list of options with the following command:

```
python generate.py -h
```

The CSV files for the processed dataset used to train VoiceLDM can be found here. These files include the transcriptions generated using the Whisper model.
- `as_speech_en.csv` (English speech segments from AudioSet)
- `cv1.csv`, `cv2.csv` (English speech segments from CommonVoice 13.0 en; split into two files to meet the file size limitations on GitHub)
- `voxceleb.csv` (English speech segments from VoxCeleb1)
- `as_noise.csv` (Non-speech segments from AudioSet)
- `noise_demand.csv` (Non-speech segments from DEMAND)
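For a quick look at what these CSV files contain before training, a small pandas snippet is enough. This is only a sketch: aside from the `file_path` column referenced in the training configuration below, the exact column layout is not documented here, so print the header and inspect it yourself.

```python
# Peek at the processed dataset CSVs; column names other than file_path
# are not documented here, so this just prints whatever each file contains.
import pandas as pd

csv_files = [
    "as_speech_en.csv",
    "cv1.csv",
    "cv2.csv",
    "voxceleb.csv",
    "as_noise.csv",
    "noise_demand.csv",
]

for name in csv_files:
    df = pd.read_csv(name)
    print(f"{name}: {len(df)} rows, columns = {list(df.columns)}")
    print(df.head(3), end="\n\n")  # a few example rows as a sanity check
```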
If you wish to train the model by yourself, follow these steps:
1. Configuration Setup (the trickiest part):
   - Navigate to the `configs` folder to find the necessary configuration files. For example, `VoiceLDM-M.yaml` is used for training the VoiceLDM-M model in the paper.
   - Prepare the CSV files used for training. You can download them here.
   - Examine the YAML file and adjust `"paths"` and `"noise_paths"` to the root path of your dataset. Also, take a look at the CSV files and ensure that the `file_path` entries match the actual file paths in your dataset (see the path-check sketch after these steps).
   - Update the paths for `cv_csv_path1`, `cv_csv_path2`, `as_speech_en_csv_path`, `voxceleb_csv_path`, `as_noise_csv_path`, and `noise_demand_csv_path` in the YAML file. You may leave a path blank if you do not wish to use the corresponding CSV file and training data.
   - You may also adjust other parameters, such as the batch size, according to your system's capabilities.
2. Configure Hugging Face Accelerate:
   - Set up Accelerate by running:

     ```
     accelerate config
     ```

     This will allow support for CPU, single-GPU, and multi-GPU training. Follow the on-screen instructions to configure your hardware settings.
3. Start Training:
   - Launch the training process with the following example command:

     ```
     accelerate launch train.py --config configs/VoiceLDM-M.yaml
     ```

   - Training checkpoints will be automatically saved in the `results` folder.
4. Running Inference:
   - Once training is complete, you can perform inference using the trained model by specifying the checkpoint path. For example:

     ```
     python generate.py --ckpt_path results/VoiceLDM-M/checkpoints/checkpoint_49/pytorch_model.bin --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
     ```
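As noted in step 1, mismatches between the `file_path` entries in the CSV files and the files on disk are the easiest way to break a training run. The sketch below is a hypothetical sanity check under two assumptions: that `DATA_ROOT` points to the same root you set under `"paths"`/`"noise_paths"` in the YAML, and that the `file_path` entries are relative to that root (if they are absolute, drop `DATA_ROOT`).

```python
# Hypothetical sanity check: confirm that CSV file_path entries exist on disk.
from pathlib import Path

import pandas as pd

# Assumption: the same root path you set under "paths"/"noise_paths" in the YAML.
DATA_ROOT = Path("/path/to/your/dataset")
# Assumption: list only the CSV files you actually enabled in the config.
csv_files = ["as_speech_en.csv", "voxceleb.csv", "as_noise.csv"]

for name in csv_files:
    df = pd.read_csv(name)
    # Assumption: file_path is relative to DATA_ROOT; adjust if it is absolute.
    missing = [p for p in df["file_path"] if not (DATA_ROOT / p).is_file()]
    print(f"{name}: {len(missing)} of {len(df)} files missing")
    for p in missing[:5]:  # show a few examples, if any
        print("  missing:", DATA_ROOT / p)
```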
This work would not have been possible without the following repositories:
