- download the dataset
- create a folder named "train_audio_<eng/ger/per>/<IEMOCAP/EMODB/SHEMO>"
- run Data_Processing.ipynb to reorganize the datasets into the following structure (a script sketch of this step follows the tree):
- train_audio_<eng/ger/per>
  - <IEMOCAP/EMODB/SHEMO>
    - emotion A
      - wav
        - audio A_1
        - audio A_2
    - emotion B
      - wav
        - audio B_1
        - audio B_2
    - ...
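If you prefer a script over the notebook, here is a minimal sketch of the reorganization step; the function name, the emotion_of lookup, and the source layout are assumptions, and Data_Processing.ipynb remains the authoritative version:

```python
# Minimal sketch: copy .wav files into <dst_root>/<emotion>/wav/<filename>.
# The emotion lookup is dataset-specific and is passed in by the caller.
import shutil
from pathlib import Path
from typing import Callable

def reorganize(src_root: str, dst_root: str,
               emotion_of: Callable[[Path], str]) -> None:
    for wav_path in Path(src_root).rglob("*.wav"):
        emotion = emotion_of(wav_path)               # e.g. "angry", "happy", ...
        target_dir = Path(dst_root) / emotion / "wav"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(wav_path, target_dir / wav_path.name)

# Hypothetical usage for EMODB, where the emotion code sits inside the file name:
# reorganize("EMODB/wav", "train_audio_ger/EMODB",
#            emotion_of=lambda p: {"W": "anger", "F": "happiness"}.get(p.stem[5], "other"))
```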
- run the command:
python encoder_preprocess.py ./train_audio_<eng/ger/per>
- The output folder is named SV2TTS; it can also be found in the drive.
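For example, for the English data:
python encoder_preprocess.py ./train_audio_eng
The SV2TTS folder is then created under ./train_audio_eng/, and its encoder subfolder is what the training step below takes as input.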
Note: To avoid losing data during processing, I changed partials_n_frames in encoder/params_data.py; I am currently not sure what else this affects.
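For reference, here is a minimal sketch of what that change could look like; the default of 160 frames and the lowered value below are assumptions based on the upstream encoder, not the exact values used here:

```python
# encoder/params_data.py (excerpt) -- illustrative sketch, not the actual edit

## Audio
sampling_rate = 16000

# Number of spectrogram frames in one partial utterance. Utterances shorter
# than this are skipped during preprocessing, so lowering it keeps short
# emotional clips in the data (80 is an assumed value; upstream default is 160).
partials_n_frames = 80
```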
- you will need to download the visdom package to visualize the training process, and start the visdom server before running the command.
- change the parameter speakers_per_batch in encoder/params_model.py so that it equals the number of emotion classes (see the sketch after this step); utterances_per_speaker can be tuned.
- run the command:
python encoder_train.py <my_run> <datasets_root>/SV2TTS/encoder
<my_run> is a session name and can be any name you choose. The training progress can be viewed on the visdom port.
visdom can be disabled using --no_visdom
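As an illustration of the parameter change mentioned above, here is a sketch of the relevant part of encoder/params_model.py; the class count of 4 and the other values are assumptions, so use the numbers that match your dataset:

```python
# encoder/params_model.py (excerpt) -- illustrative sketch

## Training parameters
learning_rate_init = 1e-4

# Each "speaker" is really one emotion class here, so this must equal the
# number of emotion classes in the dataset (4 is an assumed example).
speakers_per_batch = 4

# Number of utterances drawn per emotion class in each batch; tunable.
utterances_per_speaker = 10
```

With visdom running, a concrete training call (run name chosen arbitrarily) could be:
python encoder_train.py emotion_eng ./train_audio_eng/SV2TTS/encoder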
There is a pretrained model in the saved_models/default folder if you don't want to train from scratch.
- run the command:
python generate_embed.py
You can use the following options:
- -s: the path to the source audio (e.g. ./train_audio_eng/IEMOCAP)
- -d: the path to the destination embeddings
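For example, to embed the English IEMOCAP audio into an arbitrarily named output folder:
python generate_embed.py -s ./train_audio_eng/IEMOCAP -d ./embeddings_eng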