Text-to-Speech · English

hexgrad committed commit `9e17464` (1 parent: `70bb171`)

Upload README.md

Files changed (1): README.md (+48 −29)
README.md CHANGED
@@ -8,42 +8,27 @@ pipeline_tag: text-to-speech
  ---
  🚨 **This repository is undergoing maintenance.**
 
- Model v1.0 release is underway! Things are not yet finalized, but you can start [using v1.0 now](https://huggingface.co/hexgrad/Kokoro-82M#usage).
 
- You can now [`pip install kokoro`](https://pypi.org/project/kokoro/), a dedicated inference library: https://github.com/hexgrad/kokoro
-
- ✨ You can also [`pip install misaki`](https://pypi.org/project/misaki/), a G2P library designed for Kokoro: https://github.com/hexgrad/misaki
-
- ♻️ You can access old files for v0.19 at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19
 
  ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
 
- ### Kokoro is getting an upgrade!
-
- | Model | Published | Training Data | Compute (A100 80GB) | Released Voices | Released Langs |
- | ----- | --------- | ------------- | ------------------- | --------------- | -------------- |
- | v0.19 | 2024 Dec 25 | <100 hrs | 500 hrs @ $400 | 10 | 1 |
- | **v1.0** | 2025 Jan 27 | Few hundred hrs | 1000 hrs @ $1000 | [31+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | [3+](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) |
 
- Training is continuous. The v0.19 model was produced "on the way" to the v1.0 model, so the compute footprints overlap.
-
- ### Voices and Languages
-
- Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
- - Subjectively, voices will sound better or worse to different people.
- - Objectively, having less training data for a given voice (minutes instead of hours) lowers inference quality.
- - Objectively, poor audio quality in training data (compression, sample rate, artifacts) lowers inference quality.
- - Objectively, text-audio misalignment (too much text, i.e. hallucinations, or not enough text, i.e. failed transcriptions) lowers inference quality.
-
- Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are represented by only a small handful of voices, or even just one (French).
-
- Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
- - **Weakness** on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
- - **Rushing** on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the `speed` parameter to mitigate this.
 
  ### Usage
 
- The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
  ```py
  # 1️⃣ Install kokoro
  !pip install -q kokoro>=0.2.3 soundfile
@@ -85,6 +70,33 @@ for i, (gs, ps, audio) in enumerate(generator):
  sf.write(f'{i}.wav', audio, 24000) # save each audio file
  ```
 
  ### Model Facts
 
  **Architecture:**
@@ -96,7 +108,7 @@ for i, (gs, ps, audio) in enumerate(generator):
 
  **Trained by**: `@rzvzn` on Discord
 
- **Supported Languages:** American English, British English
 
  **Model SHA256 Hash:** `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
 
@@ -122,4 +134,11 @@ The following CC BY audio was part of the dataset used to train Kokoro v1.0.
  | [Koniwa](https://github.com/koniwa/koniwa) `tnc` | <1h | [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/deed.ja) | v0.19 / 22 Nov 2024 |
  | [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) | <11h | [CC BY 4.0](https://datashare.ed.ac.uk/bitstream/handle/10283/2353/license_text) | v0.19 / 22 Nov 2024 |
 
  <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
 
  ---
  🚨 **This repository is undergoing maintenance.**
 
+ ✨ v1.0 release is underway! Things are not yet finalized, but you can start [using v1.0 now](https://huggingface.co/hexgrad/Kokoro-82M#usage).
 
+ ♻️ Old v0.19 files: https://hf.co/hexgrad/kLegacy/tree/main/v0.19
 
  ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
 
+ **Kokoro** is a multilingual TTS model with 82 million parameters.
 
+ - [Usage](https://huggingface.co/hexgrad/Kokoro-82M#usage)
+ - [Releases](https://huggingface.co/hexgrad/Kokoro-82M#releases)
+ - [Voices and Languages](https://huggingface.co/hexgrad/Kokoro-82M#voices-and-languages)
+ - [Model Facts](https://huggingface.co/hexgrad/Kokoro-82M#model-facts)
+ - [Training Details](https://huggingface.co/hexgrad/Kokoro-82M#training-details)
+ - [Creative Commons Attribution](https://huggingface.co/hexgrad/Kokoro-82M#creative-commons-attribution)
+ - [Acknowledgements](https://huggingface.co/hexgrad/Kokoro-82M#acknowledgements)
 
  ### Usage
 
+ [`pip install kokoro`](https://pypi.org/project/kokoro/) installs the inference library at https://github.com/hexgrad/kokoro
+
+ You can run this cell on [Google Colab](https://colab.research.google.com/).
  ```py
  # 1️⃣ Install kokoro
  !pip install -q kokoro>=0.2.3 soundfile
  # ... (unchanged lines hidden)
  sf.write(f'{i}.wav', audio, 24000) # save each audio file
  ```
 
+ Behind the scenes, the `kokoro` library uses [`misaki`](https://pypi.org/project/misaki/), a G2P library: https://github.com/hexgrad/misaki
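G2P (grapheme-to-phoneme) conversion turns written text into phoneme sequences before the model synthesizes audio. As a toy illustration of the idea only, with an invented lexicon and output format, not misaki's actual API:

```python
# Toy grapheme-to-phoneme lookup illustrating what a G2P stage does:
# map written words to phoneme strings before the TTS model sees them.
# These ARPAbet-style entries are invented for illustration.
LEXICON = {
    "kokoro": "k oh k oh r oh",
    "hello":  "h ah l ow",
    "world":  "w er l d",
}

def toy_g2p(text: str) -> str:
    words = text.lower().split()
    # Fall back to spelling unknown words out letter by letter.
    return " | ".join(LEXICON.get(w, " ".join(w)) for w in words)
```

A real G2P library additionally handles numbers, abbreviations, stress, and language-specific rules, which is why Kokoro delegates this step to `misaki`.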
+
+ ### Releases
+
+ | Model | Published | Training Data | Compute (A100 80GB) | Released Langs & Voices | SHA256 |
+ | ----- | --------- | ------------- | ------------------- | ----------------------- | ------ |
+ | **v1.0** | 2025 Jan 27 | Few hundred hrs | $1000 for 1000 hrs | [3 & 31](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | `496dba11` |
+ | [v0.19](https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19) | 2024 Dec 25 | <100 hrs | $400 for 500 hrs | 1 & 10 | `3b0c392f` |
+
+ Training is continuous, so the compute footprints overlap.
+
+ v0.19 is now deprecated. You can access the old v0.19 files at https://hf.co/hexgrad/kLegacy/tree/main/v0.19
+
+ ### Voices and Languages
+
+ Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
+ - Subjectively, voices will sound better or worse to different people.
+ - Objectively, having less training data for a given voice (minutes instead of hours) lowers inference quality.
+ - Objectively, poor audio quality in training data (compression, sample rate, artifacts) lowers inference quality.
+ - Objectively, text-audio misalignment (too much text, i.e. hallucinations, or not enough text, i.e. failed transcriptions) lowers inference quality.
+
+ Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are represented by only a small handful of voices, or even just one (French).
+
+ Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
+ - **Weakness** on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
+ - **Rushing** on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the `speed` parameter to mitigate this.
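The two mitigations above (bundling short utterances, chunking long ones) can be sketched as a simple pre-processing pass. This is a hypothetical helper, not part of the `kokoro` API, and the token counter is a crude stand-in for the model's real tokenizer:

```python
import re

# Illustrative budget based on the text above: aim for the
# 100-200 token "goldilocks range", well under the ~500 ceiling.
TARGET_TOKENS = 200

def rough_token_count(text: str) -> int:
    # Crude stand-in for the real tokenizer: count word-like pieces.
    return len(re.findall(r"\S+", text))

def bundle_utterances(utterances: list[str], target: int = TARGET_TOKENS) -> list[str]:
    """Greedily merge short utterances so each chunk lands near the
    sweet spot instead of being synthesized alone or all at once."""
    chunks, current = [], ""
    for u in utterances:
        candidate = f"{current} {u}".strip()
        if rough_token_count(candidate) <= target:
            current = candidate          # still within budget: keep bundling
        else:
            if current:
                chunks.append(current)   # flush the finished chunk
            current = u
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be sent through the pipeline separately; nudging the `speed` parameter down for chunks that are still long is the other mitigation the text mentions.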
+
  ### Model Facts
 
  **Architecture:**
  ...
 
  **Trained by**: `@rzvzn` on Discord
 
+ **Languages:** American English, British English, French, Hindi
 
  **Model SHA256 Hash:** `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
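Downloaded weights can be checked against the published hash. A minimal sketch; the filename `kokoro-v1_0.pth` is an assumption about the repo layout, not something this page confirms:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints
    never need to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the hash published above (filename is an assumed example):
# sha256_of("kokoro-v1_0.pth") == "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"
```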
 
  ...
 
  | [Koniwa](https://github.com/koniwa/koniwa) `tnc` | <1h | [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/deed.ja) | v0.19 / 22 Nov 2024 |
  | [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) | <11h | [CC BY 4.0](https://datashare.ed.ac.uk/bitstream/handle/10283/2353/license_text) | v0.19 / 22 Nov 2024 |
 
+ ### Acknowledgements
+ - [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2.
+ - [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena.
+ - Thank you to everyone who contributed synthetic training data.
+ - Special thanks to those who donated compute.
+ - Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an [AI in the Terminator franchise](https://terminator.fandom.com/wiki/Kokoro).
+
  <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />