---
🚨 **This repository is undergoing maintenance.**

✨ v1.0 release is underway! Things are not yet finalized, but you can start [using v1.0 now](https://huggingface.co/hexgrad/Kokoro-82M#usage).

♻️ Old v0.19 files: https://hf.co/hexgrad/kLegacy/tree/main/v0.19

❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy

**Kokoro** is a multilingual TTS model with 82 million parameters.

- [Usage](https://huggingface.co/hexgrad/Kokoro-82M#usage)
- [Releases](https://huggingface.co/hexgrad/Kokoro-82M#releases)
- [Voices and Languages](https://huggingface.co/hexgrad/Kokoro-82M#voices-and-languages)
- [Model Facts](https://huggingface.co/hexgrad/Kokoro-82M#model-facts)
- [Training Details](https://huggingface.co/hexgrad/Kokoro-82M#training-details)
- [Creative Commons Attribution](https://huggingface.co/hexgrad/Kokoro-82M#creative-commons-attribution)
- [Acknowledgements](https://huggingface.co/hexgrad/Kokoro-82M#acknowledgements)

### Usage

[`pip install kokoro`](https://pypi.org/project/kokoro/) installs the inference library at https://github.com/hexgrad/kokoro

You can run this cell on [Google Colab](https://colab.research.google.com/).
```py
# 1️⃣ Install kokoro
!pip install -q kokoro>=0.2.3 soundfile
# 2️⃣ Install espeak-ng, used as a G2P fallback
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

# 3️⃣ Initialize a pipeline (KPipeline is kokoro's documented entry point)
from kokoro import KPipeline
import soundfile as sf
pipeline = KPipeline(lang_code='a')  # 'a' => American English

# 4️⃣ Generate audio; the pipeline yields one segment per chunk of input text
generator = pipeline('Hello, this is Kokoro.', voice='af_bella', speed=1)
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)  # index, graphemes (input text), phonemes
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
```
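The loop above writes one audio file per generated segment. If a single output file is preferred, the segments can be concatenated first; a minimal sketch, assuming NumPy float arrays at Kokoro's 24 kHz sample rate (`concat_segments` is an illustrative helper, not part of the library):

```python
import numpy as np

def concat_segments(segments, pause_s=0.2, sample_rate=24000):
    """Join per-segment waveforms into one array, with a short pause between segments."""
    pause = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    parts = []
    for i, seg in enumerate(segments):
        if i > 0:
            parts.append(pause)  # insert silence between segments
        parts.append(np.asarray(seg, dtype=np.float32))
    return np.concatenate(parts)

# Example with dummy segments (real segments come from the generator loop above):
segments = [np.zeros(24000, dtype=np.float32), np.ones(12000, dtype=np.float32)]
full = concat_segments(segments)
# sf.write('full.wav', full, 24000)  # then save once instead of per-segment
```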

Behind the scenes, the `kokoro` library uses [`misaki`](https://pypi.org/project/misaki/), a G2P library: https://github.com/hexgrad/misaki

### Releases

| Model | Published | Training Data | Compute (A100 80GB) | Released Langs & Voices | SHA256 |
| ----- | --------- | ------------- | ------------------- | ----------------------- | ------ |
| **v1.0** | 2025 Jan 27 | Few hundred hrs | $1000 for 1000 hrs | [3 & 31](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md) | `496dba11` |
| [v0.19](https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19) | 2024 Dec 25 | <100 hrs | $400 for 500 hrs | 1 & 10 | `3b0c392f` |

Training is continuous, so the compute footprints overlap.

v0.19 is now deprecated. You can access the old v0.19 files at https://hf.co/hexgrad/kLegacy/tree/main/v0.19

### Voices and Languages

Voices are listed in [VOICES.md](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md). Not all voices are created equal:
- Subjectively, voices will sound better or worse to different people.
- Objectively, having less training data for a given voice (minutes instead of hours) lowers inference quality.
- Objectively, poor audio quality in training data (compression, low sample rate, artifacts) lowers inference quality.
- Objectively, text-audio misalignment (too much text, i.e. hallucinations, or too little text, i.e. failed transcriptions) lowers inference quality.

Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are represented by only a small handful of voices, or even just one (French).

Most voices perform best within a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
- **Weakness** on short utterances, especially those under 10-20 tokens. The root cause could be a lack of short-utterance training data and/or the model architecture. One possible inference-time mitigation is to bundle shorter utterances together.
- **Rushing** on long utterances, especially those over 400 tokens. You can chunk text down into shorter utterances or adjust the `speed` parameter to mitigate this.
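The two mitigations above (bundle short utterances, chunk long ones) can be sketched as a plain-Python pre-processing step. This is illustrative, not part of the `kokoro` API; `chunk_text` is a hypothetical helper that uses whitespace word count as a rough stand-in for the model's actual token count:

```python
import re

def chunk_text(text, hi=200):
    """Greedily pack sentences into chunks of at most ~hi 'tokens' (words here)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > hi:
            chunks.append(' '.join(current))  # close the chunk before it rushes
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks

chunks = chunk_text('One two three. ' * 120)  # 360 words -> multiple chunks
```

Each resulting chunk stays under the upper end of the goldilocks range and can be synthesized as its own utterance.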

### Model Facts

**Architecture:**

**Trained by**: `@rzvzn` on Discord

**Languages:** American English, British English, French, Hindi

**Model SHA256 Hash:** `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
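The published hash lets you verify a downloaded checkpoint before use. A minimal sketch using Python's standard `hashlib` (the local file path is hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so large checkpoints never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()

# Usage (hypothetical local filename):
# assert sha256_of('kokoro-v1_0.pth') == '496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4'
```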

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration | License | Added to Training Set |
| ---------- | -------- | ------- | --------------------- |
| [Koniwa](https://github.com/koniwa/koniwa) `tnc` | <1h | [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/deed.ja) | v0.19 / 22 Nov 2024 |
| [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) | <11h | [CC BY 4.0](https://datashare.ed.ac.uk/bitstream/handle/10283/2353/license_text) | v0.19 / 22 Nov 2024 |

### Acknowledgements
- [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2.
- [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena.
- Thank you to everyone who contributed synthetic training data.
- Special thanks to those who donated compute.
- Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an [AI in the Terminator franchise](https://terminator.fandom.com/wiki/Kokoro).

<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />