A unified speech-language model that synchronizes speech and text into a single, cohesive stream via 1:1 alignment.
TADA achieves high-fidelity synthesis and generation with a fraction of the computational overhead required by traditional models. By leveraging a novel tokenizer and architectural design, each autoregressive step covers one text token, dynamically determining its duration and prosody — eliminating fixed frame rates and transcript hallucination.
- 1:1 Token Alignment — The tokenizer encodes audio into a sequence of vectors that perfectly matches the number of text tokens.
- Dynamic Duration Synthesis — Generates the full speech segment for a text token in a single autoregressive step, regardless of length.
- Dual-Stream Generation — Generates a text token and the speech for the preceding token simultaneously, maintaining the same context length as text-only generation.
- Efficiency & Reliability — Superior expressiveness and natural flow while significantly reducing computational cost.
TADA unifies modalities by ensuring that for every word or subword token, there is exactly one corresponding speech vector. This synchronized stream allows the model to "understand" the precise timing of speech relative to text.
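As a rough illustration of the dual-stream layout described above, the toy sketch below lists what each autoregressive step would emit: one text token, plus the speech for the preceding token. It is not TADA's implementation. The function name `dual_stream_steps` is hypothetical, and the final step that emits the last token's speech (for instance alongside an end-of-sequence token) is an assumption.

```python
from typing import List, Optional, Tuple


def dual_stream_steps(
    text_tokens: List[str],
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return, per step, (text token emitted, token whose speech is emitted)."""
    steps = []
    # One step per text token, plus one assumed final step so the last
    # token's speech is also emitted.
    for i in range(len(text_tokens) + 1):
        text_out = text_tokens[i] if i < len(text_tokens) else None
        speech_for = text_tokens[i - 1] if i > 0 else None
        steps.append((text_out, speech_for))
    return steps


tokens = ["Please", " call", " Stella", "."]
for step, (text_out, speech_for) in enumerate(dual_stream_steps(tokens)):
    print(step, repr(text_out), "| speech for:", repr(speech_for))
```

Because the speech stream trails the text stream by one token, the number of steps grows with the number of text tokens rather than with the audio duration.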
Most TTS models require a fixed number of steps to produce one second of audio (e.g., 50 frames per second). TADA breaks this constraint:
- Each autoregressive step covers one text token.
- The model dynamically determines the duration and prosody for that specific token.
- This results in a more natural flow and eliminates transcript hallucination.
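To make the step-count difference concrete, here is a small arithmetic sketch. The 50 frames per second figure comes from the example above. The utterance length and token count are hypothetical numbers chosen only for illustration, and the function names are not part of TADA.

```python
def fixed_rate_steps(duration_s: float, frames_per_second: int = 50) -> int:
    """Steps for a model that emits a fixed number of frames per second."""
    return round(duration_s * frames_per_second)


def token_aligned_steps(num_text_tokens: int) -> int:
    """Steps when each autoregressive step covers exactly one text token."""
    return num_text_tokens


duration_s = 6.0      # hypothetical utterance length in seconds
num_text_tokens = 18  # hypothetical token count of its transcript

print(fixed_rate_steps(duration_s))          # 300
print(token_aligned_steps(num_text_tokens))  # 18
```

The fixed-rate count scales with audio duration, while the token-aligned count depends only on the length of the text.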
Install from PyPI:

```bash
pip install hume-tada
```

Or install from source:

```bash
git clone https://github.com/HumeAI/tada.git
cd tada
pip install -e .
```

| Model | Base Model | HuggingFace Hub |
|---|---|---|
| TADA-1B | Llama 3.2 1B | HumeAI/tada-1b |
| TADA-3B-ML | Llama 3.2 3B | HumeAI/tada-3b-ml |
All models use the same encoder (HumeAI/tada-codec) and can be loaded using the same API.
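One way to read "the same API" is that only the Hub identifier changes between models. The sketch below wraps that in a small helper. The name `load_tada` and the `MODEL_IDS` mapping are hypothetical conveniences, not part of the `tada` package. The `from_pretrained` calls mirror the ones used elsewhere in this README.

```python
# Hub identifiers taken from the model table above.
MODEL_IDS = {
    "TADA-1B": "HumeAI/tada-1b",
    "TADA-3B-ML": "HumeAI/tada-3b-ml",
}
CODEC_ID = "HumeAI/tada-codec"


def load_tada(model_name: str, device: str = "cuda"):
    """Load the shared encoder and the chosen model (hypothetical helper)."""
    if model_name not in MODEL_IDS:
        raise ValueError(
            f"Unknown model {model_name!r}; expected one of {sorted(MODEL_IDS)}"
        )
    # Imported lazily so the mapping above can be used without tada installed.
    from tada.modules.encoder import Encoder
    from tada.modules.tada import TadaForCausalLM

    encoder = Encoder.from_pretrained(CODEC_ID, subfolder="encoder").to(device)
    model = TadaForCausalLM.from_pretrained(MODEL_IDS[model_name]).to(device)
    return encoder, model


# Example (requires the tada package and a CUDA device):
# encoder, model = load_tada("TADA-1B")
```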
```python
import torch
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

device = "cuda"
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-3b-ml").to(device)

# Load the reference audio clip and its transcript
audio, sample_rate = torchaudio.load("samples/ljspeech.wav")
audio = audio.to(device)
prompt_text = "The examination and testimony of the experts, enabled the commission to conclude that five shots may have been fired."

# Encode the reference audio and transcript into a prompt
prompt = encoder(
    audio, text=[prompt_text], sample_rate=sample_rate
)

# Generate speech for the target text, conditioned on the prompt
output = model.generate(
    prompt=prompt,
    text="Please call Stella. Ask her to bring these things with her from the store.",
)
```

TADA supports multilingual speech synthesis via language-specific aligners. Pass the `language` parameter when loading the encoder to use the appropriate aligner for your target language.
```python
import torch
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

device = "cuda"
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder", language="ja").to(device)
model = TadaForCausalLM.from_pretrained("HumeAI/tada-3b-ml").to(device)

# Load a reference audio clip in the target language
audio, sample_rate = torchaudio.load("samples/ja_prompt.wav")
audio = audio.to(device)

# For non-English prompts, provide the transcript so the encoder uses forced alignment
# instead of the built-in ASR (which is English-only)
prompt_text = "このムキムキのお兄さんがいるし バーだし少し高そうだと思いますよねこのバーの料金設定は良心的でした まあそんなに高くなかったです"
prompt = encoder(audio, text=[prompt_text], sample_rate=sample_rate)

output = model.generate(
    prompt=prompt,
    text="今日はとても良い天気ですね。散歩に行きましょう。",
)
```

Supported languages: `ar`, `ch`, `de`, `es`, `fr`, `it`, `ja`, `pl`, `pt`. When `language` is not specified, the default English aligner is used.
Note: For non-English prompts, provide the transcript of the reference audio via the `text` parameter, because the encoder's built-in ASR is English-only. Without a transcript, generation still works, but alignment quality is degraded.
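Since only a fixed set of language codes is listed, it can help to validate the code before loading the encoder. The helper below is a hypothetical sketch, not part of the `tada` package. It assumes that English is selected by omitting `language`, as stated above, and it makes no claim about whether the library itself accepts an explicit English code.

```python
from typing import Dict, Optional

# Language codes listed as supported in this README.
SUPPORTED_LANGUAGES = ("ar", "ch", "de", "es", "fr", "it", "ja", "pl", "pt")


def encoder_kwargs(language: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments for Encoder.from_pretrained (hypothetical helper)."""
    kwargs = {"subfolder": "encoder"}
    if language is None:
        # No language given: the default English aligner is used.
        return kwargs
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    kwargs["language"] = language
    return kwargs


print(encoder_kwargs("ja"))

# Example (requires the tada package):
# encoder = Encoder.from_pretrained("HumeAI/tada-codec", **encoder_kwargs("ja"))
```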
Provide `num_extra_steps` if you want to generate a text-speech continuation of the prompt:

```python
output = model.generate(
    prompt=prompt,
    num_extra_steps=50,
)
```

If you use this project in your research, please cite our paper:
```bibtex
@article{dang2026tada,
  title={TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment},
  author={Dang, Trung and Rao, Sharath and Gupta, Ananya and Gagne, Christopher and Tzirakis, Panagiotis and Baird, Alice and Cłapa, Jakub Piotr and Chin, Peter and Cowen, Alan},
  journal={arXiv preprint arXiv:2602.23068},
  year={2026}
}
```

Hume AI is an empathic AI research company. We research the datasets, tools, and models needed to give empathy to AI models to serve human wellbeing. If you're interested in any of our product or research collaborations, please reach out to us at hello@hume.ai.



