Run complete AI dubbing pipelines with speaker diarization, voice cloning, translation, rewriting, grammar correction, combined translation & rewriting, background restoration, and automatic subtitles, with support for multi-speaker videos and dubbing in 23 languages.
Run the full pipeline on a free T4 GPU:
Clone the speaker's voice and generate accurate subtitles from the processed audio.
Dub videos in any supported AI voice while restoring the original background music or ambient noise for a natural feel. Subtitles are automatically generated for the dubbed output.
- The app generates a ready-to-use prompt for translation or rewriting (see the sketch after this list).
- The user simply copies the prompt and pastes it into a free AI platform (Google AI Studio, ChatGPT, etc.).
- The translated or rewritten text is then pasted back into the app.
- This enables the use of premium LLM quality without any paid API calls or subscriptions.
- For longer videos, the app also provides local translation (Hunyuan-MT-7B-GGUF) or Google Translate support, though these options may not match the quality of the latest advanced LLMs.
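As a rough sketch of how such a prompt could be assembled from transcribed segments (the function and field names here are illustrative, not the app's actual internals):

```python
# Illustrative sketch only: build a copy-paste translation/rewriting
# prompt from transcribed segments. Names are hypothetical, not the
# app's real code.

def build_llm_prompt(segments, target_language="Spanish", mode="translate_rewrite"):
    """Assemble a numbered prompt the user can paste into any LLM chat."""
    instructions = {
        "translate": f"Translate each numbered line into {target_language}.",
        "rewrite": "Rewrite each numbered line as clean, natural speech.",
        "grammar": "Fix grammar and spelling only; keep the meaning unchanged.",
        "translate_rewrite": (
            f"Translate each numbered line into natural, polished {target_language}."
        ),
    }
    header = (
        instructions[mode]
        + " Keep the numbering and return exactly one line per number.\n\n"
    )
    body = "\n".join(f"{i + 1}. {seg['text']}" for i, seg in enumerate(segments))
    return header + body


segments = [
    {"text": "Hello everyone, welcome back to the channel."},
    {"text": "Today we gonna talk about dubbing."},
]
print(build_llm_prompt(segments, target_language="German"))
```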
This dubbing pipeline supports multiple text-processing modes:
Translate text from one language to another; ideal for multilingual dubbing.
Correct grammar, spelling, and sentence structure without changing the meaning. Used when the speaker's grammar is incorrect but the content should remain the same.
Rewrite sentences into clean, natural, professionally phrased speech. Useful when the original audio has broken grammar, slang, or unclear phrasing.
Translate the video and produce polished, natural sentences in the target language. Best for high-quality international dubbing.
Separates vocals from background music or ambient noise. https://github.com/facebookresearch/demucs
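A minimal example of the kind of call involved, invoking the Demucs CLI from Python to split a track into vocals and accompaniment (file names and output layout depend on the Demucs version and model used):

```python
import subprocess

# Separate vocals from everything else using the Demucs CLI.
# With --two-stems=vocals, Demucs writes vocals.wav and no_vocals.wav
# under separated/<model_name>/<track_name>/ by default.
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", "separated", "input_audio.wav"],
    check=True,
)
```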
Fast, accurate speech-to-text for large videos. https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
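A minimal transcription sketch with the faster-whisper Python API (device and compute type are assumptions to adjust for your hardware):

```python
from faster_whisper import WhisperModel

# Load the CTranslate2-converted turbo checkpoint (float16 needs a GPU).
model = WhisperModel(
    "deepdml/faster-whisper-large-v3-turbo-ct2",
    device="cuda",
    compute_type="float16",
)

# Transcribe with word-level timestamps so segments can later be
# aligned with diarization and TTS timing.
segments, info = model.transcribe("vocals.wav", word_timestamps=True)
print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```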
Detects and identifies multiple speakers. https://github.com/pyannote/pyannote-audio
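A sketch of diarization with pyannote.audio; the pretrained pipeline name and the Hugging Face token requirement are assumptions based on pyannote's public checkpoints:

```python
from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (requires accepting the model
# terms on Hugging Face and passing an access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",
)

# Run diarization and list who speaks when.
diarization = pipeline("vocals.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```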
4. Google AI Studio (Gemini 3 Pro Preview; any AI model can be used) - Translation & Rewriting
High-quality translation and text rewriting using Gemini models, as they support longer text generation. π https://aistudio.google.com/
Local GPU-friendly multilingual translation model. https://huggingface.co/mradermacher/Hunyuan-MT-7B-GGUF
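One way to run the GGUF checkpoint locally is via llama-cpp-python; the file name and plain-text prompt below are assumptions, and the recommended prompt format should be taken from the model card:

```python
from llama_cpp import Llama

# Load a quantized Hunyuan-MT GGUF file (the exact filename depends on
# the quantization downloaded from the model repo).
llm = Llama(model_path="Hunyuan-MT-7B.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

# Simple translation request; the plain instruction below is only an
# assumption, check the model card for the recommended template.
out = llm(
    "Translate the following sentence into German:\n"
    "Welcome back to the channel, today we talk about dubbing.\n",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```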
Simple API-based translation. https://pypi.org/project/googletrans/
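A minimal googletrans call; note that the API has changed between googletrans releases, and this matches the synchronous 4.0 release-candidate style:

```python
from googletrans import Translator

# Quick, free translation via the unofficial Google Translate API.
translator = Translator()
result = translator.translate("Welcome back to the channel.", src="en", dest="de")
print(result.text)
```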
Generate cloned voices and multilingual synthetic speech. https://github.com/resemble-ai/chatterbox
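A voice-cloning sketch with Chatterbox following the upstream README; the English checkpoint is shown here, and the multilingual variant lives in a separate class whose exact name may differ by release:

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Load the pretrained Chatterbox model (English checkpoint shown;
# the multilingual model is provided as a separate class upstream).
model = ChatterboxTTS.from_pretrained(device="cuda")

# Clone the speaker's voice from a short reference clip and synthesize
# the rewritten/translated line with it.
text = "Welcome back to the channel, today we talk about dubbing."
wav = model.generate(text, audio_prompt_path="speaker_reference.wav")
ta.save("dubbed_segment.wav", wav, model.sr)
```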
Trimming, merging, format conversion, audio mixing. https://www.ffmpeg.org/
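Typical FFmpeg calls used in this kind of pipeline, wrapped in Python for consistency with the rest of the code (all filenames are placeholders):

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Extract the audio track from the source video.
run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-acodec", "pcm_s16le", "audio.wav"])

# Mix the cloned speech back over the separated background track.
run([
    "ffmpeg", "-y", "-i", "dubbed_speech.wav", "-i", "no_vocals.wav",
    "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=longest[a]",
    "-map", "[a]", "mixed_audio.wav",
])

# Replace the original audio in the video with the mixed track.
run([
    "ffmpeg", "-y", "-i", "input.mp4", "-i", "mixed_audio.wav",
    "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", "output_dubbed.mp4",
])
```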
Logic, processing, audio manipulation, ML pipelines.
Builds the interactive web UI for your application. https://www.gradio.app/
Run the full dubbing system on free cloud GPU.
Helpful for debugging, writing utilities, and optimizing logic.
The current dubbing logic in audio_sync_pipeline.py achieves roughly 70% accuracy and struggles to perfectly synchronize the AI-generated voice with the original speech.
Key issues include:
- Incorrect speech speed
- Mismatch in rhythm and pacing
- Lack of natural timing variations
The goal is for the TTS output to match real human timing, creating smooth, natural, and believable dubbing.
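One common way to nudge a generated segment toward the original timing is tempo adjustment with FFmpeg's atempo filter. This is only a rough aid (atempo works best between roughly 0.5x and 2.0x per pass), not a fix for the deeper pacing issues described above:

```python
import subprocess

def stretch_to_duration(in_wav, out_wav, current_dur, target_dur):
    """Speed a TTS clip up or down so its length matches the original segment.

    atempo > 1.0 shortens the clip, < 1.0 lengthens it; the value is
    clamped to the range FFmpeg handles well in a single pass.
    """
    tempo = max(0.5, min(2.0, current_dur / target_dur))
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_wav, "-filter:a", f"atempo={tempo:.3f}", out_wav],
        check=True,
    )

# e.g. a 4.8 s TTS clip squeezed into the original 4.0 s slot
stretch_to_duration("tts_segment.wav", "tts_segment_fit.wav", 4.8, 4.0)
```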
If the generated TTS audio is too long, an LLM could be used to shorten or compress the rewritten sentence before regenerating speech. However, this approach has limitations:
- Requires a local LLM (needs a strong GPU), or
- Requires a paid API.
Both of these options may be impractical for many users.
Redubbing support has been added, but the current user interface is still rough. This feature is designed for manual copy-paste LLM prompts (Gemini, ChatGPT, or other LLMs), allowing sentence shortening without relying on paid API calls.
The current system does not analyze or replicate emotional tone from the original speakers. This leads to flat or inappropriate emotions in the dubbed audio.
For example:
- If the original speaker sounds sad, the dubbed version should also sound sad.
- If the speaker is excited, angry, or calm, the dubbing should reflect that emotion.
- Chatterbox multilingual TTS does not support emotional voice generation.
- The pipeline does not perform emotion detection on the input audio segments.
- Detect emotions in each audio segment (e.g., happy, sad, angry, neutral).
- Replace Chatterbox with a voice-cloning TTS model that supports emotional control.
- Apply the detected emotion to the cloned voice during TTS generation.
This would produce far more natural and expressive dubbing results.
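A sketch of the emotion-detection half of this idea, using a generic Hugging Face audio-classification pipeline; the specific checkpoint is an assumption, and any speech-emotion-recognition model with per-clip labels would slot in the same way:

```python
from transformers import pipeline

# Load a speech emotion recognition model (hypothetical choice; swap in
# any audio-classification checkpoint trained on emotion labels).
emotion_classifier = pipeline(
    "audio-classification",
    model="superb/hubert-large-superb-er",
)

# Classify one diarized segment; the top label (e.g. "hap", "sad",
# "ang", "neu") could then drive an emotion-capable TTS model.
predictions = emotion_classifier("segment_003.wav")
top = max(predictions, key=lambda p: p["score"])
print(f"Detected emotion: {top['label']} ({top['score']:.2f})")
```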
Based on the implementation by @rafaelgalle. https://github.com/rafaelgalle/whisper-diarization-advanced
Used for multilingual text-to-speech and voice cloning. https://github.com/resemble-ai/chatterbox
This project would not be possible without Chatterbox, the open-source multilingual TTS and voice cloning system developed by Resemble AI.
Chatterbox provides the core text-to-speech and voice cloning capabilities that make high-quality multilingual dubbing achievable in this project.
https://github.com/resemble-ai/chatterbox
This project uses AI-based voice cloning & dubbing technologies. Users must follow responsible and ethical usage guidelines:
- Do not impersonate individuals without permission.
- Do not create deceptive or harmful content.
- Respect privacy, copyright, and local laws.
- You are fully responsible for how you use this tool.



