DeepDub is an automated pipeline that dubs video content into other languages while preserving the original speaker's voice. It uses a chain of state-of-the-art AI models to transcribe, translate (with cultural nuance), and clone voices locally.
The pipeline consists of four distinct modules:
- The Ear (Transcription): Uses Faster-Whisper to transcribe the extracted audio track and generate precise timestamps.
- The Brain (Translation): Uses Llama 3.2 (via Ollama) with a custom system prompt to perform context-aware translation (e.g., understanding that "chop" in a kitchen means "cut," not "pork chop").
- The Voice (Cloning): Uses Coqui XTTS v2 to clone the original speaker's timbre and generate speech in the target language (Spanish, Hindi, etc.).
- The Editor (Assembly): Uses FFmpeg to surgically insert the new audio segments at the correct timestamps, mixing them with the original background noise.
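The hand-off between these four stages can be sketched as a single record that each stage enriches. The field and function names below are illustrative, not DeepDub's actual schema, and the translation/synthesis calls are stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One transcribed utterance as it moves through the pipeline."""
    start: float           # seconds into the video (from The Ear)
    end: float
    text: str              # original transcript
    translation: str = ""  # filled in by The Brain
    audio_path: str = ""   # cloned clip written by The Voice

def translate(text: str) -> str:
    # Stand-in for the Llama 3.2 call made by The Brain.
    return f"[es] {text}"

def synthesize(text: str) -> str:
    # Stand-in for XTTS v2 synthesis; would return a clip path.
    return "segment.wav"

def dub(segments: list[Segment]) -> list[Segment]:
    """Run each segment through translation and cloning; The Editor
    then drops every clip back in at segment.start."""
    for seg in segments:
        seg.translation = translate(seg.text)
        seg.audio_path = synthesize(seg.translation)
    return segments
```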
- Language: Python
- Transcription: `faster-whisper` (an optimized OpenAI Whisper implementation)
- Translation: `ollama` running `llama3.2` (local LLM)
- Voice Cloning: `TTS` (Coqui XTTS v2)
- Media Processing: `ffmpeg-python`
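As a concrete example of the transcription layer, a minimal `faster-whisper` call might look like the sketch below. The model size and `int8` quantization are assumptions, not the project's actual settings; the timestamp helper is a plain utility for subtitle-style output:

```python
def transcribe(audio_path: str, model_size: str = "small"):
    """Yield (start, end, text) tuples using faster-whisper."""
    # Imported lazily so the timestamp helper below stays dependency-free.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, compute_type="int8")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    for seg in segments:
        yield seg.start, seg.end, seg.text

def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```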
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/DeepDub.git
  cd DeepDub
  ```

- Install dependencies (note: Microsoft C++ Build Tools are required for Coqui TTS on Windows):

  ```bash
  pip install -r requirements.txt
  ```

- Install the local LLM: download Ollama and pull the lightweight model:

  ```bash
  ollama pull llama3.2
  ```
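Once the model is pulled, the translation step can talk to Ollama's local REST API (`POST /api/chat` on port 11434). A sketch of the request such a script might build — the system prompt here is illustrative, not DeepDub's actual prompt:

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a dubbing translator. Translate the user's line into {lang}, "
    "preserving tone and domain context (e.g. in a kitchen scene, 'chop' "
    "means 'cut', not 'pork chop'). Reply with the translation only."
)

def build_chat_request(text: str, target_lang: str = "Spanish") -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": "llama3.2",
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT.format(lang=target_lang)},
            {"role": "user", "content": text},
        ],
    }

def translate(text: str, target_lang: str = "Spanish") -> str:
    """Send the request to a locally running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_request(text, target_lang)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```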
- Place your video file in the root folder and rename it to `input_video.mp4`.
- Step 1: Extract & Transcribe

  ```bash
  python 1_transcribe.py
  ```

- Step 2: Smart Translation

  ```bash
  python 2_translate_llm.py
  ```

- Step 3: Generate Voice Clones

  ```bash
  python 3_clone.py
  ```

- Step 4: Merge Video

  ```bash
  python 4_merge.py
  ```

- Done! Check `final_dubbed_video.mp4` for the result.
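Under the hood, the merge step boils down to delaying each cloned clip to its original timestamp and mixing it over a quieted original track. A sketch of the ffmpeg command such a step might construct — the file names and the `volume=0.2` ducking level are assumptions:

```python
def build_merge_cmd(video: str, clips: list[tuple[str, float]], out: str) -> list[str]:
    """Construct an ffmpeg command that overlays each dubbed clip
    (path, start time in seconds) on top of a quieted original track."""
    cmd = ["ffmpeg", "-y", "-i", video]
    for path, _start in clips:
        cmd += ["-i", path]

    # Lower the original audio, then delay each clip into position.
    parts = ["[0:a]volume=0.2[bg]"]
    labels = ["[bg]"]
    for i, (_path, start) in enumerate(clips):
        ms = int(start * 1000)
        parts.append(f"[{i + 1}:a]adelay={ms}|{ms}[d{i}]")
        labels.append(f"[d{i}]")
    parts.append(f"{''.join(labels)}amix=inputs={len(labels)}:duration=first[mix]")

    cmd += [
        "-filter_complex", ";".join(parts),
        "-map", "0:v",   # keep the original video stream untouched
        "-map", "[mix]",
        "-c:v", "copy",
        out,
    ]
    return cmd
```

Running the result is then a `subprocess.run(cmd, check=True)`; the project itself uses `ffmpeg-python`, which generates an equivalent filtergraph programmatically.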
- Lip Sync: Implement `Wav2Lip` to match mouth movements to the new language.
- Background Noise Separation: Use `Spleeter` to isolate the voice from music for cleaner mixing.
- GUI: Build a Streamlit interface for drag-and-drop dubbing.