A real-time voice AI that joins Jitsi video calls and has actual conversations.
No cloud APIs. Runs on a laptop. Fully self-hosted.
- Joins a Jitsi Meet room as a participant named "Enki"
- Listens to speech via WebRTC audio capture
- Transcribes in real-time using Whisper (~0.6s latency)
- Responds with synthesized voice using Edge TTS
- All processing happens locally
You: "Hello, can you hear me?"
Enki: "Hello! Nice to hear from you!"
You: "What's your name?"
Enki: "I am Enki, god of wisdom and water. I'm your AI assistant."
| Component | Technology |
|---|---|
| Video Conferencing | Self-hosted Jitsi Meet (Docker) |
| Bot Browser | Puppeteer (Headless Chrome) |
| Speech-to-Text | faster-whisper (tiny.en model) |
| Text-to-Speech | edge-tts (Microsoft Neural Voices) |
| Audio Routing | PulseAudio virtual sinks |
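These components chain into a simple capture → transcribe → think → speak loop. Below is a minimal sketch of that shape in Node.js; every function body is a placeholder standing in for the real component, not the actual code in voice-bot.js.

```js
// Conceptual shape of the bot's main loop. Each placeholder stands in for a
// component from the table above (WebRTC capture, faster-whisper, OpenClaw,
// edge-tts); the real wiring lives in voice-bot.js.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function captureChunk() { await sleep(4000); return Buffer.alloc(0); } // ~4 s audio chunk (placeholder)
async function transcribe(chunk) { return ""; }                              // faster-whisper STT (placeholder)
async function askOpenClaw(text) { return `You said: ${text}`; }             // OpenClaw LLM reply (placeholder)
async function speak(reply) { console.log("TTS:", reply); }                  // edge-tts playback (placeholder)

async function conversationLoop() {
  for (;;) {
    const chunk = await captureChunk();
    const text = (await transcribe(chunk)).trim();
    if (!text) continue;                   // nothing transcribed, keep listening
    const reply = await askOpenClaw(text);
    await speak(reply);
  }
}

conversationLoop();
```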
- Linux (tested on Ubuntu 24.04 / WSL2)
- Node.js 18+
- Python 3.10+
- Docker & Docker Compose
- PulseAudio
- FFmpeg
```bash
# Clone Jitsi Docker setup
git clone https://github.com/jitsi/docker-jitsi-meet.git jitsi
cd jitsi

# Configure
cp env.example .env
# Edit .env and set:
# - JVB_ADVERTISE_IPS=127.0.0.1,<your-ip>
# - ENABLE_GUESTS=1

# Start
docker compose up -d
```

```bash
# Clone this repo
git clone https://github.com/yaxzone/openclaw-talks-back.git
cd openclaw-talks-back

# Install Node dependencies
npm install

# Create Python venv and install Whisper
python3 -m venv venv
source venv/bin/activate
pip install faster-whisper edge-tts
```

Create a PulseAudio virtual sink so TTS audio can be routed into the call:

```bash
pactl load-module module-null-sink sink_name=VirtualMic sink_properties=device.description=VirtualMic
```

Then start the bot:

```bash
node voice-bot.js
```

Open https://localhost:8443/Enki in your browser and start talking!
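Why the virtual sink matters: headless Chrome has no real microphone, so the bot plays its TTS output into VirtualMic and the sink's audio ends up in the WebRTC stream (PulseAudio null sinks expose a monitor that applications can record from). Below is a rough sketch of that playback leg, assuming edge-tts, ffmpeg, and paplay are on the PATH; the voice name, temp file paths, and the ffmpeg conversion step are illustrative, not necessarily what voice-bot.js does.

```js
// Sketch: synthesize a reply with edge-tts, convert it to WAV with ffmpeg,
// and play it into the VirtualMic sink so the WebRTC stream carries it.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { join } from "node:path";

const run = promisify(execFile);

async function speak(text) {
  const mp3 = join(tmpdir(), "enki-reply.mp3");
  const wav = join(tmpdir(), "enki-reply.wav");
  // edge-tts writes MP3 by default; the voice name here is an assumption
  await run("edge-tts", ["--voice", "en-US-GuyNeural", "--text", text, "--write-media", mp3]);
  // paplay needs an uncompressed format, so convert with ffmpeg first
  await run("ffmpeg", ["-y", "-i", mp3, wav]);
  // play into the null sink created with pactl above
  await run("paplay", ["--device=VirtualMic", wav]);
}

speak("Hello! Nice to hear from you!").catch(console.error);
```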
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `ROOM` | `Enki` | Jitsi room name to join |
| `WHISPER_VENV` | `./venv/bin/python` | Path to Python with faster-whisper |
| `EDGE_TTS` | `edge-tts` | Path to edge-tts binary |
| `OPENCLAW_URL` | `http://localhost:18789/v1/responses` | OpenClaw API endpoint |
| `OPENCLAW_TOKEN` | (required) | OpenClaw gateway auth token |
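In the bot these map to ordinary environment lookups. A sketch of the defaulting logic (the `config` object and its field names are illustrative; only the variable names and defaults come from the table above):

```js
// Read configuration from the environment, falling back to the documented defaults.
const config = {
  room: process.env.ROOM || "Enki",
  whisperVenv: process.env.WHISPER_VENV || "./venv/bin/python",
  edgeTts: process.env.EDGE_TTS || "edge-tts",
  openclawUrl: process.env.OPENCLAW_URL || "http://localhost:18789/v1/responses",
  openclawToken: process.env.OPENCLAW_TOKEN, // required; no default
};

if (!config.openclawToken) {
  console.error("OPENCLAW_TOKEN is required");
  process.exit(1);
}
```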
The bot connects to OpenClaw's API for intelligent responses. This means you get a real AI conversation, not canned responses.
- Enable the responses endpoint in OpenClaw config:

  ```js
  {
    gateway: {
      http: {
        endpoints: {
          responses: { enabled: true }
        }
      }
    }
  }
  ```

- Set your token:

  ```bash
  export OPENCLAW_TOKEN="your-gateway-token"
  ```

- Run the bot. It will now route speech to OpenClaw and speak the AI's response!
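A sketch of the request the bot sends for each transcript. The endpoint and token come from the configuration above; the request body field (`input`) and the response field (`output_text`) are assumptions modeled on OpenAI-style responses endpoints, so check OpenClaw's API docs for the exact shape.

```js
// POST the transcript to OpenClaw and return the reply text to be spoken.
async function askOpenClaw(transcript) {
  const res = await fetch(process.env.OPENCLAW_URL || "http://localhost:18789/v1/responses", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENCLAW_TOKEN}`,
    },
    body: JSON.stringify({ input: transcript }), // field name is an assumption
  });
  if (!res.ok) throw new Error(`OpenClaw request failed: ${res.status}`);
  const data = await res.json();
  return data.output_text ?? JSON.stringify(data); // response field is an assumption
}

askOpenClaw("What's your name?").then(console.log).catch(console.error);
```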
See ARCHITECTURE.md for detailed technical documentation including:
- System diagram
- Data flow (STT and TTS)
- Key challenges and solutions
- Performance metrics
- **ICE Connection Failures**: The bot connected via localhost, but the JVB advertised a different IP. Fixed by adding `127.0.0.1` to `JVB_ADVERTISE_IPS`.
- **Slow Whisper**: Reloading the model for every chunk was too slow. Fixed by running a persistent server process that keeps the model in memory.
- **TTS Audio Routing**: Chrome's fake-media-stream only sends test patterns. Fixed by using PulseAudio virtual sinks to route TTS audio into the WebRTC stream.
- **Echo Loop**: The bot was transcribing its own TTS output. Fixed by adding an `isSpeaking` flag that skips transcription during playback (see the sketch below).
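The echo-loop fix is essentially a guard around the transcription path. A minimal sketch, where `speak` and `transcribe` are placeholders for the real TTS and Whisper calls in voice-bot.js:

```js
// Placeholders for the real TTS playback and Whisper transcription calls.
const speak = async (text) => new Promise((resolve) => setTimeout(resolve, 1500));
const transcribe = async (chunk) => "";

// Guard: while our own TTS is playing, ignore incoming audio so the bot
// doesn't transcribe (and answer) itself.
let isSpeaking = false;

async function speakGuarded(text) {
  isSpeaking = true;
  try {
    await speak(text);
  } finally {
    isSpeaking = false; // always clear the flag, even if playback fails
  }
}

async function onAudioChunk(chunk) {
  if (isSpeaking) return null; // skip transcription during playback
  return transcribe(chunk);
}
```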
| Metric | Value |
|---|---|
| STT Latency | ~0.6s per 4s chunk |
| TTS Generation | ~1-2s |
| End-to-end | ~5-6s |
| Memory (Chrome) | ~250MB |
| Memory (Whisper) | ~200MB |
For persistent operation (auto-start, auto-restart on crash):
```bash
# Copy service file
sudo cp jitsi-voice-bot.service /etc/systemd/system/

# Edit the service file to set your OPENCLAW_TOKEN
sudo nano /etc/systemd/system/jitsi-voice-bot.service

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable jitsi-voice-bot
sudo systemctl start jitsi-voice-bot

# View logs
sudo journalctl -u jitsi-voice-bot -f
```

The bot will now:
- Auto-start on system boot
- Auto-restart if it crashes
- Always be waiting in the configured room
- Wake word detection ("Hey Enki")
- Streaming STT for lower latency
- ~~Integration with LLM for intelligent responses~~ ✅ Done via OpenClaw API
- ~~Systemd service for persistence~~ ✅ Done
- Multiple room support
MIT
Built with 🔱 — February 2026