ttsx — Modern Text-to-Speech CLI

A fast, modern CLI for managing TTS models and generating natural-sounding speech — including zero-shot voice cloning.

✅ Status: Phase 2.1 Complete — MVP + Voice Cloning functional!

Features

Feature	Status
Model search (HuggingFace Hub, live size fetching)	✅
Hardware VRAM compatibility checking	✅
Model install / list / remove / info	✅
Cache management with LRU eviction	✅
Speech generation (Qwen3-TTS, MLX models)	✅
Predefined voices (9 built-in speakers)	✅
Voice profiles (save & reuse cloned voices)	✅ New
Zero-shot voice cloning (`ttsx clone`)	✅ New
Batch processing	🚧 Planned
Voice customization (pitch, speed)	🚧 Planned
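
As a rough illustration of the LRU eviction listed above, the sketch below picks which cached models to drop once the cache exceeds its size limit. It is not ttsx's implementation: the `CachedModel` record, its fields, and `plan_eviction` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedModel:
    """Hypothetical record for one cached model (not ttsx's actual schema)."""
    model_id: str
    size_gb: float
    last_used: float  # Unix timestamp of the most recent use


def plan_eviction(models: list[CachedModel], max_cache_size_gb: float) -> list[str]:
    """Return model IDs to evict, least recently used first, until the cache fits."""
    total = sum(m.size_gb for m in models)
    evicted: list[str] = []
    # Sorting by last_used puts the least recently used entries first.
    for model in sorted(models, key=lambda m: m.last_used):
        if total <= max_cache_size_gb:
            break
        evicted.append(model.model_id)
        total -= model.size_gb
    return evicted


if __name__ == "__main__":
    cache = [
        CachedModel("Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice", 1.2, last_used=100.0),
        CachedModel("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", 3.4, last_used=300.0),
        CachedModel("Qwen/Qwen3-TTS-12Hz-1.7B-Base", 3.4, last_used=200.0),
    ]
    print(plan_eviction(cache, max_cache_size_gb=5.0))
```

The most recently used model is kept even though it is the largest, because recency is the only criterion in this sketch.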

Quick Start

# Clone and install
git clone <repository-url>
cd ttsx
uv sync

# Check hardware (GPU/VRAM/RAM)
uv run ttsx hw

# Search for compatible models
uv run ttsx search "qwen tts" --compatible

# Install a model (download size depends on the model; see Supported Models)
uv run ttsx install Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

# Generate speech with a predefined voice
uv run ttsx generate "Hello world!" --voice Serena --output hello.wav

# Clone a voice from a reference audio file
uv run ttsx clone "Hello, this is my voice." --audio reference.wav --output cloned.wav

Installation

Requirements

Python 3.12+ (tested with 3.14)
PyTorch 2.5+
CUDA 12.1+ (optional, for GPU acceleration)
10–50 GB disk space depending on installed models

From Source

git clone <repository-url>
cd ttsx
uv sync               # installs all dependencies into .venv
uv run ttsx --help

For development (linters, tests):

uv sync --all-extras
uv run ruff check .
uv run ruff format .
uv run pytest

Command Reference

Hardware

ttsx hw                    # GPU, CPU, RAM, PyTorch info
ttsx hw --json             # Machine-readable JSON output
ttsx hw --verbose          # Detailed diagnostics

Model Management

# Search HuggingFace Hub (sizes + compatibility load concurrently)
ttsx search                          # browse popular TTS models
ttsx search "qwen"                   # keyword search
ttsx search --compatible             # only models that fit your VRAM
ttsx search --limit 10               # limit results

# Model info
ttsx info Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

# Install / list / remove
ttsx install Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
ttsx models                          # list installed models + cache stats
ttsx remove Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
ttsx remove Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --force
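
The `--compatible` filter above keeps only models that fit your VRAM. The rule ttsx applies is not spelled out in this README, so the sketch below assumes a simple one: the model size multiplied by a headroom factor must not exceed available VRAM. The `fits_in_vram` and `compatible_models` helpers and the 1.2 headroom factor are hypothetical.

```python
def fits_in_vram(model_size_gb: float, vram_gb: float, headroom: float = 1.2) -> bool:
    """Assumed rule: weights plus ~20% working memory must fit in VRAM."""
    return model_size_gb * headroom <= vram_gb


def compatible_models(sizes_gb: dict[str, float], vram_gb: float) -> list[str]:
    """Return the model IDs that pass the assumed fit rule, smallest first."""
    fitting = [
        (size, model_id)
        for model_id, size in sizes_gb.items()
        if fits_in_vram(size, vram_gb)
    ]
    return [model_id for _, model_id in sorted(fitting)]


if __name__ == "__main__":
    # Approximate sizes taken from the Supported Models table.
    sizes = {
        "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice": 1.2,
        "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice": 3.4,
    }
    print(compatible_models(sizes, vram_gb=4.0))
```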

Speech Generation

Generate speech using a predefined voice or inline reference audio:

# Basic
ttsx generate "Hello world!"
ttsx generate "Hello world!" --output hello.wav
ttsx generate "Hello world!" --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

# Predefined voices
ttsx generate "Good morning." --voice Serena
ttsx generate "Good morning." --voice Ryan --output morning.wav

# Read text from file or stdin
ttsx generate --text-file script.txt --output narration.wav
echo "Hello world" | ttsx generate -

# Inline voice cloning (no saved profile needed)
ttsx generate "Hello." --ref-audio reference.wav
ttsx generate "Hello." --ref-audio reference.wav --ref-text "Reference transcript"

Available predefined voices

Voice	Voice	Voice
Aiden	Dylan	Eric
Ono_anna	Ryan	Serena
Sohee	Uncle_fu	Vivian

List voices for any installed model:

ttsx voices list --predefined
ttsx voices list --predefined --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Voice Cloning

Voice cloning lets you synthesize speech that sounds like a specific person from a short reference sample of clean speech (see "Tips for best clone quality" for the recommended length and format).

There are two workflows:

1. Direct cloning (no saved profile)

Provide a reference audio file directly to `ttsx clone`:

# From an audio file — no transcript
ttsx clone "The meeting starts at nine." --audio reference.wav

# With transcript (strongly recommended — improves clone quality)
ttsx clone "The meeting starts at nine." \
  --audio reference.wav \
  --ref-text "Hi, this is a short sample of my voice for cloning." \
  --output output.wav

# From stdin
echo "Hello world" | ttsx clone - --audio reference.wav

# Specify model explicitly
ttsx clone "Hello." --audio reference.wav --model Qwen/Qwen3-TTS-12Hz-0.6B-Base

2. Saved voice profiles (recommended for reuse)

Save a voice once, use it anywhere:

# Save a voice profile
ttsx voices add narrator reference.wav
ttsx voices add narrator reference.wav \
  --ref-text "Hi, this is a sample of my voice." \
  --language English \
  --description "Deep narrator voice"

# List saved profiles
ttsx voices list

# View profile details
ttsx voices info narrator

# Generate speech with saved profile
ttsx clone "Chapter one. The story begins." --profile narrator
ttsx clone "Chapter one." --profile narrator --output chapter1.wav
ttsx clone --text-file chapter1.txt --profile narrator --output chapter1.wav

# Remove a profile
ttsx voices remove narrator
ttsx voices remove narrator --force     # skip confirmation
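
Batch processing is still planned, so in the meantime a small wrapper can loop over text files with a saved profile. The sketch below only assembles argument lists from flags shown above (`--text-file`, `--profile`, `--output`). The `build_clone_commands` and `run_all` helpers are hypothetical, and naming each output after its input file is an assumption.

```python
import subprocess
from pathlib import Path


def build_clone_commands(
    text_files: list[Path], profile: str, out_dir: Path
) -> list[list[str]]:
    """Build one `ttsx clone` invocation per text file (hypothetical helper)."""
    commands = []
    for text_file in sorted(text_files):
        # chapter1.txt -> <out_dir>/chapter1.wav
        output = out_dir / f"{text_file.stem}.wav"
        commands.append([
            "ttsx", "clone",
            "--text-file", str(text_file),
            "--profile", profile,
            "--output", str(output),
        ])
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    chapters = [Path("chapter2.txt"), Path("chapter1.txt")]
    for argv in build_clone_commands(chapters, "narrator", Path("audio")):
        print(" ".join(argv))
```

Building the commands separately from running them makes it easy to print and review the list before any audio is generated.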

Voice profile options

Option	Description
`--ref-text`	Transcript of the reference audio. Strongly recommended — without it, the model uses x-vector mode (lower quality).
`--language`	Language of the voice (informational)
`--description`	Human-readable note
`--overwrite`	Replace an existing profile with the same name

Tips for best clone quality

3–15 seconds of clean, single-speaker speech works best
Provide `--ref-text` whenever possible
Use WAV at 16 kHz or higher for best results
Avoid noisy recordings, music, or multiple speakers
`ttsx voices add` warns you if the audio is too short or has a low sample rate
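
To check a reference file against the tips above before saving a profile, something like the following sketch could be used. It relies only on Python's standard `wave` module, so it handles PCM WAV files only. The thresholds are taken from the tips, not from ttsx's own checks, and `check_reference_wav` is a hypothetical helper.

```python
import wave

MIN_SECONDS = 3.0          # lower bound from the tips above
MAX_SECONDS = 15.0         # upper bound from the tips above
MIN_SAMPLE_RATE = 16_000   # 16 kHz, from the tips above


def check_reference_wav(path: str) -> list[str]:
    """Return human-readable warnings for a PCM WAV reference file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"too short: {duration:.1f}s, aim for at least {MIN_SECONDS:.0f}s")
    elif duration > MAX_SECONDS:
        warnings.append(f"long sample: {duration:.1f}s, {MAX_SECONDS:.0f}s or less works best")
    if sample_rate < MIN_SAMPLE_RATE:
        warnings.append(f"low sample rate: {sample_rate} Hz, use {MIN_SAMPLE_RATE} Hz or higher")
    return warnings
```

An empty list means the file meets the length and sample-rate tips. Noise, music, and multiple speakers are not detected by this check.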

Supported Models

Qwen3-TTS (recommended)

Model	Size	Type	Notes
`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`	~1.2 GB	CustomVoice	9 predefined speakers
`Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`	~3.4 GB	CustomVoice	Higher quality
`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	~1.2 GB	Base	Voice cloning
`Qwen/Qwen3-TTS-12Hz-1.7B-Base`	~3.4 GB	Base	Voice cloning, higher quality

MLX (Apple Silicon only)

Search mlx-community for optimized Apple Silicon variants:

ttsx search "mlx qwen tts"

Configuration

The optional configuration file lives at `~/.ttsx/config.toml`:

[general]
cache_dir = "~/.ttsx/models"     # where models are stored
max_cache_size_gb = 50           # maximum cache size

[generation]
device = "auto"                  # auto, cpu, cuda, mps

Environment variables override config file values:

Variable	Description
`TTSX_CACHE_DIR`	Override model cache directory
`TTSX_HF_TOKEN`	HuggingFace API token (for gated models)
`TTSX_DEVICE`	Force device: `cpu`, `cuda`, `mps`
`TTSX_LOG_LEVEL`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`

Data Directories

~/.ttsx/
├── registry.json            # Installed models index
├── models/                  # Downloaded model weights
└── voices/                  # Voice profiles (Phase 2.1)
    ├── profiles.json        # Profile metadata
    └── audio/               # Managed copies of reference audio
        ├── narrator.wav
        └── alice.mp3

Documentation

TODO.md — Development roadmap and task tracking
ARCHITECTURE.md — System design and implementation details
AGENTS.md — Guide for AI agents working on this project

Project Status

Phase	Status
Phase 0 — Async/concurrent operations	✅ Complete
Phase 1 — MVP (search, install, generate, hardware)	✅ Complete
Phase 2.1 — Voice cloning + profiles	✅ Complete
Phase 2.2 — Batch processing	🚧 Planned
Phase 2.3 — Voice customization	🚧 Planned
Phase 3 — Optimization & polish	🚧 Planned

Tech Stack

Python 3.14 · uv (package management)
Typer + Rich (CLI framework + terminal UI)
PyTorch + qwen-tts (model runtime)
HuggingFace Hub (model registry)
soundfile + scipy (audio I/O)
Pydantic + pydantic-settings (data validation + configuration)

License

TBD

Active development — APIs and commands may change between versions.
