A modern, user-friendly web application for converting audio to text using state-of-the-art AI models. Built with Gradio and powered by Hugging Face Transformers.
- Microphone Recording: Record audio directly in your browser
- File Upload: Support for various audio formats (WAV, MP3, FLAC, M4A, etc.)
- Real-time Transcription: Fast and accurate speech recognition
- Timestamp Support: Optional timestamps for each transcribed segment
- Clean UI: Modern, responsive interface built with Gradio
- Local Processing: Audio is transcribed on your machine and is never sent to external servers (only the model itself is downloaded, on first run)
- Python 3.10 or higher
- uv package manager (see the uv documentation for installation instructions)
- Clone or create the project directory:
    mkdir speech-to-text-app
    cd speech-to-text-app
- Install dependencies using uv:
    uv sync
- Run the application:
    uv run python app.py
- Open your browser: Navigate to http://localhost:7860 to use the application.
- Click on the "🎙️ Record Audio" tab
- Click the microphone button to start recording
- Speak clearly into your microphone
- Click stop when finished
- Click "Transcribe Recording" to convert to text
- Click on the "📁 Upload Audio File" tab
- Drag and drop or browse for your audio file
- Click "Transcribe File" to convert to text
- Use "Transcribe with Timestamps" buttons to get time-coded transcriptions
- Useful for creating subtitles or precise audio analysis
- Gradio: Web interface framework
- Transformers: Hugging Face model library
- Librosa: Audio processing library
- PyTorch: Deep learning framework
- Soundfile: Audio file I/O
- Model: distil-whisper/distil-small.en
- Language: English only
- Sample Rate: 16kHz (automatically resampled)
- Chunk Processing: 30-second chunks for long audio
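The settings above could be wired into a Transformers pipeline roughly as follows. This is a sketch under assumptions, not the code in app.py: the constant and function names are hypothetical, and only the model ID, sample rate, and chunk length come from the list above.

```python
MODEL_ID = "distil-whisper/distil-small.en"  # English-only model listed above
SAMPLE_RATE = 16_000    # Hz; input audio is resampled to this rate
CHUNK_LENGTH_S = 30     # seconds of audio per chunk for long recordings


def pipeline_kwargs() -> dict:
    """Collect the pipeline settings in one place (hypothetical helper)."""
    return {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        "chunk_length_s": CHUNK_LENGTH_S,
    }


def build_pipeline():
    """Create the speech recognition pipeline.

    The import is deferred so the settings can be inspected without
    loading Transformers or downloading the model.
    """
    from transformers import pipeline

    return pipeline(**pipeline_kwargs())
```

At 16 kHz, one 30-second chunk is `SAMPLE_RATE * CHUNK_LENGTH_S` = 480,000 samples.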
- Load Audio: Using librosa with automatic format detection
- Convert to Mono: Stereo audio is converted to mono
- Resample: Audio is resampled to 16kHz if needed
- Transcribe: Processed through Whisper model
- Format Output: Clean text output with optional timestamps
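The steps above can be sketched as three small functions. This is an illustrative outline rather than the actual contents of app.py: the function names are hypothetical, `asr` stands for a Transformers speech recognition pipeline created elsewhere, and the bracketed timestamp layout is an assumed output format.

```python
def load_audio(path: str, target_sr: int = 16_000):
    """Load a file, downmix to mono, and resample in one call."""
    import librosa  # deferred so the formatting helper works without it

    # sr=target_sr resamples if needed; mono=True averages the channels.
    audio, sample_rate = librosa.load(path, sr=target_sr, mono=True)
    return audio, sample_rate


def transcribe(asr, audio, sample_rate: int, with_timestamps: bool = False) -> dict:
    """Run the pipeline on raw samples (asr is any compatible callable)."""
    return asr(
        {"raw": audio, "sampling_rate": sample_rate},
        return_timestamps=with_timestamps,
    )


def format_output(result: dict, with_timestamps: bool = False) -> str:
    """Return clean text, or one line per segment with its time range."""
    if not with_timestamps:
        return result["text"].strip()
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The final segment may have no end time.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)
```

A typical call chain would be `audio, sr = load_audio("clip.wav")`, then `format_output(transcribe(asr, audio, sr))`, where `clip.wav` is a placeholder file name.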
speech-to-text-app/
├── app.py # Main application file
├── pyproject.toml # Project dependencies and configuration
├── README.md # This file
└── .python-version # Python version specification (created by uv)
The modular design makes it easy to extend:
- New Models: Replace the model in SpeechToTextApp.__init__()
- Audio Formats: Librosa supports most common formats automatically
- UI Customization: Modify the CSS and Gradio components
- Processing Options: Add new transcription parameters
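One way to make the model swappable is to pass its ID into the constructor. The class body below is a hypothetical sketch: only the class name and the default model ID come from this README, and the real SpeechToTextApp.__init__() may be organized differently.

```python
class SpeechToTextApp:
    """Hypothetical outline; the real class in app.py may differ."""

    def __init__(self, model_id: str = "distil-whisper/distil-small.en"):
        self.model_id = model_id
        self._asr = None  # pipeline is created on first use

    @property
    def asr(self):
        """Load the pipeline lazily so construction stays cheap."""
        if self._asr is None:
            from transformers import pipeline

            self._asr = pipeline(
                "automatic-speech-recognition",
                model=self.model_id,
                chunk_length_s=30,  # matches the chunk size used for long audio
            )
        return self._asr
```

With this layout, trying another checkpoint is a one-line change, for example `SpeechToTextApp(model_id="openai/whisper-small")` for a multilingual model.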
- Install development dependencies:
    uv sync --extra dev
- Run with auto-reload:
    uv run gradio app.py
- Model Download: First run may take time to download the model
- Memory Usage: Large audio files may require more RAM
- Browser Permissions: Ensure microphone access is granted
- Audio Format: If upload fails, try converting to WAV or MP3
- Shorter Clips: Under 5 minutes for best performance
- Clear Audio: Minimal background noise improves accuracy
- Good Microphone: Higher quality input = better transcription
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit issues and pull requests.
- Hugging Face for the Transformers library and models
- Gradio for the excellent web UI framework
- OpenAI Whisper for the base model architecture
Made with ❤️ using Gradio and Transformers