afsalmarattil/note_taker

🎤 Note Taker

A web application that converts speech to text using the distil-whisper/distil-small.en speech recognition model. Built with Gradio and powered by Hugging Face Transformers.

✨ Features

  • Microphone Recording: Record audio directly in your browser
  • File Upload: Support for various audio formats (WAV, MP3, FLAC, M4A, etc.)
  • Real-time Transcription: Fast and accurate speech recognition
  • Timestamp Support: Optional timestamps for each transcribed segment
  • Clean UI: Modern, responsive interface built with Gradio
  • Local Processing: Audio is transcribed on your machine and is not sent to external servers (the model itself is downloaded once, on first run)

🚀 Quick Start

Prerequisites

  • Python 3.10 or higher
  • uv package manager (see the official uv installation guide)

Installation

  1. Clone or create the project directory:

    mkdir speech-to-text-app
    cd speech-to-text-app
  2. Install dependencies using uv:

    uv sync
  3. Run the application:

    uv run python app.py
  4. Open your browser: Navigate to http://localhost:7860 to use the application.

📖 Usage

Recording Audio

  1. Click on the "🎙️ Record Audio" tab
  2. Click the microphone button to start recording
  3. Speak clearly into your microphone
  4. Click stop when finished
  5. Click "Transcribe Recording" to convert to text

Uploading Audio Files

  1. Click on the "📁 Upload Audio File" tab
  2. Drag and drop or browse for your audio file
  3. Click "Transcribe File" to convert to text
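
For readers who want to see how the two tabs described above could be wired together, here is a minimal sketch. It is not the code in app.py: the function name `build_ui` and the `transcribe_fn` callable are hypothetical, and the `sources=` keyword assumes Gradio 4 or later (earlier releases used `source=`).

```python
# Tab labels taken from the usage steps above.
RECORD_TAB = "🎙️ Record Audio"
UPLOAD_TAB = "📁 Upload Audio File"


def build_ui(transcribe_fn):
    """Build a two-tab Gradio interface around a transcription callable.

    transcribe_fn is hypothetical: it takes an audio file path and returns text.
    """
    import gradio as gr  # imported lazily so this module loads without Gradio

    with gr.Blocks(title="Note Taker") as demo:
        with gr.Tab(RECORD_TAB):
            # type="filepath" hands the callback a path to a temporary audio file
            mic = gr.Audio(sources=["microphone"], type="filepath")
            mic_btn = gr.Button("Transcribe Recording")
            mic_out = gr.Textbox(label="Transcription")
            mic_btn.click(transcribe_fn, inputs=mic, outputs=mic_out)
        with gr.Tab(UPLOAD_TAB):
            upload = gr.Audio(sources=["upload"], type="filepath")
            upload_btn = gr.Button("Transcribe File")
            upload_out = gr.Textbox(label="Transcription")
            upload_btn.click(transcribe_fn, inputs=upload, outputs=upload_out)
    return demo
```

Calling `build_ui(my_fn).launch(server_port=7860)` would serve the interface on the port mentioned in the Quick Start.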

Timestamps (Optional)

  • Use "Transcribe with Timestamps" buttons to get time-coded transcriptions
  • Useful for creating subtitles or precise audio analysis
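
As an illustration of the subtitle use case, the sketch below converts timestamped segments into SRT text. It assumes the segments arrive as a list of dicts with a `timestamp` pair (start, end, in seconds) and a `text` string, which is the shape the Transformers speech recognition pipeline returns under `chunks` when timestamps are requested. The helper names are hypothetical and not part of this project.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Turn [{"timestamp": (start, end), "text": str}, ...] into SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start  # assumption: an open-ended final segment gets zero length
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

Writing the returned string to a `.srt` file gives a subtitle track that most video players can load.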

🛠️ Technical Details

Dependencies

  • Gradio: Web interface framework
  • Transformers: Hugging Face model library
  • Librosa: Audio processing library
  • PyTorch: Deep learning framework
  • Soundfile: Audio file I/O

Model Information

  • Model: distil-whisper/distil-small.en
  • Language: English only
  • Sample Rate: 16 kHz (automatically resampled)
  • Chunk Processing: 30-second chunks for long audio
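
The settings above map fairly directly onto the Transformers `pipeline` factory. The following is a sketch under that assumption, not a copy of app.py; the constant and function names are hypothetical, and `estimated_chunks` is only a rough count that ignores the overlap the pipeline applies between chunks.

```python
import math

MODEL_ID = "distil-whisper/distil-small.en"  # from the list above
SAMPLE_RATE = 16_000                          # Hz
CHUNK_LENGTH_S = 30                           # seconds per chunk


def estimated_chunks(duration_s, chunk_length_s=CHUNK_LENGTH_S):
    """Rough chunk count for a clip, ignoring any overlap between chunks."""
    return max(1, math.ceil(duration_s / chunk_length_s))


def build_asr_pipeline(model_id=MODEL_ID):
    """Create a Transformers ASR pipeline that splits long audio into chunks."""
    # Imported lazily; the first call downloads the model weights if needed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=CHUNK_LENGTH_S,
    )
```

By this estimate, a five-minute clip is processed as roughly ten chunks.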

Audio Processing Pipeline

  1. Load Audio: Using librosa with automatic format detection
  2. Convert to Mono: Stereo audio is converted to mono
  3. Resample: Audio is resampled to 16 kHz if needed
  4. Transcribe: Processed through Whisper model
  5. Format Output: Clean text output with optional timestamps
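
The five steps above can be sketched as three small functions. This is an illustration under stated assumptions rather than the project's actual code: the function names are hypothetical, `asr` is assumed to be a Transformers speech recognition pipeline, and the `[start - end] text` line format is one possible choice for timestamped output.

```python
def load_audio(path, target_sr=16_000):
    """Steps 1-3: decode the file, downmix to mono, resample to 16 kHz."""
    import librosa  # lazy import so the formatting helper works without it

    samples, sr = librosa.load(path, sr=target_sr, mono=True)
    return samples, sr


def transcribe(asr, path, with_timestamps=False):
    """Step 4: run the ASR pipeline on the prepared samples."""
    samples, sr = load_audio(path)
    return asr({"raw": samples, "sampling_rate": sr},
               return_timestamps=with_timestamps)


def format_output(result, with_timestamps=False):
    """Step 5: return plain text, or one "[start - end] text" line per chunk."""
    if not with_timestamps:
        return result["text"].strip()
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The final chunk can come back without an end time.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)
```

Passing `sr=16_000, mono=True` to `librosa.load` covers decoding, downmixing, and resampling in a single call, which is why steps 1 to 3 collapse into one function here.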

🔧 Development

Project Structure

speech-to-text-app/
├── app.py              # Main application file
├── pyproject.toml      # Project dependencies and configuration
├── README.md           # This file
└── .python-version     # Python version specification (created by uv)

Adding Features

The modular design makes it easy to extend:

  • New Models: Replace the model in SpeechToTextApp.__init__()
  • Audio Formats: Librosa supports most common formats automatically
  • UI Customization: Modify the CSS and Gradio components
  • Processing Options: Add new transcription parameters
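
To make the "New Models" point concrete, here is a hypothetical skeleton of how `SpeechToTextApp.__init__()` might accept the model name as a parameter. The real class in app.py may be organized differently; the attribute names, the lazy `asr` property, and the `transcribe` method are assumptions made for illustration.

```python
class SpeechToTextApp:
    """Hypothetical skeleton; the real class in app.py may differ."""

    def __init__(self, model_id="distil-whisper/distil-small.en", chunk_length_s=30):
        self.model_id = model_id
        self.chunk_length_s = chunk_length_s
        self._asr = None  # built on first use so construction stays cheap

    @property
    def asr(self):
        """Create the Transformers pipeline the first time it is needed."""
        if self._asr is None:
            from transformers import pipeline

            self._asr = pipeline(
                "automatic-speech-recognition",
                model=self.model_id,
                chunk_length_s=self.chunk_length_s,
            )
        return self._asr

    def transcribe(self, audio_path, return_timestamps=False):
        """Transcribe a file path; the pipeline decodes it with ffmpeg."""
        return self.asr(audio_path, return_timestamps=return_timestamps)
```

With this shape, switching to a multilingual checkpoint would be a one-argument change, for example `SpeechToTextApp(model_id="openai/whisper-small")`.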

Development Setup

  1. Install development dependencies:

    uv sync --extra dev
  2. Run with auto-reload:

    uv run gradio app.py

🔍 Troubleshooting

Common Issues

  1. Model Download: First run may take time to download the model
  2. Memory Usage: Large audio files may require more RAM
  3. Browser Permissions: Ensure microphone access is granted
  4. Audio Format: If upload fails, try converting to WAV or MP3
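
For the audio format issue, one way to convert a file to WAV is to call ffmpeg from Python. This sketch assumes ffmpeg is installed and on the PATH, which this project does not state as a requirement; the function names are hypothetical, and the output path is assumed to end in `.wav`.

```python
import subprocess


def ffmpeg_wav_command(src, dst, sample_rate=16_000):
    """Build an ffmpeg command that writes mono audio at the given rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file in any format ffmpeg can decode
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        str(dst),
    ]


def convert_to_wav(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_wav_command(src, dst), check=True)
    return dst
```

For example, `convert_to_wav("memo.m4a", "memo.wav")` would produce a 16 kHz mono file that can then be uploaded.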

Performance Tips

  • Shorter Clips: Under 5 minutes for best performance
  • Clear Audio: Minimal background noise improves accuracy
  • Good Microphone: Higher quality input = better transcription

📝 License

This project is open source and available under the MIT License.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

🙏 Acknowledgments


Made with ❤️ using Gradio and Transformers
