DeepGloss is a smart, domain-specific English learning assistant built with Streamlit and powered by Large Language Models (LLMs) and Vector Database technology.
Unlike generic dictionary apps, DeepGloss focuses on contextual learning within specific domains (e.g., "Stanford CS336 Lectures", "Legal English", "Medical Terms"). It lets users import custom vocabulary and corpora, automatically fetches definitions, generates Text-to-Speech (TTS) audio, retrieves contextual images to aid understanding of complex professional vocabulary, provides context-aware AI translations, and offers interactive voice recording for pronunciation comparison. Crucially, it features Hybrid Search (SQL + Vector) to find relevant example sentences even when exact keywords are missing, and an Efficient Library Governance suite for deterministic data maintenance and strategic library pruning.
1. Clean & Modern Vocabulary List
Seamlessly sort, search, and view inline definitions via hover popovers without leaving the page.
2. Interactive Study Dialog
Practice pronunciation with the built-in HTML5 mic widget, compare with native TTS, visualize abstract concepts with automatically fetched contextual images, and get AI-powered contextual explanations. The system retrieves sentences via both Keyword Match and Semantic Vector Search.
3. Smart Data Import Center
Manage domains and import vocabulary, raw corpus (SQL), and semantic embeddings (VectorDB) with intelligent deduplication in one place.
4. Efficient Library Governance
Toggle terms with smooth blue/gray switches, rate importance with star icons, use "Double-Click to Edit" for definitions, and transactional page-level saves to keep the workspace clean yet powerful.
- Domain Management: Organize your learning materials into isolated domains.
- Flexible Import: Import vocabulary (with frequencies) and contextual sentences via CSV/Excel/TXT uploads or manual entry.
- Intelligent Deduplication: Automatically skips existing terms during import (case-insensitive) to maintain a clean database.
- Vector Indexing: One-click generation of embeddings for your corpus using the industrial-grade BGE-M3 model (via ChromaDB) to enable semantic search.
- Client-side Pagination & Sorting: Lightning-fast UI with in-memory pagination. Sort vocabulary by Word (A-Z), Frequency, or Importance Level (Stars).
- Advanced Filtering: Filter your study list by specific domains or star levels.
- Real-time Search: Instantly find terms in your current list with a responsive search bar.
- Hover Definitions: Clean UI using popovers to view definitions without leaving the list.
- Seamless Navigation: Switch instantly between words using "⬅️ Prev" and "Next ➡️" buttons without closing the dialog, ensuring an uninterrupted learning flow.
- Hybrid Search Engine: Combines SQLite (Exact Match) and ChromaDB (Semantic Match). If an exact sentence isn't found, it finds the most semantically similar sentence from the VectorDB (e.g., searching "GQA" finds sentences about "Group Query Attention").
- Context-Aware Explanations: Uses LLMs to translate sentences and explain exactly what a term means within that specific context.
- Auto-Fetch Definitions: If a term lacks a definition, the system automatically calls the LLM in the background to fetch a precise English definition and Chinese translation.
- Visual Context for Professional Vocabulary:
- Multi-Dimensional Image Search: Grasp complex or abstract terms instantly. The system automatically scrapes Google Images (with Bing as a seamless fallback) using a combined 3-tier strategy: Term alone, Term + Definition, and Term + Contextual Sentence to fetch highly accurate visual representations.
- Asynchronous Loading & Randomized Regeneration: Images load via a non-blocking UI mechanism (with a JS loading spinner) so you can study text while images fetch in the background. Not satisfied with the first batch? Click Regenerate to randomly sample a new set of images from a broader candidate pool of top search results, ensuring diverse visual perspectives.
- Local Image Caching: Once saved, images are downloaded directly to your local cache and linked via relative paths in the SQLite database, ensuring zero-latency loads and offline availability for future reviews.
- Built-in Mic Widget: Record your own voice directly in the browser and compare it with the generated TTS audio for pronunciation practice.
- Audio & Pronunciation:
- Generate high-quality TTS audio for words and full sentences on the fly.
- Local Audio Caching: Generated audio is cached locally (path configurable via config.yaml) to save API costs and speed up loading.
- Importance Rating: Rate terms from 1 to 5 stars (★★★★★) to prioritize your learning.
- Efficient Toggles: Instantly enable/disable terms with visual feedback (Blue for ON, Gray for OFF).
- Double-Click to Expand: Definitions are displayed as clean labels and only expand into multi-line editors upon double-clicking, preventing accidental edits.
- Unified Visuals: Star levels are managed via intuitive icon pickers (★) instead of raw numbers.
- Transactional Page Commits: Commit all modifications on a single page with one click, ensuring high-speed bulk updates while maintaining data integrity through a defensive "Save-then-Navigate" workflow.
- Global Operation Flow: Perform global sorting across the entire database (Frequency, Word, Level) and save changes page-by-page to ensure data integrity.
- Self-Healing Logic: Automatically deduplicates legacy "dirty data" in SQL matches to ensure UI stability.
- Frontend: Streamlit (Custom CSS, JS & Components)
- Backend: Python 3.10+
- Database:
- Structured: SQLite3 (Metadata, Terms, Links, Image/Audio Paths)
- Vector: ChromaDB (Semantic Embeddings)
- AI Models:
- LLM: OpenAI / DeepSeek / Moonshot (via OpenAI-compatible API)
- Embedding: BAAI/bge-m3 (State-of-the-art English/Chinese embedding)
- Data Processing: Pandas, Regex
- Web Scraping: Native urllib & re (lightweight Google/Bing image extraction)
- Config: YAML + DotEnv
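The urllib + re approach can be illustrated with a minimal sketch. The HTML snippet and the "murl" field below are assumptions modeled on Bing-style image results; real Google/Bing markup differs and changes often, so treat this as a demonstration of the regex technique only. (In the real scraper the page would come from urllib.request.urlopen rather than a static string.)

```python
import re

# Hypothetical fragment of an image-search results page; actual markup varies.
html = '''
<a class="iusc" m='{"murl":"https://example.com/attention.jpg","t":"GQA"}'></a>
<a class="iusc" m='{"murl":"https://example.com/tokenizer.png","t":"BPE"}'></a>
'''

def extract_image_urls(page: str, limit: int = 10) -> list[str]:
    # Bing-style pages embed the full-size image URL in a JSON "murl" field;
    # a non-greedy capture pulls each URL out without a full HTML parser.
    return re.findall(r'"murl":"(.*?)"', page)[:limit]

urls = extract_image_urls(html)
```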
DeepGloss follows a clean, modular, and maintainable architecture separating UI, services, and storage logic:
DeepGloss/
├── app/
│   ├── database/            # SQLite logic & schemas
│   │   ├── db_manager.py
│   │   └── schema.sql
│   ├── services/            # AI & Core Services
│   │   ├── ingestion.py     # Text processing
│   │   ├── llm_client.py    # Universal LLM client
│   │   ├── tts_manager.py   # Text-to-Speech with caching
│   │   └── vector_manager.py # ChromaDB Vector operations
│   ├── ui/                  # Modular UI components
│   │   ├── mic_widget.py
│   │   ├── components.py
│   │   └── study_dialog.py
│   └── utils/               # Helper scripts
│       ├── image_scraper.py # Web scraping for contextual images
│       └── ...
├── data/                    # Data Storage
│   ├── audio_cache/         # MP3 Cache (Auto-generated)
│   ├── image_cache/         # Downloaded image assets (Auto-generated)
│   ├── vector_store/        # ChromaDB Files (Auto-generated)
│   └── deepgloss.db         # SQLite Database File
├── pages/                   # Streamlit Pages
│   ├── edit_vocabulary.py   # Efficient Library Governance
│   ├── import_data.py       # Data Ingestion (Terms/SQL/Vector)
│   └── study_mode.py        # Main Study Interface
├── .env                     # API Keys (Git ignored)
├── config.py                # Config Loader Script
├── config.yaml              # App Settings (Paths, Models)
├── main.py                  # Entry Point
├── requirements.txt         # Python Dependencies
└── start.bat                # Windows Quick-Start Script
git clone https://github.com/Eric-LLMs/DeepGloss.git
cd DeepGloss
It is highly recommended to use Conda to manage the environment to ensure compatibility with PyTorch and VectorDB dependencies.
# 1. Create a new Conda environment named 'DeepGloss' with Python 3.10
conda create -n DeepGloss python=3.10 -y
# 2. Activate the environment
conda activate DeepGloss
# 3. Install required packages
# (Pip is used here to ensure strict compatibility with the requirements.txt file)
pip install -r requirements.txt
Step A: API Keys (.env)
Create a .env file in the root directory.
# Required: Your LLM API Key (OpenAI, DeepSeek, etc.)
LLM_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx
# Optional: Base URL (Defaults to OpenAI. Change for DeepSeek: https://api.deepseek.com)
LLM_BASE_URL=https://api.openai.com/v1
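The project loads these values through its DotEnv config layer; the idea can be sketched with the standard library alone. This is an illustrative loader, not the project's actual config.py, and the sample file path and values are made up for the demo.

```python
import os
import tempfile

# Write a sample .env for demonstration; in the app the file sits in the repo root.
sample = os.path.join(tempfile.gettempdir(), "deepgloss_demo.env")
with open(sample, "w", encoding="utf-8") as fh:
    fh.write("# Required\nLLM_API_KEY=sk-demo\nLLM_BASE_URL=https://api.deepseek.com\n")

def load_env(path: str) -> dict[str, str]:
    # Minimal KEY=VALUE parser: skips blanks and comments, exports via
    # setdefault so variables already set in the real environment win.
    parsed: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            parsed[key.strip()] = value.strip()
            os.environ.setdefault(key.strip(), value.strip())
    return parsed

cfg = load_env(sample)
```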
Step B: App Settings (config.yaml)
Configure storage paths and models.
storage:
# Path to store TTS audio. Relative paths work fine.
audio_cache_path: "data/audio_cache"
# Path to store fetched images for visual context.
image_cache_path: "data/image_cache"
models:
llm: "o3-mini" # Model for explanation
tts: "tts-1-hd" # Model for speech
tts_voice: "alloy"
Start the Streamlit development server:
streamlit run main.py
(Alternatively, simply double-click the start.bat file if you are on Windows. On first run, the system will automatically download the embedding model (~2GB) and initialize the databases.)
1. Create Domain: Navigate to Import Data -> Domain Management to start a new topic (e.g., "AI Research Papers").
2. Import Terms: Switch to Import Vocabulary. Upload your vocabulary CSV or paste text directly.
3. Build Corpus (Two Layers):
   - Layer 1 (SQL): Import sentences to Import Sentences (SQL) for exact keyword matching.
   - Layer 2 (Vector): Import raw text to Import VectorDB to enable AI Semantic Search.
4. Interactive Study: Navigate to study_mode and click the 🤿 Deep Dive icon to open the modal, where you can generate TTS audio, view AI definitions, record and compare your pronunciation, get context-aware sentence translations, understand concepts visually via images, navigate seamlessly via Next/Prev buttons, and finally save the best context and visuals to your database.
5. Library Governance: Navigate to the Manage Vocabulary suite to perform global sorting (by frequency or star level), refine definitions via intentional double-click editing, and toggle term visibility to maintain a high-precision language library.
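The local audio cache that backs TTS generation can be sketched as a content-addressed store: hash the text, reuse an existing MP3 on a hit, and call the TTS API only on a miss. The `synthesize` callable below is a placeholder for the real API call, not the project's actual tts_manager interface.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable

def cached_tts(text: str, cache_dir: Path, synthesize: Callable[[str], bytes]) -> Path:
    # Content-addressed cache: identical text always maps to the same file,
    # so repeated study sessions never pay for the same synthesis twice.
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(text.encode("utf-8")).hexdigest() + ".mp3")
    if not path.exists():            # cache miss -> exactly one API call
        path.write_bytes(synthesize(text))
    return path

# Demo with a fake synthesizer; the real app would hit the TTS endpoint here.
calls = []
def fake_synth(text: str) -> bytes:
    calls.append(text)
    return b"MP3" + text.encode("utf-8")

cache = Path(tempfile.mkdtemp())
p1 = cached_tts("hello", cache, fake_synth)
p2 = cached_tts("hello", cache, fake_synth)  # second call served from cache
```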
This project is licensed under the MIT License. See the LICENSE file for details.