AI-powered domain-specific English learning assistant with Hybrid Search (SQL+Vector), dynamic Visual Context, AI TTS (Text-to-Speech) caching, interactive voice recording, and contextual LLM explanations


🧠 DeepGloss

Python 3.10+ Streamlit SQLite ChromaDB License: MIT

DeepGloss is a smart, domain-specific English learning assistant built with Streamlit and powered by Large Language Models (LLMs) and Vector Database technology.

Unlike generic dictionary apps, DeepGloss focuses on contextual learning within specific domains (e.g., "Stanford CS336 Lectures", "Legal English", "Medical Terms"). It lets users import custom vocabulary and corpora, automatically fetches definitions, generates Text-to-Speech (TTS) audio, extracts dynamic visual context (images) to aid understanding of complex professional vocabulary, provides context-aware AI translations, and offers interactive voice recording for pronunciation comparison. Crucially, it features Hybrid Search (SQL + Vector) to find relevant example sentences even when exact keywords are missing, and provides an Efficient Library Governance suite for deterministic data maintenance and strategic library pruning.


📸 Screenshots

1. Clean & Modern Vocabulary List

Seamlessly sort, search, and view inline definitions via hover popovers without leaving the page.

Vocabulary List

2. Interactive Study Dialog

Practice pronunciation with the built-in HTML5 mic widget, compare with native TTS, visualize abstract concepts with automatically fetched contextual images, and get AI-powered contextual explanations. The system retrieves sentences via both Keyword Match and Semantic Vector Search.

Practice Dialog

3. Smart Data Import Center

Manage domains and import vocabulary, raw corpus (SQL), and semantic embeddings (VectorDB) with intelligent deduplication in one place.

Data Import

4. Efficient Library Governance

Toggle terms with smooth blue/gray switches, rate importance with star icons, edit definitions via "Double-Click to Edit", and commit changes with transactional page-level saves, keeping the workspace clean yet powerful.

Manage Vocabulary


✨ Key Features

📥 Smart Data Ingestion

  • Domain Management: Organize your learning materials into isolated domains.
  • Flexible Import: Import vocabulary (with frequencies) and contextual sentences via CSV/Excel/TXT uploads or manual entry.
  • Intelligent Deduplication: Automatically skips existing terms during import (case-insensitive) to maintain a clean database.
  • Vector Indexing: One-click generation of embeddings for your corpus using the industrial-grade BGE-M3 model (via ChromaDB) to enable semantic search.
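
The case-insensitive deduplication step can be sketched roughly as below; the `terms` table and column names are illustrative assumptions for this example, not DeepGloss's actual schema:

```python
import sqlite3

def import_terms(conn, terms):
    """Insert (word, frequency) pairs, skipping case-insensitive duplicates."""
    cur = conn.cursor()
    # Snapshot existing words in lowercase for O(1) duplicate checks
    existing = {row[0].lower() for row in cur.execute("SELECT word FROM terms")}
    inserted = 0
    for word, freq in terms:
        if word.lower() in existing:
            continue  # duplicate: skip to keep the database clean
        cur.execute("INSERT INTO terms (word, frequency) VALUES (?, ?)", (word, freq))
        existing.add(word.lower())
        inserted += 1
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (word TEXT, frequency INTEGER)")
# "attention" differs from "Attention" only by case, so it is skipped
n = import_terms(conn, [("Attention", 42), ("attention", 42), ("Softmax", 7)])
```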

📖 Minimalist & Powerful Study Mode

  • Client-side Pagination & Sorting: Lightning-fast UI with in-memory pagination. Sort vocabulary by Word (A-Z), Frequency, or Importance Level (Stars).
  • Advanced Filtering: Filter your study list by specific domains or star levels.
  • Real-time Search: Instantly find terms in your current list with a responsive search bar.
  • Hover Definitions: Clean UI using popovers to view definitions without leaving the list.
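
The in-memory pagination and sorting idea can be illustrated with a small stand-alone sketch; the field names (`word`, `frequency`, `stars`) are assumptions for demonstration:

```python
def paginate(rows, sort_key, page, page_size=2, reverse=False):
    """Sort an in-memory list of dicts, then slice out one page."""
    ordered = sorted(rows, key=lambda r: r[sort_key], reverse=reverse)
    start = page * page_size
    return ordered[start:start + page_size]

vocab = [
    {"word": "gradient", "frequency": 12, "stars": 3},
    {"word": "attention", "frequency": 40, "stars": 5},
    {"word": "softmax", "frequency": 25, "stars": 4},
]

# First page, sorted by frequency (highest first)
top = paginate(vocab, "frequency", page=0, reverse=True)
```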

🤖 AI-Powered Interactive Study

  • Seamless Navigation: Switch instantly between words using "⬅️ Prev" and "Next ➡️" buttons without closing the dialog, ensuring an uninterrupted learning flow.
  • Hybrid Search Engine: Combines SQLite (Exact Match) and ChromaDB (Semantic Match). If an exact sentence isn't found, it finds the most semantically similar sentence from the VectorDB (e.g., searching "GQA" finds sentences about "Group Query Attention").
  • Context-Aware Explanations: Uses LLMs to translate sentences and explain exactly what a term means within that specific context.
  • Auto-Fetch Definitions: If a term lacks a definition, the system automatically calls the LLM in the background to fetch a precise English definition and Chinese translation.
  • Visual Context for Professional Vocabulary:
    • Multi-Dimensional Image Search: Grasp complex or abstract terms instantly. The system automatically scrapes Google Images (with Bing as a seamless fallback) using a combined 3-tier strategy: Term alone, Term + Definition, and Term + Contextual Sentence to fetch highly accurate visual representations.
    • Asynchronous Loading & Randomized Regeneration: Images load via a non-blocking UI mechanism (with a JS loading spinner) so you can study text while images fetch in the background. Not satisfied with the first batch? Click Regenerate to randomly sample a new set of images from a broader candidate pool of top search results, ensuring diverse visual perspectives.
    • Local Image Caching: Once saved, images are downloaded directly to your local cache and linked via relative paths in the SQLite database, ensuring zero-latency loads and offline availability for future reviews.
  • Built-in Mic Widget: Record your own voice directly in the browser and compare it with the generated TTS audio for pronunciation practice.
  • Audio & Pronunciation:
    • Generate high-quality TTS audio for words and full sentences on the fly.
    • Local Audio Caching: Generated audio is cached locally (path configurable via config.yaml) to save API costs and speed up loading.
  • Importance Rating: Rate terms from 1 to 5 stars (⭐⭐⭐⭐⭐) to prioritize your learning.
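
The exact-match-then-semantic-fallback flow can be sketched in miniature. The toy 2-D "embeddings" and in-memory store below are purely illustrative stand-ins for BGE-M3 and ChromaDB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(term, sql_rows, vector_store, embed):
    # Layer 1: exact (case-insensitive) keyword match in the SQL corpus
    exact = [s for s in sql_rows if term.lower() in s.lower()]
    if exact:
        return exact[0]
    # Layer 2: semantic fallback, nearest neighbour in the vector store
    q = embed(term)
    return max(vector_store, key=lambda item: cosine(q, item[1]))[0]

# Fake 2-D embeddings for demonstration only
fake_embed = {"GQA": [0.9, 0.1]}.get
store = [
    ("Group Query Attention shares key/value heads.", [0.8, 0.2]),
    ("Softmax normalises the logits.", [0.1, 0.9]),
]
# "GQA" never appears literally, so the semantic fallback finds the
# sentence about Group Query Attention
hit = hybrid_search("GQA", ["Dropout regularises the network."], store, fake_embed)
```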

πŸ› οΈ Efficient Library Governance

  • Efficient Toggles: Instantly enable/disable terms with visual feedback (Blue for ON, Gray for OFF).
  • Double-Click to Expand: Definitions are displayed as clean labels and only expand into multi-line editors upon double-clicking, preventing accidental edits.
  • Unified Visuals: Star levels are managed via intuitive icon pickers (⭐) instead of raw numbers.
  • Transactional Page Commits: Commit all modifications on a single page with one click, ensuring high-speed bulk updates while maintaining data integrity through a defensive "Save-then-Navigate" workflow.
  • Global Operation Flow: Perform global sorting across the entire database (Frequency, Word, Level) and save changes page-by-page to ensure data integrity.
  • Self-Healing Logic: Automatically deduplicates legacy "dirty data" in SQL matches to ensure UI stability.
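
A transactional page-level commit might look roughly like this sketch; the table and column names are hypothetical, not the app's real schema:

```python
import sqlite3

def save_page(conn, edits):
    """Commit all edits on one page atomically.

    edits: list of (term_id, enabled, stars) tuples. A single failure
    rolls the whole page back, leaving the database untouched.
    """
    try:
        with conn:  # sqlite3 connection context manager wraps a transaction
            conn.executemany(
                "UPDATE terms SET enabled = ?, stars = ? WHERE id = ?",
                [(en, st, tid) for tid, en, st in edits],
            )
    except sqlite3.Error:
        return False  # transaction rolled back
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, enabled INTEGER, stars INTEGER)")
conn.execute("INSERT INTO terms VALUES (1, 1, 3), (2, 1, 2)")
ok = save_page(conn, [(1, 0, 5), (2, 1, 4)])
```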

πŸ› οΈ Technology Stack

  • Frontend: Streamlit (Custom CSS, JS & Components)
  • Backend: Python 3.10+
  • Database:
    • Structured: SQLite3 (Metadata, Terms, Links, Image/Audio Paths)
    • Vector: ChromaDB (Semantic Embeddings)
  • AI Models:
    • LLM: OpenAI / DeepSeek / Moonshot (via OpenAI-compatible API)
    • Embedding: BAAI/bge-m3 (State-of-the-art English/Chinese embedding)
  • Data Processing: Pandas, Regex
  • Web Scraping: Native urllib & re (Lightweight Google/Bing Image extraction)
  • Config: YAML + DotEnv
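
The lightweight-scraping approach can be illustrated with a toy regex extractor. The pattern and helper below are assumptions, not the project's actual `image_scraper.py` logic; a real run would first fetch the results page with `urllib.request`, while this example parses a static snippet so it stays offline:

```python
import re

# Match common image URLs; (?:...) keeps findall returning full matches
IMG_RE = re.compile(r"""https?://[^"' ]+?\.(?:jpg|jpeg|png)""", re.IGNORECASE)

def extract_image_urls(html, limit=5):
    """Pull candidate image URLs from raw HTML, deduplicated in order."""
    seen, urls = set(), []
    for url in IMG_RE.findall(html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls[:limit]  # cap the candidate pool

html = ('<img src="https://example.com/a.jpg"><img src="https://example.com/a.jpg">'
        '<img src="https://example.com/b.png">')
candidates = extract_image_urls(html)
```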

📂 Project Architecture

DeepGloss follows a clean, modular, and maintainable architecture separating UI, services, and storage logic:

DeepGloss/
├── app/
│   ├── database/        # SQLite logic & schemas
│   │   ├── db_manager.py
│   │   └── schema.sql
│   ├── services/        # AI & Core Services
│   │   ├── ingestion.py      # Text processing
│   │   ├── llm_client.py     # Universal LLM client
│   │   ├── tts_manager.py    # Text-to-Speech with caching
│   │   └── vector_manager.py # ChromaDB vector operations
│   ├── ui/              # Modular UI components
│   │   ├── mic_widget.py
│   │   ├── components.py
│   │   └── study_dialog.py
│   └── utils/           # Helper scripts
│       ├── image_scraper.py  # Web scraping for contextual images
│       └── ...
├── data/                # Data storage
│   ├── audio_cache/     # MP3 cache (auto-generated)
│   ├── image_cache/     # Downloaded image assets (auto-generated)
│   ├── vector_store/    # ChromaDB files (auto-generated)
│   └── deepgloss.db     # SQLite database file
├── pages/               # Streamlit pages
│   ├── edit_vocabulary.py  # Efficient Library Governance
│   ├── import_data.py      # Data ingestion (Terms/SQL/Vector)
│   └── study_mode.py       # Main study interface
├── .env                 # API keys (Git-ignored)
├── config.py            # Config loader script
├── config.yaml          # App settings (paths, models)
├── main.py              # Entry point
├── requirements.txt     # Python dependencies
└── start.bat            # Windows quick-start script



🚀 Getting Started

1. Prerequisites

  • Python 3.10+
  • Conda (recommended for managing the PyTorch/VectorDB environment)
  • An API key for an OpenAI-compatible LLM provider (OpenAI, DeepSeek, Moonshot, etc.)

2. Clone the Repository

git clone https://github.com/Eric-LLMs/DeepGloss.git
cd DeepGloss

3. Install Dependencies

It is highly recommended to use Conda to manage the environment to ensure compatibility with PyTorch and VectorDB dependencies.

# 1. Create a new Conda environment named 'DeepGloss' with Python 3.10
conda create -n DeepGloss python=3.10 -y

# 2. Activate the environment
conda activate DeepGloss

# 3. Install required packages
# (Pip is used here to ensure strict compatibility with the requirements.txt file)
pip install -r requirements.txt

4. Configuration

Step A: API Keys (.env). Create a .env file in the root directory:

# Required: Your LLM API Key (OpenAI, DeepSeek, etc.)
LLM_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx

# Optional: Base URL (Defaults to OpenAI. Change for DeepSeek: https://api.deepseek.com)
LLM_BASE_URL=https://api.openai.com/v1
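
For illustration, a minimal stand-alone loader for such a file might look like the sketch below. In practice a library like python-dotenv handles this; the parser here is a simplified, hypothetical version:

```python
import os

def load_env(text):
    """Parse KEY=VALUE lines into the environment, skipping comments/blanks."""
    loaded = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # ignore blanks, comments, and malformed lines
        key, _, value = line.partition("=")  # split on the first '=' only
        key, value = key.strip(), value.strip()
        os.environ.setdefault(key, value)  # do not clobber existing vars
        loaded[key] = value
    return loaded

env = load_env("# comment\nLLM_API_KEY=sk-test\nLLM_BASE_URL=https://api.openai.com/v1\n")
```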

Step B: App Settings (config.yaml). Configure storage paths and models:

storage:
  # Path to store TTS audio. Relative paths work fine.
  audio_cache_path: "data/audio_cache"
  # Path to store fetched images for visual context.
  image_cache_path: "data/image_cache"

models:
  llm: "o3-mini"      # Model for explanation
  tts: "tts-1-hd"     # Model for speech
  tts_voice: "alloy"

5. Run the Application

Start the Streamlit development server:

streamlit run main.py

(Alternatively, simply double-click the start.bat file if you are on Windows.) On first run, the system automatically downloads the embedding model (~2GB) and initializes the databases.


💡 How to Use

  1. Create Domain: Navigate to Import Data -> Domain Management to start a new topic (e.g., "AI Research Papers").
  2. Import Terms: Switch to Import Vocabulary. Upload your vocabulary CSV or paste text directly.
  3. Build Corpus (Two Layers):
    • Layer 1 (SQL): Import sentences via Import Sentences (SQL) for exact keyword matching.
    • Layer 2 (Vector): Import raw text via Import VectorDB to enable AI Semantic Search.
  4. Interactive Study: Navigate to study_mode and click the 🤿 Deep Dive icon to open the modal, where you can generate TTS audio, view AI definitions, record and compare your pronunciation, get context-aware sentence translations, understand concepts visually via images, navigate seamlessly with the Next/Prev buttons, and finally Save the best context and visuals to your database.
  5. Library Governance: Navigate to the Manage Vocabulary suite to perform global sorting (by frequency or star level), refine definitions via intentional double-click editing, and toggle term visibility to maintain a high-precision language library.


📝 License

This project is licensed under the MIT License. See the LICENSE file for details.


