RAG Enhancement: Hybrid Semantic + Full-Text Search for Media Classification #274

## Overview
Enhance the Retrieval Augmented Generation (RAG) system by implementing a true hybrid search pipeline that combines vector similarity (semantic) search with keyword-based (full-text) search. This follows current recommendations for improving RAG reliability and accuracy ([Reference 1](https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/02/04/common-retrieval-augmented-generation-rag-techniques-explained/), [Reference 2](https://collabnix.com/rag-retrieval-augmented-generation-the-complete-guide-to-building-intelligent-ai-systems-in-2025/)).

## Why?
- Increases recall and relevance of retrieved matches
- Catches both semantic similarity and explicit franchise/studio/keyword matches
- Reduces user corrections on edge cases

## Implementation Plan
1. **Backend (Node/Express)**
    - Update `RAGRetriever.hybridSearch()` in `server/src/services/ragRetriever.js` to fuse pgvector results with full-text search from `classification_history`.
    - Use Reciprocal Rank Fusion (RRF) or a weighted average of scores to merge the two ranked lists ([Reference 3](https://arxiv.org/abs/2501.07391)).

    ```js
    // Example hybrid fusion tweak (Node.js excerpt)
    // Inside hybridSearch()
    let results;
    if (fusionMethod === 'rrf') {
        results = this.calculateRRF(semanticMatches, textMatches, rrfK);
    } else {
        results = this.legacyHybridCombine(semanticMatches, textMatches, limit);
    }
    ```

    - Tune the relative weights of vector and keyword matches, and compare the fusion algorithms (RRF vs. weighted average) on the same query set.
    - Update tests in `server/src/__tests__/ragRetriever.rrf.test.js` for all fusion paths.
2. **Database**
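    - Ensure `classification_history` has a full-text search index (e.g., on the `title`, `overview`, and `genres` columns). Queries must use the same `to_tsvector` expression for the planner to use the index.
    - Example SQL (assumes all three columns are text; `coalesce` keeps a NULL column from turning the whole document NULL):
    ```sql
    -- CONCURRENTLY avoids blocking writes but cannot run inside a transaction block.
    CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_history_fts
        ON classification_history USING gin (to_tsvector('english',
            coalesce(title, '') || ' ' || coalesce(overview, '') || ' ' || coalesce(genres, '')));
    ```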
3. **Validation**
    - Benchmark accuracy/recall against pure vector and pure text retrieval
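
To make the RRF option in step 1 concrete, here is a minimal sketch of what `calculateRRF` could look like. It is not the existing implementation: the `id` field on each match, the optional `weights` argument, and the default `k = 60` (a commonly used constant) are assumptions for illustration. Each match contributes `weight / (k + rank)` for every list it appears in, so items found by both retrievers rise to the top.

```javascript
// Hypothetical stand-in for RAGRetriever.calculateRRF(); the real method may differ.
// Both inputs are arrays of matches sorted best-first, each with a stable `id` (assumed).
function calculateRRF(semanticMatches, textMatches, k = 60, weights = { semantic: 1, text: 1 }) {
  const fused = new Map();

  const accumulate = (matches, weight) => {
    matches.forEach((match, index) => {
      const rank = index + 1; // ranks are 1-based
      const entry = fused.get(match.id) ?? { ...match, rrfScore: 0 };
      entry.rrfScore += weight / (k + rank);
      fused.set(match.id, entry);
    });
  };

  accumulate(semanticMatches, weights.semantic);
  accumulate(textMatches, weights.text);

  // Highest fused score first.
  return [...fused.values()].sort((a, b) => b.rrfScore - a.rrfScore);
}

const semantic = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const text = [{ id: 'b' }, { id: 'd' }];
console.log(calculateRRF(semantic, text).map((m) => m.id)); // [ 'b', 'a', 'd', 'c' ]
```

Because RRF uses only ranks, it does not require cosine distances and `ts_rank` values to be on a comparable scale, which is one reason to try it before a weighted average of raw scores.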

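For the validation step, one way to compare hybrid retrieval against the pure vector and pure text baselines is mean recall@k over a small labeled query set. The sketch below is an assumption about how such a benchmark might be wired up: `recallAtK`, `meanRecallAtK`, and the shape of the labeled queries are hypothetical, and `retrieve` stands for any function that returns ids best-first.

```javascript
// Hypothetical helper: fraction of the expected ids that appear in the top-k retrieved ids.
function recallAtK(retrievedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Averages recall@k over a labeled query set for one retriever.
// `retrieve` is any async function: (query) => array of ids, best-first.
async function meanRecallAtK(retrieve, labeledQueries, k) {
  if (labeledQueries.length === 0) return 0;
  let total = 0;
  for (const { query, relevantIds } of labeledQueries) {
    total += recallAtK(await retrieve(query), relevantIds, k);
  }
  return total / labeledQueries.length;
}

console.log(recallAtK(['a', 'b', 'c'], ['b', 'z'], 2)); // 0.5
```

Running `meanRecallAtK` once per retriever (vector only, text only, hybrid) with the same `labeledQueries` and `k` gives directly comparable numbers.
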
## References
- [MS RAG Techniques](https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/02/04/common-retrieval-augmented-generation-rag-techniques-explained/)
- [Collabnix: RAG Guide 2025](https://collabnix.com/rag-retrieval-augmented-generation-the-complete-guide-to-building-intelligent-ai-systems-in-2025/)
- [RRF Paper](https://arxiv.org/abs/2501.07391)
