refactor: migrate vector embedding from post-publication index to pre-publication hook #70

@rorybyrne

Summary

With the unified hook model (#66), transformations that produce derived data now run as OCI hooks before record minting. Vector embedding for search is currently implemented as a post-publication index step (VectorIndexHandler consuming IndexRecord events), but it should be a hook so that:

  1. Embeddings are computed before the record is minted (same as all other derived data)
  2. Embedding vectors are stored in feature tables like other hook outputs
  3. The search index reads from feature tables rather than computing embeddings on the fly
  4. The entire IndexRecord → VectorIndexHandler / KeywordIndexHandler fan-out machinery can be removed (tracked in refactor: event system overhaul — consumer groups, decoupled domains, simplified pipeline #68)

Current state

  • FanOutToIndexBackends handler consumes RecordPublished, emits IndexRecord per backend with routing keys
  • VectorIndexHandler (routing_key="vector") calls backend.ingest_batch() with sentence-transformers
  • KeywordIndexHandler (routing_key="keyword") indexes metadata text
  • ChromaDB is the vector backend
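
For readers unfamiliar with the fan-out being removed, the flow described above can be sketched roughly as follows. This is an illustrative reduction, not the project's code: the event fields (`record_id`, `text`), the `dispatch` helper, and the list-backed handlers are all assumptions made for the example. Only the event names and the routing keys come from this issue.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RecordPublished:
    # Field names are hypothetical; the real event shape is not shown in this issue.
    record_id: str
    text: str


@dataclass(frozen=True)
class IndexRecord:
    record_id: str
    text: str
    routing_key: str  # "vector" or "keyword", as listed above


def fan_out_to_index_backends(
    event: RecordPublished,
    routing_keys: tuple[str, ...] = ("vector", "keyword"),
) -> list[IndexRecord]:
    """Emit one IndexRecord per backend, tagged with that backend's routing key."""
    return [IndexRecord(event.record_id, event.text, key) for key in routing_keys]


def dispatch(
    events: list[IndexRecord],
    handlers: dict[str, Callable[[IndexRecord], None]],
) -> None:
    """Hypothetical dispatcher: route each event to the handler for its routing key."""
    for ev in events:
        handlers[ev.routing_key](ev)


# Stand-ins for VectorIndexHandler and KeywordIndexHandler: they only record
# what they receive instead of embedding or indexing anything.
vector_seen: list[IndexRecord] = []
keyword_seen: list[IndexRecord] = []
handlers = {"vector": vector_seen.append, "keyword": keyword_seen.append}

dispatch(fan_out_to_index_backends(RecordPublished("rec-1", "hello world")), handlers)
```

Every published record is copied once per backend, and the embedding work happens only after minting. That is the part the target state moves into a hook.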

Target state

  • A vector embedding hook (OCI container) runs during validation, producing features.json with embedding vectors
  • Embeddings are stored in the hook's feature table via InsertRecordFeatures
  • Search queries read from the feature table (or a materialized view / search index built from it)
  • The index domain's fan-out pattern is removed entirely
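
A minimal sketch of what the target state could look like, under several assumptions: the `features.json` layout (`record_id`, `embedding`), the `run_hook` and `search` functions, and the in-memory list standing in for the feature table are all hypothetical. `toy_embed` is a deterministic hash-based placeholder so the example runs without dependencies. It carries no semantic meaning, and a real hook would call an embedding model such as sentence-transformers instead.

```python
import hashlib
import json
import math
import tempfile
from pathlib import Path


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedder (assumption): hash bytes scaled to a unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def run_hook(record: dict, out_dir: Path, embed=toy_embed) -> Path:
    """Hook body: compute the embedding before minting and write features.json."""
    features = {"record_id": record["id"], "embedding": embed(record["text"])}
    out_path = out_dir / "features.json"
    out_path.write_text(json.dumps(features))
    return out_path


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec: list[float], feature_rows: list[dict], top_k: int = 3) -> list[str]:
    """Rank stored rows by cosine similarity; no embedding of records at query time."""
    scored = [(cosine(query_vec, row["embedding"]), row["record_id"]) for row in feature_rows]
    return [rid for _, rid in sorted(scored, reverse=True)[:top_k]]


records = [
    {"id": "rec-1", "text": "protein folding dataset"},
    {"id": "rec-2", "text": "ocean temperature measurements"},
]

# Stand-in for the hook's feature table: rows loaded back from each features.json.
feature_rows = []
for rec in records:
    with tempfile.TemporaryDirectory() as tmp:
        path = run_hook(rec, Path(tmp))
        feature_rows.append(json.loads(path.read_text()))

results = search(toy_embed("protein folding dataset"), feature_rows)
```

In this sketch, record embeddings are computed only inside the hook. The search path embeds the query and compares it against vectors that already exist in the feature rows, which is the separation the target state describes.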

Depends on

Labels

design-needed: Needs architectural discussion before implementation
refactor: Internal restructuring, no behavior change
