ContextFrame is a Python package and file specification designed to streamline context management for Large Language Models (LLMs) in applied AI/ML workflows. It builds upon the powerful .lance columnar data format, providing a standardized schema, validation tools, and user-friendly abstractions to enhance its usability.
Effectively managing the context provided to LLMs is crucial for generating relevant and accurate responses. This often involves handling diverse documents, maintaining structured metadata, tracking relationships, ensuring data consistency, and crucially, annotating content with user-specific context relevant to the task at hand – tasks that can become complex in real-world applications.
ContextFrame addresses these challenges by leveraging the .lance format's strengths (performance, versioning, vector support) and adding a layer of structure and semantics inspired by the original MDP project:
- Standardized Schema: Defines a specific, extensible schema (
contextframe/schema/contextframe_schema.json) for storing document content, rich metadata (including authorship, timestamps, status, source information), relationships, collection linkage, and vector embeddings within a.lancefile. This ensures consistency and provides a common structure for LLM context.- Record Type (
record_type): An enum column distinguishing between normaldocumentrows and special control rows likecollection_headerordataset_header. This enables efficient filtering and makes datasets self-describing. - Relationships: The schema includes a dedicated
relationshipsfield (a list of structs) to explicitly link related documents (e.g., parent/child, related, reference, member_of) using various identifiers (UUID, URI, Path, CID). - Collections: Fields like
collection,collection_id, andpositionfacilitate organizing documents into logical groups or sequences. - Source Tracking: Fields like
source_file,source_type, andsource_urlallow tracking the origin of content, useful for ingestion pipelines. - User-Defined Context: Fields like
context(string) andcustom_metadata(map) are explicitly designed to be editable by the user, allowing them to add workflow-specific annotations, descriptions of relevance (e.g., "This document is crucial for Project 123 because..."), or task-specific labels beyond the original source metadata.
- Record Type (
- Schema-Aware Abstractions: Offers Python classes (
FrameRecord,FrameDataset) that simplify interactions with Lance datasets conforming to the ContextFrame spec.FrameRecord: Represents a single document (row) in memory. It handles data validation against the schema, manages default values (like UUIDs and timestamps), provides convenience accessors and allows direct modification of common metadata fields (includingcontextandcustom_metadata), includes helpers for managing relationships (e.g.,add_relationship), and converts data to/from the Arrow format required by Lance. This enables users to easily layer their own semantic context onto the data.FrameDataset: Wraps the corelance.LanceDataset, providing schema-aware methods for common dataset operations:- Creation (
.create): Initializes a new.lancedirectory with the correct ContextFrame schema (usinglance.write_dataset). - Opening (
.open): Loads an existing ContextFrame dataset (usinglance.dataset). - Adding Records (
.add,.add_many): ValidatesFrameRecordmetadata against the schema and inserts the data into the Lance dataset (usingLanceDataset.insert). - Retrieval (
.get_by_uuid): Fetches a specific record by UUID (usingLanceDataset.scannerwith a filter). - Vector & Text Search Helpers
knn_search(...)– convenient nearest-neighbour search that returns fully-materialisedFrameRecordobjects and supports additional SQL filters.full_text_search(...)– wraps Lance BM25 full-text search for one-liner relevance queries.
- Rich Query Helpers – pre-built helpers for common metadata look-ups (e.g.
find_by_status,find_by_tag,find_related_to,find_by_author, date-range helpers, etc.). - Record Management (
.update_record,.delete_record,.upsert_record): Idiomatic helpers to update, remove, or upsert records by UUID while preserving schema integrity. - Collection Helpers (
.get_collection_header,.get_collection_members,.get_members_linked_to_header): Convenience methods to work with logical document collections identified bycollection+record_type='collection_header'.
- Creation (
- Validation: Integrates metadata validation using the JSON schema before writing data, preventing inconsistencies.
- Focus on Usability: Aims to enhance the developer experience for managing context data within
.lancefiles, without reimplementing Lance's core file I/O or advanced query capabilities. The underlyingLanceDatasetis always accessible (FrameDataset._native) for advanced use cases.
- Lance-Based: Built on the high-performance, version-aware
.lanceformat. - Rich, Editable & Standardized Metadata: Comprehensive schema for essential document details, including provenance, collections, status, and importantly, user-editable fields (
context,custom_metadata) for adding workflow-specific relevance and annotations. - Optional Raw Multimodal Data Store raw binary data (images, audio, video, PDFs, etc.) directly in the dataset using the
raw_data(bytes) andraw_data_type(IANA MIME type) columns added to the canonical schema. A helperMimeTypesenum undercontextframe.schemaexposes common constants (e.g.,MimeTypes.IMAGE_JPEG). - Explicit Record Typing The
RecordTypeenum (contextframe.schema.RecordType) provides canonical constants (DOCUMENT,COLLECTION_HEADER,DATASET_HEADER,FRAMESET). - Explicit Relationships: Dedicated schema support and helpers for defining connections between documents.
- Vector Embedding Support: Natively handles vector embeddings alongside text and metadata.
- Schema Validation: Ensures data consistency and integrity against the ContextFrame specification.
- Pythonic API: Provides intuitive classes (
FrameRecord,FrameDataset) for interacting with ContextFrame data. - Full CRUD Helpers: Built-in
.update_record,.delete_record, and.upsert_recordmethods enable convenient record modifications. - Document Extraction: Native extractors for common formats (text, JSON, CSV, YAML) with lightweight Rust-based chunking via
semantic-text-splitter. Supports integration patterns for advanced extraction tools (Docling, Unstructured.io, Chunkr, Reducto). - Embedding Generation: Unified embedding interface through LiteLLM supporting 100+ providers (OpenAI, Anthropic, Cohere, local models via Ollama, etc.) with automatic batching and fallback strategies.
- Lightweight Abstraction: Focuses on schema, validation, and usability, relying on the underlying
lancelibrary for core storage and query operations.
The contextframe package provides the core specification, schema, validation rules, and Python abstractions (FrameRecord, FrameDataset). It requires the lance library as a dependency.
While ContextFrame defines fields to track document source and relationships, the actual data conversion/ingestion (e.g., from PDF, HTML, other formats) and complex relationship graph analysis are considered extended functionalities. Users should leverage other libraries or build specific tools that produce or consume ContextFrame-compliant .lance files for these purposes.
# Core functionality only - fast, lightweight install
pip install contextframe==0.1.2Important: Use version 0.1.2 or later for correct dependencies (pylance).
# Document extraction (PDF, DOCX, HTML, PPTX, XLSX)
pip install "contextframe[extract]"
# Embedding generation (OpenAI, Anthropic, Cohere, Sentence Transformers)
pip install "contextframe[embed]"
# AI-powered enhancement (summarization, metadata extraction)
pip install "contextframe[enhance]"
# MP4 encoding for universal distribution
pip install "contextframe[encode]"
# MCP server for LLM/agent integration
pip install "contextframe[serve]"
# Install everything
pip install "contextframe[all]"- Just managing Lance datasets? → Use core package
- Need to extract content from documents? → Add
[extract] - Creating embeddings? → Add
[embed] - Using LLMs for enrichment? → Add
[enhance] - Distributing datasets as MP4? → Add
[encode] - Serving to LLMs via MCP? → Add
[serve]
from contextframe import FrameRecord, FrameDataset
from contextframe.schema.contextframe_schema import MimeTypes, RecordType
import numpy as np
import pyarrow as pa # Lance works closely with Arrow
# 1. Create FrameRecord objects (in-memory representation)
# - Handles metadata validation and default values (e.g., uuid, dates)
# - Prepares data for Lance/Arrow
# - Includes helpers for relationships
record1 = FrameRecord.create(
title="Introduction to Lance",
content="Lance is a modern columnar format...",
vector=np.random.rand(1536).astype(np.float32), # Example embedding
author="LanceDB Team",
tags=["lance", "columnar", "vector db"],
custom_metadata={"project": "ContextFrame"}, # Example custom field
record_type=RecordType.DOCUMENT,
)
# Example with raw embedded image content and collection header
# (demonstrates use of RecordType enum and collection helpers)
image_bytes = b"...binary jpeg bytes..." # Replace with real bytes
# Create a *collection header* document which holds contextual information
collection_header = FrameRecord.create(
title="Project X Docs – Collection Header",
content="Overview and context for Project X documents...",
vector=np.zeros(1536, dtype=np.float32),
tags=["project-x"],
collection="project_x",
record_type=RecordType.COLLECTION_HEADER,
)
record2 = FrameRecord.create(
title="Project X – Image Asset",
content="This record stores an image relevant to Project X...",
vector=np.random.rand(1536).astype(np.float32),
collection="project_x",
raw_data=image_bytes,
raw_data_type=MimeTypes.IMAGE_JPEG,
)
# Later, add or modify context based on your workflow needs
# record1.metadata['context'] = "This is the primary specification document for the ContextFrame project."
# record1.metadata['custom_metadata']['review_status'] = 'approved'
# record1.tags.append('core_concept') # Tags are also mutable
# Add a relationship between records (managed within FrameRecord metadata)
record2.add_relationship(collection_header, relationship_type="member_of")
# FrameRecord handles conversion to Arrow Table format needed by Lance
# table1 = record1.to_table()
# print(table1.schema)
# 2. Use FrameDataset to manage records and interact with Lance files
# - FrameDataset wraps the underlying lance.LanceDataset
dataset_path = "my_context_frames.lance"
# Create a new dataset (writes an empty .lance dir with ContextFrame schema)
# try:
# frame_dataset = FrameDataset.create(dataset_path, embed_dim=1536)
# except FileExistsError: # Handle existing dataset
# frame_dataset = FrameDataset.open(dataset_path)
# Add records (validates schema, converts FrameRecord to Arrow, uses LanceDataset.insert)
# frame_dataset.add(record1)
# frame_dataset.add(record2) # Can add records individually or use add_many
# Open an existing dataset
# existing_dataset = FrameDataset.open(dataset_path)
# print(f"Dataset URI: {existing_dataset._native.uri}")
# print(f"Schema: {existing_dataset._native.schema}")
# print(f"Versions: {existing_dataset.versions()}")
# Advanced query helpers
# 3-nearest neighbours restricted to a collection
# query_vector = np.random.rand(1536).astype(np.float32)
# top3 = existing_dataset.knn_search(query_vector, k=3, filter="collection = 'project_x'")
# print([r.title for r in top3])
# Full-text search across text_content
# hits = existing_dataset.full_text_search("vector database", k=10)
# for rec in hits:
# print(rec.title, rec.metadata.get("score"))
# Metadata helpers
# published_docs = existing_dataset.find_by_status("published")
# tag_matches = existing_dataset.find_by_tag("lance")
# Load a specific record using the helper
# loaded_record = FrameRecord.from_dataset_row(dataset_path, uuid=record1.uuid)
# print(f"Loaded record title: {loaded_record.title}")
# print(f"Loaded record relationships: {loaded_record.metadata.get('relationships')}")
# print(f"Loaded record custom metadata: {loaded_record.metadata.get('custom_metadata')}")
# Access the native LanceDataset object if needed
# native_lance_dataset = existing_dataset._native
# print(f"Native dataset row count: {native_lance_dataset.count_rows()}")
# Update a record (delete-then-add under the hood)
# record1.title = "Introduction to Lance (Updated)"
# frame_dataset.update_record(record1)
# Delete a record by UUID
# frame_dataset.delete_record(record2.uuid)
# Upsert semantics – insert or replace
# frame_dataset.upsert_record(record1)(Please note: The usage example above is conceptual and may require adjustments based on the final API design. It illustrates the intended workflow.)
FrameDataset exposes index management helpers that wrap Lance native functionality so you can optimise searches without reading the docs every time:
# Build (or replace) a vector IVF+PQ index – essential for fast knn_search
frame_dataset.create_vector_index(index_type="IVF_PQ", num_partitions=256, num_sub_vectors=64)
# Create a bitmap index on the 'status' column to speed up filters
frame_dataset.create_scalar_index("status")Behind the scenes these forward to LanceDataset.create_index() / create_scalar_index(). See the Lance docs for tuning parameters.
This project is licensed under the LGPL-3.0-or-later License - see the LICENSE file for details.