A semantic PDF retrieval system that parses PDFs into searchable chunks and provides an AI-powered agent for document Q&A with citations.
pdf-tree takes a PDF document, extracts its content into semantically meaningful chunks using the Extend API, builds a searchable index, and exposes a conversational REPL where you can ask questions about the document. The agent retrieves relevant chunks and provides answers with precise citations (chunk IDs and page numbers).
┌─────────────────────────────────────────────────────────────────────────┐
│ PDF Document │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Extend Parse API │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ config: │ │
│ │ target: "markdown" │ │
│ │ chunkingStrategy: { type: "section", min: 500, max: 2000 } │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Indexer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Chunks │ │ Headings │ │ Sections │ │
│ │ (semantic) │ │ (hierarchy) │ │ (numbered) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └─────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ semantic-tree.json │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Agent (GPT-4.1) │
│ │
│ Tools: │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ get_doc_summary│ │ search_headings│ │ search_sections│ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ ┌────────────────┐ │
│ │ get_chunks │ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ REPL │
│ │
│ You: What does Section 15 say about liability? │
│ Agent: [Calling search_headings...] │
│ [Calling search_sections...] │
│ [Calling get_chunks...] │
│ Section 15 covers limitation of liability... │
│ [chunk-42, pages 12-13] │
└─────────────────────────────────────────────────────────────────────────┘
semantic-tree.json
├── title: string # Document title
├── summary: string # AI-generated summary
├── chunks[] # Semantic chunks (500-2000 chars)
│ ├── id: string
│ ├── content: string # Markdown content
│ ├── pageStart: number
│ ├── pageEnd: number
│ └── blockIds: string[] # References to granular blocks
├── blocks[] # Raw extracted blocks
│ ├── id: string
│ ├── content: string
│ ├── page: number
│ ├── type: string
│ └── boundingBox: {...}
├── headings[] # Document structure
│ ├── heading: string
│ ├── level: number
│ └── chunkIds: string[]
└── sections[] # Numbered sections (e.g., "4.7")
    ├── section: string
    └── chunkIds: string[]
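The layout above can be written down as TypeScript types. The sketch below mirrors the field names from the tree; the `boundingBox` shape is elided in the schema, so it is left opaque here. `buildSectionIndex` is a hypothetical helper, not the project's indexer: it assumes section references appear in chunk text as "Section 4.7" and shows one way the `sections[]` entries could be derived.

```typescript
// Types mirroring the semantic-tree.json layout.
interface Chunk {
  id: string;
  content: string; // Markdown content
  pageStart: number;
  pageEnd: number;
  blockIds: string[]; // References to granular blocks
}

interface Block {
  id: string;
  content: string;
  page: number;
  type: string;
  boundingBox: unknown; // Shape is not given in the schema, so it stays opaque
}

interface Heading {
  heading: string;
  level: number;
  chunkIds: string[];
}

interface SectionRef {
  section: string; // e.g. "4.7"
  chunkIds: string[];
}

interface SemanticTree {
  title: string;
  summary: string;
  chunks: Chunk[];
  blocks: Block[];
  headings: Heading[];
  sections: SectionRef[];
}

// Hypothetical helper: map numbered section references to the chunks that
// mention them. Assumes references look like "Section 4.7"; the real indexer
// may use different rules.
function buildSectionIndex(chunks: Chunk[]): SectionRef[] {
  const bySection = new Map<string, Set<string>>();
  for (const chunk of chunks) {
    const pattern = /\bSection\s+(\d+(?:\.\d+)*)/gi;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(chunk.content)) !== null) {
      const section = match[1];
      if (!bySection.has(section)) bySection.set(section, new Set<string>());
      bySection.get(section)!.add(chunk.id);
    }
  }
  // Map preserves insertion order, so sections appear in first-seen order.
  const result: SectionRef[] = [];
  bySection.forEach((ids, section) => {
    result.push({ section, chunkIds: Array.from(ids) });
  });
  return result;
}
```

Keeping `chunkIds` on both headings and sections means every lookup ends in the same place: a list of chunk IDs that can be resolved against `chunks[]` for content and page numbers.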
- Install dependencies:
  bun install
- Set environment variables:
  export EXTEND_API_KEY="your-extend-api-key"
  export OPENAI_API_KEY="your-openai-api-key"

bun run src/index.ts <pdf-url>
This parses the PDF, extracts semantic chunks, and writes semantic-tree.json.

bun run index.ts
This starts the REPL. Available commands:
- Type a question to ask the agent
- /debug - Show chat history
- /clear - Clear chat history
- /quit - Exit
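
The REPL commands can be pictured as a small dispatcher that separates slash commands from questions. This is a minimal sketch, not the project's code: `ChatMessage`, `CommandResult`, and `handleInput` are hypothetical names, and the history is assumed to be a plain in-memory array.

```typescript
// Hypothetical shape of one chat history entry.
type ChatMessage = { role: "user" | "assistant"; content: string };

// What the REPL loop should do next for a given line of input.
type CommandResult =
  | { kind: "debug"; output: string }
  | { kind: "clear" }
  | { kind: "quit" }
  | { kind: "question"; text: string };

function handleInput(line: string, history: ChatMessage[]): CommandResult {
  const input = line.trim();
  switch (input) {
    case "/debug":
      // Show chat history as formatted JSON.
      return { kind: "debug", output: JSON.stringify(history, null, 2) };
    case "/clear":
      // Empty the array in place so other references see the cleared history.
      history.length = 0;
      return { kind: "clear" };
    case "/quit":
      return { kind: "quit" };
    default:
      // Anything else is treated as a question for the agent.
      return { kind: "question", text: input };
  }
}
```

Returning a tagged result instead of acting directly keeps the dispatcher free of I/O, so the loop that reads stdin and calls the agent stays separate from command parsing.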
- Parsing: The Extend API parses the PDF with section-based chunking, creating semantic chunks of 500-2000 characters that respect markdown structure (headings, paragraphs, tables).
- Indexing: The indexer extracts:
  - Title and summary (via GPT-4.1)
  - Headings with hierarchy levels
  - Section references (e.g., "Section 4.7") mapped to chunks
- Retrieval: The agent uses tools to:
  - Get document overview (get_doc_summary)
  - Search headings by keyword (search_headings)
  - Look up numbered sections (search_sections)
  - Retrieve chunk content (get_chunks)
- Answer Generation: The agent synthesizes retrieved chunks into answers with citations pointing to specific chunks and page numbers.
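
The retrieval and citation steps above can be sketched as plain functions over the index. The function names follow the four tool names, but the bodies are assumptions: keyword search is taken to be a case-insensitive substring match, section lookup an exact match, and `formatCitation` a hypothetical helper that reproduces the "[chunk-42, pages 12-13]" style from the REPL example.

```typescript
// Only the fields these tools need; names follow semantic-tree.json.
type ChunkRecord = { id: string; content: string; pageStart: number; pageEnd: number };
type TreeIndex = {
  title: string;
  summary: string;
  chunks: ChunkRecord[];
  headings: { heading: string; level: number; chunkIds: string[] }[];
  sections: { section: string; chunkIds: string[] }[];
};

// get_doc_summary: document overview.
function getDocSummary(tree: TreeIndex): string {
  return `${tree.title}: ${tree.summary}`;
}

// search_headings: case-insensitive substring match (an assumed matching rule).
function searchHeadings(tree: TreeIndex, keyword: string) {
  const needle = keyword.toLowerCase();
  return tree.headings.filter((h) => h.heading.toLowerCase().includes(needle));
}

// search_sections: exact match on the section number, e.g. "4.7".
function searchSections(tree: TreeIndex, section: string) {
  return tree.sections.filter((s) => s.section === section);
}

// get_chunks: resolve chunk IDs to full chunk records, in document order.
function getChunks(tree: TreeIndex, ids: string[]): ChunkRecord[] {
  const wanted = new Set(ids);
  return tree.chunks.filter((c) => wanted.has(c.id));
}

// Hypothetical helper: render a citation with chunk ID and page range.
function formatCitation(chunk: ChunkRecord): string {
  const pages =
    chunk.pageStart === chunk.pageEnd
      ? `page ${chunk.pageStart}`
      : `pages ${chunk.pageStart}-${chunk.pageEnd}`;
  return `[${chunk.id}, ${pages}]`;
}
```

A typical lookup chains these: `searchSections(tree, "15")` yields chunk IDs, `getChunks` resolves them, and `formatCitation` is applied to each chunk the answer draws on. Heading and section searches return only IDs, so chunk content is loaded only for the matches the agent actually needs.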
MIT