RAG Pipelines
Your AI hallucinates when its context is stale
RAG quality is a function of what goes into your vector store. If you embedded those docs three months ago and the source has changed since, your retrieval layer is returning outdated context. Your model generates confidently wrong answers, and your users cannot tell the difference. Spider gives you clean, current markdown so your re-embedding pipeline always works from the latest source.
from spider import Spider

client = Spider()
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
    },
)
# 342 pages in 4.2s

for page in pages:
    chunks = split(page["content"])
    vectors = embed(chunks)
    db.upsert(vectors, source=page["url"])

The freshness problem
Embeddings do not update themselves. The moment your source docs change, your retrieval layer starts drifting from reality.
Stale pipeline
Crawled docs, built embeddings. Everything is accurate.
Source docs updated. Your embeddings still reflect January.
API endpoints changed. AI cites deprecated methods.
Users get wrong answers. Trust erodes. They stop asking.
Incremental updates
Crawled docs, built embeddings. Everything is accurate.
You re-crawl with Spider and compare content hashes on your side. 12 pages changed, so only those get re-embedded.
Scheduled re-crawl catches API changes within hours. Your pipeline re-embeds only what changed.
AI answers match current documentation. Users trust the system.
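The hash comparison described above can be sketched in a few lines. This is a minimal illustration, not Spider's API: `content_hash` and `incremental_update` are hypothetical helpers, `stored_hashes` stands in for whatever store holds your last-seen digests, and `reembed` is a callback wrapping your own chunk/embed/upsert step.

```python
import hashlib


def content_hash(markdown: str) -> str:
    # Stable fingerprint of the page body; any edit changes the digest.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def incremental_update(pages, stored_hashes, reembed):
    """Re-embed only pages whose content changed since the last crawl.

    pages: crawl results, each with "url" and "content" keys.
    stored_hashes: dict mapping url -> digest from the previous crawl.
    reembed: callback that chunks, embeds, and upserts one page.
    Returns the list of URLs that were re-embedded.
    """
    changed = []
    for page in pages:
        digest = content_hash(page["content"])
        if stored_hashes.get(page["url"]) != digest:
            reembed(page)
            stored_hashes[page["url"]] = digest
            changed.append(page["url"])
    return changed
```

Run this after each scheduled re-crawl: unchanged pages are skipped entirely, so a 500-page docs site with 12 edits costs 12 embedding calls, not 500.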
What your embeddings actually see
Embedding models treat all input tokens equally. If half the tokens are navigation links, cookie banners, and ad scripts, your similarity search is matching on noise. Clean input is the single highest-leverage improvement you can make to retrieval quality.
<!DOCTYPE html>
<html><head>
<script src="/analytics.js"></script>
<script src="/hotjar.js"></script>
</head><body>
<nav class="main-nav">
<a href="/">Home</a>
<a href="/docs">Docs</a>
<a href="/pricing">Pricing</a>
<a href="/blog">Blog</a>
<a href="/login">Sign In</a>
</nav>
<div class="cookie-banner">
We use cookies...
</div>
<div class="sidebar">
<a href="/docs/intro">Introduction</a>
<a href="/docs/auth">Authentication</a>
<a href="/docs/api">API Reference</a>
... 40 more sidebar links ...
</div>
<article>
<h1>Authentication</h1>
<p>Pass your API key as a
Bearer token in the header.</p>
</article>
<footer>
... 200 lines of footer ...
</footer>
<script src="/chat-widget.js"></script>
</body></html>

# Authentication
Pass your API key as a Bearer token
in the Authorization header.
```bash
curl https://api.example.com/v1/data \
-H "Authorization: Bearer sk-your-key"
```
## Rate Limits
Each key allows 1,000 requests per
minute. Exceeding this returns a
`429 Too Many Requests` response.
## Error Handling
All errors follow a standard format:
```json
{
"error": {
"code": "rate_limited",
"message": "Retry after 60s"
}
}
```
---
source: docs.example.com/auth
crawled: 2026-04-02T08:14:22Z

Token reduction varies by page. Content-heavy documentation pages typically see 80-95% fewer tokens after Spider strips navigation, scripts, and boilerplate. The result: more pages fit in your context window and every chunk carries actual meaning.
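To check the reduction on your own pages, a rough sketch using the common ~4-characters-per-token heuristic (swap in your embedding model's actual tokenizer for exact counts); `raw_html` and `markdown` stand in for the same page before and after cleaning.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with your embedding model's tokenizer for exact counts.
    return max(1, len(text) // 4)


def token_reduction(raw_html: str, markdown: str) -> float:
    # Fraction of tokens removed by cleaning, e.g. 0.9 == 90% fewer.
    before = approx_tokens(raw_html)
    after = approx_tokens(markdown)
    return 1 - after / before
```

Measuring a sample of your pages this way tells you how many more documents fit in a fixed context budget after cleaning.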