RAG Pipelines
Your AI hallucinates when its context is stale
RAG quality is a function of what goes into your vector store. If you embedded those docs three months ago and the source has changed since, your retrieval layer is returning outdated context. Your model generates confidently wrong answers, and your users cannot tell the difference. Spider gives you clean, current markdown so your re-embedding pipeline always works from the latest source.
from spider import Spider

client = Spider()
pages = client.crawl_url(
    "https://docs.example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
    },
)
# 342 pages in 4.2s

for page in pages:
    chunks = split(page["content"])
    vectors = embed(chunks)
    db.upsert(vectors, source=page["url"])

The freshness problem
Embeddings do not update themselves. The moment your source docs change, your retrieval layer starts drifting from reality.
Stale pipeline
Crawled docs, built embeddings. Everything is accurate.
Source docs updated. Your embeddings still reflect January.
API endpoints changed. AI cites deprecated methods.
Users get wrong answers. Trust erodes. They stop asking.
Incremental updates
Crawled docs, built embeddings. Everything is accurate.
You re-crawl with Spider and compare content hashes on your side. 12 pages changed, so only those get re-embedded.
Scheduled re-crawl catches API changes within hours. Your pipeline re-embeds only what changed.
AI answers match current documentation. Users trust the system.
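The hash comparison described above can be sketched in a few lines. This is a minimal illustration, not Spider's API: `content_hash` and `incremental_update` are hypothetical helpers, `stored_hashes` stands in for whatever store holds your last-seen digests, and `reembed` is a callback wrapping your own chunk/embed/upsert step.

```python
import hashlib


def content_hash(markdown: str) -> str:
    # Stable fingerprint of the page body; any edit changes the digest.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def incremental_update(pages, stored_hashes, reembed):
    """Re-embed only pages whose content changed since the last crawl.

    pages: crawl results, each with "url" and "content" keys.
    stored_hashes: dict mapping url -> digest from the previous crawl.
    reembed: callback that chunks, embeds, and upserts one page.
    Returns the list of URLs that were re-embedded.
    """
    changed = []
    for page in pages:
        digest = content_hash(page["content"])
        if stored_hashes.get(page["url"]) != digest:
            reembed(page)
            stored_hashes[page["url"]] = digest
            changed.append(page["url"])
    return changed
```

Run this after each scheduled re-crawl: unchanged pages are skipped entirely, so a 500-page docs site with 12 edits costs 12 embedding calls, not 500.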
What your embeddings actually see
Embedding models treat all input tokens equally. If half the tokens are navigation links, cookie banners, and ad scripts, your similarity search is matching on noise. Clean input is the single highest-leverage improvement you can make to retrieval quality.
<!DOCTYPE html>
<html><head>
<script src="/analytics.js"></script>
<script src="/hotjar.js"></script>
</head><body>
<nav class="main-nav">
<a href="/">Home</a>
<a href="/docs">Docs</a>
<a href="/pricing">Pricing</a>
<a href="/blog">Blog</a>
<a href="/login">Sign In</a>
</nav>
<div class="cookie-banner">
We use cookies...
</div>
<div class="sidebar">
<a href="/docs/intro">Introduction</a>
<a href="/docs/auth">Authentication</a>
<a href="/docs/api">API Reference</a>
... 40 more sidebar links ...
</div>
<article>
<h1>Authentication</h1>
<p>Pass your API key as a
Bearer token in the header.</p>
</article>
<footer>
... 200 lines of footer ...
</footer>
<script src="/chat-widget.js"></script>
</body></html>

# Authentication
Pass your API key as a Bearer token
in the Authorization header.
```bash
curl https://api.example.com/v1/data \
-H "Authorization: Bearer sk-your-key"
```
## Rate Limits
Each key allows 1,000 requests per
minute. Exceeding this returns a
`429 Too Many Requests` response.
## Error Handling
All errors follow a standard format:
```json
{
"error": {
"code": "rate_limited",
"message": "Retry after 60s"
}
}
```
---
source: docs.example.com/auth
crawled: 2026-04-02T08:14:22Z

Token reduction varies by page. Content-heavy documentation pages typically see 80-95% fewer tokens after Spider strips navigation, scripts, and boilerplate. The result: more pages fit in your context window and every chunk carries actual meaning.
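To check the reduction on your own pages, a rough sketch using the common ~4-characters-per-token heuristic (swap in your embedding model's actual tokenizer for exact counts); `raw_html` and `markdown` stand in for the same page before and after cleaning.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Replace with your embedding model's tokenizer for exact counts.
    return max(1, len(text) // 4)


def token_reduction(raw_html: str, markdown: str) -> float:
    # Fraction of tokens removed by cleaning, e.g. 0.9 == 90% fewer.
    before = approx_tokens(raw_html)
    after = approx_tokens(markdown)
    return 1 - after / before
```

Measuring a sample of your pages this way tells you how many more documents fit in a fixed context budget after cleaning.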