
RAG Pipelines

Your AI hallucinates when its context is stale

RAG quality is a function of what goes into your vector store. If you embedded those docs three months ago and the source has changed since, your retrieval layer is returning outdated context. Your model generates confidently wrong answers, and your users cannot tell the difference. Spider gives you clean, current markdown so your re-embedding pipeline always works from the latest source.

The freshness problem

Embeddings do not update themselves. The moment your source docs change, your retrieval layer starts drifting from reality.

Stale pipeline

Jan: Crawled docs, built embeddings. Everything is accurate.

Feb: Source docs updated. Your embeddings still reflect January.

Mar: API endpoints changed. AI cites deprecated methods.

Apr: Users get wrong answers. Trust erodes. They stop asking.

Incremental updates

Jan: Crawled docs, built embeddings. Everything is accurate.

Feb: You re-crawl with Spider and compare content hashes on your side. 12 pages changed, so only those get re-embedded.

Mar: Scheduled re-crawl catches API changes within hours. Your pipeline re-embeds only what changed.

Apr: AI answers match current documentation. Users trust the system.
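The hash comparison behind incremental updates fits in a few lines. This is an illustrative sketch, not Spider's API: the URLs are placeholders, and the markdown would come from your crawl results.

```python
import hashlib

def changed_pages(pages, stored_hashes):
    """Return pages whose markdown differs from the stored hash.

    `pages` is an iterable of (url, markdown) pairs from a fresh crawl;
    `stored_hashes` maps url -> sha256 hex digest from the previous crawl.
    """
    changed = []
    for url, markdown in pages:
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if stored_hashes.get(url) != digest:
            changed.append((url, markdown))
            stored_hashes[url] = digest  # remember for the next run
    return changed

# Example: only the edited page is flagged for re-embedding.
previous = {
    "https://docs.example.com/auth": hashlib.sha256(b"old auth docs").hexdigest(),
    "https://docs.example.com/intro": hashlib.sha256(b"intro docs").hexdigest(),
}
fresh = [
    ("https://docs.example.com/auth", "new auth docs"),
    ("https://docs.example.com/intro", "intro docs"),
]
to_reembed = changed_pages(fresh, previous)
```

Only the changed page goes back through your embedding model; the unchanged one keeps its existing vectors.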

What your embeddings actually see

Embedding models treat all input tokens equally. If half the tokens are navigation links, cookie banners, and ad scripts, your similarity search is matching on noise. Clean input is the single highest-leverage improvement you can make to retrieval quality.

Without Spider
<!DOCTYPE html>
<html><head>
<script src="/analytics.js"></script>
<script src="/hotjar.js"></script>
</head><body>
<nav class="main-nav">
  <a href="/">Home</a>
  <a href="/docs">Docs</a>
  <a href="/pricing">Pricing</a>
  <a href="/blog">Blog</a>
  <a href="/login">Sign In</a>
</nav>
<div class="cookie-banner">
  We use cookies...
</div>
<div class="sidebar">
  <a href="/docs/intro">Introduction</a>
  <a href="/docs/auth">Authentication</a>
  <a href="/docs/api">API Reference</a>
  ... 40 more sidebar links ...
</div>
<article>
  <h1>Authentication</h1>
  <p>Pass your API key as a
  Bearer token in the header.</p>
</article>
<footer>
  ... 200 lines of footer ...
</footer>
<script src="/chat-widget.js"></script>
</body></html>
~3,200 tokens (this example) | most tokens are boilerplate
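For comparison, the same page after Spider strips the boilerplate would look roughly like this (illustrative, not literal API output):

```markdown
# Authentication

Pass your API key as a Bearer token in the header.
```

~15 tokens | every token is content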

Token reduction varies by page. Content-heavy documentation pages typically see 80-95% fewer tokens after Spider strips navigation, scripts, and boilerplate. The result: more pages fit in your context window and every chunk carries actual meaning.

Drop into your existing stack

Spider ships as a native document loader for LangChain and LlamaIndex. Call loader.load(), get back documents with metadata already attached, and upsert them into your vector store. No parsing. No glue code. Each document includes its source URL and crawl timestamp for attribution.

from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    url="https://docs.example.com",
    api_key="your-api-key",
    mode="crawl",
)

documents = loader.load()

# .page_content = clean markdown
# .metadata = source URL, title, timestamp
vector_store.add_documents(documents)

What happens at scale

RAG pipelines often cover large documentation sites with thousands of pages. Here is what to expect when you scale up.

Streaming delivery

Use lazy_load() with the LangChain loader or stream results from the API. Pages arrive as they are crawled, so your embedding pipeline can start processing without waiting for the full crawl to finish.
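A small batching helper makes this pattern concrete: embed each batch while later pages are still being crawled. The SpiderLoader usage is shown in comments since it needs an API key; the helper itself is generic, and the batch size of 32 is an arbitrary choice.

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# With the LangChain loader (assumes langchain_community is installed):
#
#   loader = SpiderLoader(url="https://docs.example.com",
#                         api_key="your-api-key", mode="crawl")
#   for batch in batched(loader.lazy_load(), 32):
#       vector_store.add_documents(batch)  # embed while the crawl continues
#
# The same helper works on any document stream:
pages = (f"page-{i}" for i in range(70))
batches = list(batched(pages, 32))
```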

Webhook notifications

Set a webhook URL and Spider POSTs results as pages complete. Useful for large crawls where you want to decouple the crawl from your ingestion pipeline. Events include on_find and on_website_status.
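A minimal receiver might look like the sketch below. The event names come from the text above, but the payload field names (`event`, `url`, `status`) are assumptions here; check Spider's webhook documentation for the actual schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload):
    """Route a webhook payload to the right pipeline action."""
    event = payload.get("event")
    if event == "on_find":
        # A page finished crawling: hand it to the ingestion queue.
        return ("ingest", payload.get("url"))
    if event == "on_website_status":
        # The crawl finished (or failed): trigger any cleanup.
        return ("status", payload.get("status"))
    return ("ignore", None)

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        handle_event(payload)
        self.send_response(200)
        self.end_headers()

# HTTPServer(("", 8080), WebhookHandler).serve_forever()  # run to receive events
```

Keeping the routing logic in a plain function (`handle_event`) separate from the HTTP handler makes it easy to swap the stdlib server for whatever framework your ingestion service already uses.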

Cost at volume

Spider charges per page based on bandwidth and compute. Crawling 10,000 documentation pages in markdown mode costs a few dollars, depending on page size. Check the pricing page for current rates.

GUIDE: Building a RAG Scraper. From first crawl to production pipeline, step by step.

DOCS: LangChain Integration. Use Spider as a document loader in LangChain.

DOCS: LlamaIndex Integration. Spider Reader for LlamaIndex pipelines.

Stop building crawlers. Ship your AI.

Your retrieval layer is only as good as its data. Start feeding it clean, structured, and current web content today.