# CrawlStudio

Unified wrapper for web crawling tools, inspired by modular, community-driven design.

CrawlStudio provides a unified Python API over several web crawling backends, including Firecrawl, Crawl4AI, Scrapy, and Browser-Use (AI-driven). It emphasizes modularity, ease of use, and intelligent extraction.
## Installation

From PyPI:

```bash
pip install crawlstudio
```

From source (recommended for contributors):

```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e .[dev]
```

Optional extras for the AI browser backend:

```bash
pip install .[browser-use]
```

## Quickstart

CLI:

```bash
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend firecrawl --format markdown --print markdown
```

Python:
```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    cfg = CrawlConfig()
    res = await FirecrawlBackend(cfg).crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(res.markdown)

asyncio.run(main())
```

## Configuration

Create a `.env` file in the project root if you use external services or backends:
```env
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
```

For the Browser-Use backend, you must set at least one of `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
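As a quick sanity check before invoking the Browser-Use backend, you can verify that at least one of the required keys is present. The `browser_use_ready` helper below is purely illustrative and not part of CrawlStudio:

```python
import os

# Hypothetical helper: returns True when at least one of the LLM API keys
# required by the Browser-Use backend is set in the environment.
def browser_use_ready() -> bool:
    return any(os.environ.get(k) for k in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"))
```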
If you use headless browsers (via browser-use), install the Playwright runtime:

```bash
python -m pip install playwright
python -m playwright install
```

## CLI Usage

After installation, use the CLI to crawl a URL with different backends and formats:
```bash
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend firecrawl --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend crawl4ai --format html --print html
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend scrapy --format markdown --print markdown
crawlstudio https://en.wikipedia.org/wiki/Switzerland --backend browser-use --format markdown --print markdown
```

Options:

- `--backend`: one of `firecrawl`, `crawl4ai`, `scrapy`, `browser-use`
- `--format`: one of `markdown`, `html`, `structured`
- `--print`: what to print: `summary` (default), `markdown`, `html`, `structured`
## Python API

The library exposes a unified interface; below are end-to-end examples for each backend.
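Because the interface is unified, every backend exposes the same async `crawl()` signature, so one URL can be fanned out to several backends concurrently with `asyncio.gather`. The sketch below substitutes a stand-in `DummyBackend` class (an assumption for illustration, not a CrawlStudio class) so it runs without network access:

```python
import asyncio

# Stand-in for the real backend classes, which share an async crawl() method.
class DummyBackend:
    def __init__(self, name: str) -> None:
        self.name = name

    async def crawl(self, url: str, format: str = "markdown") -> str:
        await asyncio.sleep(0)  # real backends perform network I/O here
        return f"{self.name}:{format}"

async def crawl_all(url: str) -> list[str]:
    # Fan one URL out to several backends concurrently; gather preserves order.
    backends = [DummyBackend(n) for n in ("firecrawl", "crawl4ai", "scrapy")]
    return await asyncio.gather(*(b.crawl(url) for b in backends))

results = asyncio.run(crawl_all("https://example.com"))
print(results)
```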
### Firecrawl

```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)

asyncio.run(main())
```

### Crawl4AI

```python
import asyncio
from crawlstudio import CrawlConfig, Crawl4AIBackend

async def main():
    config = CrawlConfig()
    backend = Crawl4AIBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # Outputs title, summary, keywords

asyncio.run(main())
```

### Scrapy

```python
import asyncio
from crawlstudio import CrawlConfig, ScrapyBackend

async def main():
    config = CrawlConfig()
    backend = ScrapyBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="html")
    print(result.raw_html)

asyncio.run(main())
```

### Browser-Use

```python
import asyncio
from crawlstudio import CrawlConfig, BrowserUseBackend

async def main():
    config = CrawlConfig()
    backend = BrowserUseBackend(config)
    result = await backend.crawl("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
    print(result.markdown)  # AI-extracted data

asyncio.run(main())
```

## Development

Run the test suite (pytest) and local checks (flake8, mypy):
```bash
pytest -q
flake8
mypy crawlstudio
```

Notes:

- We target Python 3.10+ for typing (PEP 604 `X | Y` unions).
- Third-party libraries without type stubs are ignored by mypy (`ignore_missing_imports = true`).
## Contributing

- Fork and clone the repo, create a virtual env, then install dev deps:

```bash
git clone https://github.com/aiapicore/CrawlStudio.git
cd CrawlStudio
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -e .[dev]
```

- Optional: install pre-commit hooks:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

- Run the checks before submitting a PR:

```bash
flake8
mypy crawlstudio
pytest -q
```

## Backend Comparison

| Backend | Speed | Cost | AI Intelligence | Best For |
|---|---|---|---|---|
| Firecrawl | Fast | API costs | Medium | Production scraping |
| Crawl4AI | Medium | Free | Medium | Development & testing |
| Scrapy | Fastest | Free | Low | Simple HTML extraction |
| Browser-Use | Slower | AI costs | High | Complex dynamic sites |
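One way to read the table is as a decision rule: pay for intelligence on dynamic sites, stay free and fast for static HTML. The chooser below is purely illustrative (both the function and its mapping are assumptions, not CrawlStudio API):

```python
# Illustrative backend chooser derived from the comparison table above.
def pick_backend(dynamic_site: bool, paid_ok: bool) -> str:
    if dynamic_site:
        # Complex dynamic sites favor AI-driven crawling when budget allows.
        return "browser-use" if paid_ok else "crawl4ai"
    # Static pages: Firecrawl for production (API costs), Scrapy for free speed.
    return "firecrawl" if paid_ok else "scrapy"
```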
## Recursive Crawling (Planned)

```python
# Future API - configurable depth and page limits
config = CrawlConfig(
    max_depth=3,                  # Crawl up to 3 levels deep
    max_pages_per_level=5,        # Max 5 pages per depth level
    recursive_delay=1.0,          # 1-second delay between requests
    follow_external_links=False,  # Stay within the same domain
)

# Recursive crawling with depth control
result = await backend.crawl_recursive("https://en.wikipedia.org/wiki/Switzerland", format="markdown")
print(f"Crawled {len(result.pages)} pages across {result.max_depth_reached} levels")
```

## Potential Future Backends

- Playwright - Fast browser automation, excellent for SPAs
- Selenium - Industry standard, huge ecosystem
- BeautifulSoup + Requests - Lightweight, simple parsing
- Apify SDK - Cloud scraping platform
- Colly (via Python bindings) - High-performance Go crawler
- Puppeteer (via pyppeteer) - Headless Chrome control
- ScrapeGraphAI - LLM-powered scraping
- AutoScraper - Machine learning-based pattern detection
- WebGPT - GPT-powered web interaction
- ScrapingBee - Anti-bot bypass service
- Bright Data - Proxy + scraping platform
- Zyte - Enterprise web data platform
## Planned Features

- Multi-page crawling with link discovery
- Batch processing for multiple URLs
- CLI tool (`crawlstudio crawl <url>`)
- Content deduplication and similarity detection
- Rate limiting and respectful crawling policies
- Caching system with Redis/disk storage
- Webhook integrations for real-time notifications
- GraphQL API for programmatic access
- Docker containerization for easy deployment
## Roadmap

- Core Features (current): 4 working backends
- Recursive Crawling: depth-based multi-page crawling
- CLI Tool: `pip install crawlstudio` → command-line usage
- Additional Backends: Playwright, Selenium, BeautifulSoup
- Enterprise Features: batch processing, advanced caching
- AI Integration: more AI-powered extraction capabilities
- Cloud Platform: SaaS offering with a web interface