Unified wrapper for web crawling tools, inspired by modular, community-driven design.
CrawlStudio provides a unified Python API for various web crawling backends including Firecrawl, Crawl4AI, Scrapy, and Browser-Use (AI-driven). It emphasizes modularity, ease of use, and intelligent extraction capabilities.
```bash
pip install crawlstudio
```

```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://www.bloomberg.com/", format="markdown")
    print(result.markdown)

asyncio.run(main())
```

```python
import asyncio
from crawlstudio import CrawlConfig, Crawl4AIBackend

async def main():
    config = CrawlConfig()
    backend = Crawl4AIBackend(config)
    result = await backend.crawl("https://finance.yahoo.com/", format="structured")
    print(result.structured_data)  # Outputs title, summary, keywords

asyncio.run(main())
```

```python
import asyncio
from crawlstudio import CrawlConfig, ScrapyBackend

async def main():
    config = CrawlConfig()
    backend = ScrapyBackend(config)
    result = await backend.crawl("https://www.bloomberg.com/", format="html")
    print(result.raw_html)

asyncio.run(main())
```

```python
import asyncio
from crawlstudio import CrawlConfig, BrowserUseBackend

async def main():
    config = CrawlConfig()
    backend = BrowserUseBackend(config)
    result = await backend.crawl("https://example.com", format="structured")
    print(result.structured_data)  # AI-extracted data

asyncio.run(main())
```

Note: the Browser-Use backend requires `pip install browser-use` and an AI API key (OpenAI or Anthropic). See BROWSER_USE_SETUP.md for details.
| Backend | Speed | Cost | AI Intelligence | Best For |
|---|---|---|---|---|
| Firecrawl | ⚡ Fast | API costs | Medium | Production scraping |
| Crawl4AI | 🐌 Medium | Free | Medium | Development & testing |
| Scrapy | 🚀 Fastest | Free | Low | Simple HTML extraction |
| Browser-Use | 🧠 Slower | AI costs | High | Complex dynamic sites |
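One way to read the table above is as a decision rule: pick the backend by workload. The helper below is purely illustrative (the workload keys and the fallback choice are assumptions drawn from the "Best For" column, not part of CrawlStudio's API); the class names match the backends shown in the examples.

```python
# Illustrative mapping from workload to backend class name, following the
# comparison table. Not a CrawlStudio API - just a selection sketch.
BACKEND_FOR = {
    "production": "FirecrawlBackend",    # fast, but incurs API costs
    "development": "Crawl4AIBackend",    # free, good for testing
    "simple-html": "ScrapyBackend",      # fastest, low intelligence
    "dynamic": "BrowserUseBackend",      # AI-driven, handles complex sites
}

def choose_backend(workload: str) -> str:
    # Default to Crawl4AIBackend: free and requires no API key.
    return BACKEND_FOR.get(workload, "Crawl4AIBackend")
```

In real code you would resolve the returned name to the corresponding class import and instantiate it with a `CrawlConfig`.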
```python
# Future API - configurable depth and page limits
config = CrawlConfig(
    max_depth=3,                 # Crawl up to 3 levels deep
    max_pages_per_level=5,       # Max 5 pages per depth level
    recursive_delay=1.0,         # 1 second delay between requests
    follow_external_links=False  # Stay within same domain
)

# Recursive crawling with depth control
result = await backend.crawl_recursive("https://example.com", format="markdown")
print(f"Crawled {len(result.pages)} pages across {result.max_depth_reached} levels")
```

- Playwright - Fast browser automation, excellent for SPAs
- Selenium - Industry standard, huge ecosystem
- BeautifulSoup + Requests - Lightweight, simple parsing
- Apify SDK - Cloud scraping platform
- Colly (via Python bindings) - High-performance Go crawler
- Puppeteer (via pyppeteer) - Headless Chrome control
- ScrapeGraphAI - LLM-powered scraping
- AutoScraper - Machine learning-based pattern detection
- WebGPT - GPT-powered web interaction
- ScrapingBee - Anti-bot bypass service
- Bright Data - Proxy + scraping platform
- Zyte - Enterprise web data platform
- Multi-page crawling with link discovery
- Batch processing for multiple URLs
- CLI tool (`crawlstudio crawl <url>`)
- Content deduplication and similarity detection
- Rate limiting and respectful crawling policies
- Caching system with Redis/disk storage
- Webhook integrations for real-time notifications
- GraphQL API for programmatic access
- Docker containerization for easy deployment
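Batch processing is on the list above, but a minimal version can already be sketched on top of the current async API: fan out `backend.crawl()` calls with `asyncio.gather`, bounded by a semaphore so the crawler stays polite. The `crawl_many` helper below is a hypothetical illustration, not part of CrawlStudio today.

```python
import asyncio

# Hypothetical batch helper: `crawl` is any async callable such as
# lambda u: backend.crawl(u, format="markdown").
async def crawl_many(crawl, urls, limit=3):
    # Semaphore caps concurrency at `limit` in-flight requests.
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:
            return await crawl(url)

    # return_exceptions=True so one failing URL doesn't sink the whole batch;
    # failed entries come back as exception objects in the results list.
    return await asyncio.gather(*(one(u) for u in urls), return_exceptions=True)
```

With a real backend this would be called as `await crawl_many(lambda u: backend.crawl(u, format="markdown"), urls)`, and the results list inspected for exceptions before use.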
- Core Features (Current): 4 working backends
- Recursive Crawling: Depth-based multi-page crawling
- CLI Tool: `pip install crawlstudio` → command-line usage
- Additional Backends: Playwright, Selenium, BeautifulSoup
- Enterprise Features: Batch processing, advanced caching
- AI Integration: More AI-powered extraction capabilities
- Cloud Platform: SaaS offering with web interface