# CrawlStudio

A unified wrapper for web crawling tools, inspired by modular, community-driven design.

## Vision

CrawlStudio provides a unified Python API for various web crawling backends including Firecrawl, Crawl4AI, Scrapy, and Browser-Use (AI-driven). It emphasizes modularity, ease of use, and intelligent extraction capabilities.

## Installation

```bash
pip install crawlstudio
```

## Usage Examples

### Firecrawl Example

```python
import asyncio
from crawlstudio import CrawlConfig, FirecrawlBackend

async def main():
    config = CrawlConfig()
    backend = FirecrawlBackend(config)
    result = await backend.crawl("https://www.bloomberg.com/", format="markdown")
    print(result.markdown)

asyncio.run(main())
```

### Crawl4AI Example

```python
import asyncio
from crawlstudio import CrawlConfig, Crawl4AIBackend

async def main():
    config = CrawlConfig()
    backend = Crawl4AIBackend(config)
    result = await backend.crawl("https://finance.yahoo.com/", format="structured")
    print(result.structured_data)  # Outputs title, summary, keywords

asyncio.run(main())
```

### Scrapy Example

```python
import asyncio
from crawlstudio import CrawlConfig, ScrapyBackend

async def main():
    config = CrawlConfig()
    backend = ScrapyBackend(config)
    result = await backend.crawl("https://www.bloomberg.com/", format="html")
    print(result.raw_html)

asyncio.run(main())
```

### Browser-Use (AI-Driven) Example

```python
import asyncio
from crawlstudio import CrawlConfig, BrowserUseBackend

async def main():
    config = CrawlConfig()
    backend = BrowserUseBackend(config)
    result = await backend.crawl("https://example.com", format="structured")
    print(result.structured_data)  # AI-extracted data

asyncio.run(main())
```

**Note:** The Browser-Use backend requires `pip install browser-use` and an AI API key (OpenAI or Anthropic). See `BROWSER_USE_SETUP.md` for details.

## Backend Comparison

| Backend | Speed | Cost | AI Intelligence | Best For |
|---|---|---|---|---|
| Firecrawl | ⚡ Fast | API costs | Medium | Production scraping |
| Crawl4AI | 🐌 Medium | Free | Medium | Development & testing |
| Scrapy | 🚀 Fastest | Free | Low | Simple HTML extraction |
| Browser-Use | 🧠 Slower | AI costs | High | Complex dynamic sites |
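The trade-offs in this table can be reduced to a simple rule of thumb. The sketch below is illustrative only (the `choose_backend` helper is not part of CrawlStudio); it merely encodes the table as a decision function returning a backend class name.

```python
def choose_backend(dynamic_site: bool = False, budget_free: bool = True,
                   need_speed: bool = False) -> str:
    """Map the comparison-table trade-offs to a backend name.

    Illustrative helper, not part of CrawlStudio.
    """
    if dynamic_site:
        # JavaScript-heavy, complex pages benefit from AI-driven browsing.
        return "BrowserUseBackend"
    if need_speed and budget_free:
        # Scrapy is the fastest free option for plain HTML extraction.
        return "ScrapyBackend"
    if budget_free:
        # Crawl4AI: free, with medium extraction intelligence.
        return "Crawl4AIBackend"
    # Paid API, fast, production-grade scraping.
    return "FirecrawlBackend"
```

For example, a free development setup for a static site would resolve to `Crawl4AIBackend`, while a JavaScript-heavy dashboard would resolve to `BrowserUseBackend`.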

## Future Enhancements

### 🔄 Recursive Crawling (Planned)

```python
# Future API - configurable depth and page limits
config = CrawlConfig(
    max_depth=3,                    # Crawl up to 3 levels deep
    max_pages_per_level=5,          # Max 5 pages per depth level
    recursive_delay=1.0,            # 1 second delay between requests
    follow_external_links=False     # Stay within same domain
)

# Recursive crawling with depth control
result = await backend.crawl_recursive("https://example.com", format="markdown")
print(f"Crawled {len(result.pages)} pages across {result.max_depth_reached} levels")
```
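Until `crawl_recursive` ships, a depth-limited crawl can be approximated on top of the current single-page `crawl` API. A hedged sketch, reusing the planned `max_depth` / `max_pages_per_level` semantics and assuming each result exposes a `links` attribute (hypothetical; the real result fields may differ):

```python
import asyncio

async def crawl_recursive(backend, start_url, max_depth=3,
                          max_pages_per_level=5, fmt="markdown"):
    """Depth-limited BFS sketch over the current one-page `crawl` API.

    Illustrative only: assumes results carry a `links` attribute
    (hypothetical); the real recursive API is still planned.
    """
    seen, pages = {start_url}, []
    frontier = [start_url]
    for depth in range(max_depth):
        if not frontier:
            break
        # Crawl the whole level concurrently; gather preserves input order.
        results = await asyncio.gather(
            *(backend.crawl(u, format=fmt) for u in frontier))
        pages.extend(results)
        # Collect unseen links for the next level, capped per level.
        nxt = []
        for r in results:
            for link in getattr(r, "links", []):
                if link not in seen and len(nxt) < max_pages_per_level:
                    seen.add(link)
                    nxt.append(link)
        frontier = nxt
    return pages
```

A `recursive_delay` between levels could be added with `await asyncio.sleep(...)` inside the loop.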

### 🚀 Additional Crawler Backends (Roadmap)

#### High Priority

**Specialized Crawlers**

- Apify SDK - Cloud scraping platform
- Colly (via Python bindings) - High-performance Go crawler
- Puppeteer (via pyppeteer) - Headless Chrome control

**AI-Enhanced Crawlers**

**Enterprise/Commercial**

### 🛠️ Advanced Features (Future Versions)

- Multi-page crawling with link discovery
- Batch processing for multiple URLs
- CLI tool (`crawlstudio crawl <url>`)
- Content deduplication and similarity detection
- Rate limiting and respectful crawling policies
- Caching system with Redis/disk storage
- Webhook integrations for real-time notifications
- GraphQL API for programmatic access
- Docker containerization for easy deployment
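Batch processing is on the list above, but the existing async `crawl` method already composes with `asyncio.gather`. A minimal sketch (the `crawl_many` helper and its semaphore-based concurrency cap are illustrative, not part of CrawlStudio):

```python
import asyncio

async def crawl_many(backend, urls, max_concurrent=5, fmt="markdown"):
    """Crawl several URLs concurrently with any CrawlStudio backend.

    Illustrative helper, not part of the library. `backend` is anything
    exposing `await backend.crawl(url, format=...)`.
    """
    sem = asyncio.Semaphore(max_concurrent)  # polite concurrency cap

    async def one(url):
        async with sem:
            return await backend.crawl(url, format=fmt)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(one(u) for u in urls))
```

For example, `asyncio.run(crawl_many(ScrapyBackend(config), urls))` would return one result object per URL, in input order.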

### 🎯 10K GitHub Stars Roadmap

1. **Core Features (Current):** 4 working backends
2. **Recursive Crawling:** Depth-based multi-page crawling
3. **CLI Tool:** `pip install crawlstudio` → command-line usage
4. **Additional Backends:** Playwright, Selenium, BeautifulSoup
5. **Enterprise Features:** Batch processing, advanced caching
6. **AI Integration:** More AI-powered extraction capabilities
7. **Cloud Platform:** SaaS offering with a web interface
