This project provides an API for analyzing and classifying websites. It uses FastAPI to expose endpoints for submitting websites and retrieving classification results. The analysis pipeline is orchestrated by Apache Airflow and uses pure Python web scraping (httpx + BeautifulSoup).
- API Endpoints: Submit a website URL for analysis and retrieve classification results.
- Airflow Orchestration: Website analysis and classification are managed as Airflow DAGs.
- Pure Python Web Scraping: Uses httpx and BeautifulSoup for website crawling (no Scrapy dependency).
- LLM Analysis: OpenAI integration for website summarization and categorization.
- Git Integration: Project and pipeline are version controlled with Git.
The system uses a pure Airflow approach:
- Website Record Creation: Creates/updates website records in PostgreSQL
- Web Crawling: Pure Python crawler using httpx and BeautifulSoup
- Content Processing: Optional post-processing of crawled content
- LLM Analysis: OpenAI-powered website summarization and categorization
- Install dependencies: `pip install -r requirements.txt`
- Set up the database: `python migrations/001_create_complete_schema.py`
- Set environment variables: `export OPENAI_API_KEY="your-openai-api-key"`
- Initialize Airflow (see the Airflow docs for setup).
- Run the FastAPI app: `uvicorn webapp.main:app --reload`
The application uses PostgreSQL with two main tables.

The `websites` table:
- id: Primary key
- domain: Unique domain identifier
- name: Website name (extracted from the page title)
- summary: LLM-generated website description
- category: LLM-generated industry category
- created_at: Timestamp when the record was created
- updated_at: Timestamp when the summary/category was last updated

The crawled-pages table:
- id: Primary key
- website_id: Foreign key to the `websites` table
- url: Unique URL identifier
- title: Page title
- content: Page content (text from paragraphs)
- created_at: Timestamp when the record was created
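Given the fields above, the migration might create tables along these lines. This is a hypothetical sketch: the column types and the second table's name (`pages`) are assumptions, and the real schema is created by `migrations/001_create_complete_schema.py`:

```python
# Hypothetical DDL matching the fields described above; not the
# project's actual migration.
WEBSITES_DDL = """
CREATE TABLE IF NOT EXISTS websites (
    id SERIAL PRIMARY KEY,
    domain TEXT UNIQUE NOT NULL,
    name TEXT,
    summary TEXT,
    category TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
"""

PAGES_DDL = """
CREATE TABLE IF NOT EXISTS pages (
    id SERIAL PRIMARY KEY,
    website_id INTEGER REFERENCES websites(id),
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
"""
```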
- POST `/analyze` with `{ "url": "https://example.com" }` to submit a website for analysis.
- GET `/result?url=https://example.com` to retrieve the classification result.
- Simplified Architecture: Single orchestration platform instead of Scrapy + Airflow
- Reduced Dependencies: No need for Scrapy framework and its dependencies
- Better Integration: Direct database access from Airflow tasks
- Easier Debugging: All logic in Python functions within Airflow
- Consistent Error Handling: Airflow's built-in retry and monitoring capabilities