dwhogan/site-sense
Site Sense

This project provides an API for analyzing and classifying websites. It uses FastAPI to expose endpoints for submitting websites and retrieving classification results. The analysis pipeline is orchestrated by Apache Airflow and uses pure Python web scraping.

Features

  • API Endpoints: Submit a website URL for analysis and retrieve classification results.
  • Airflow Orchestration: Website analysis and classification are managed as Airflow DAGs.
  • Pure Python Web Scraping: Uses httpx and BeautifulSoup for website crawling (no Scrapy dependency).
  • LLM Analysis: OpenAI integration for website summarization and categorization.
  • Git Integration: Project and pipeline are version controlled with Git.

Architecture

The system uses a pure Airflow approach:

  1. Website Record Creation: Creates/updates website records in PostgreSQL
  2. Web Crawling: Pure Python crawler using httpx and BeautifulSoup
  3. Content Processing: Optional post-processing of crawled content
  4. LLM Analysis: OpenAI-powered website summarization and categorization
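The crawling step (2) uses httpx and BeautifulSoup in the actual DAG; the stdlib-only sketch below illustrates the same idea — fetch a page and pull out the title and paragraph text that feed the webpages table. The `PageExtractor` class and `extract_page` function are illustrative names, not part of the project's code.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects the <title> and paragraph text, mirroring the fields
    stored in the webpages table (title, content)."""
    def __init__(self):
        super().__init__()
        self._tag = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data

def extract_page(html: str) -> dict:
    # In the real pipeline the HTML would come from an httpx.get() call;
    # here we just parse a string so the sketch stands on its own.
    parser = PageExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "content": "\n".join(p.strip() for p in parser.paragraphs),
    }
```

The same extraction could equally be done with `BeautifulSoup(html, "html.parser")`, which is what the project's crawler depends on.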

Setup

  1. Install dependencies:
    pip install -r requirements.txt
  2. Set up the database:
    python migrations/001_create_complete_schema.py
  3. Set environment variables:
    export OPENAI_API_KEY="your-openai-api-key"
  4. Initialize Airflow (see Airflow docs for setup).
  5. Run the FastAPI app:
    uvicorn webapp.main:app --reload

Database Schema

The application uses PostgreSQL with two main tables:

Websites Table

  • id: Primary key
  • domain: Unique domain identifier
  • name: Website name (extracted from title)
  • summary: LLM-generated website description
  • category: LLM-generated industry category
  • created_at: Timestamp when record was created
  • updated_at: Timestamp when summary/category was last updated

Webpages Table

  • id: Primary key
  • website_id: Foreign key to websites table
  • url: Unique URL identifier
  • title: Page title
  • content: Page content (text from paragraphs)
  • created_at: Timestamp when record was created
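The two tables above can be sketched as DDL. The snippet below uses an in-memory SQLite database purely so the example is self-contained; the real migration (`migrations/001_create_complete_schema.py`) targets PostgreSQL, and the exact column types and constraints shown here are assumptions inferred from the field descriptions.

```python
import sqlite3

# Assumed schema, mirroring the field list in this README.
# SQLite stands in for PostgreSQL so the sketch runs anywhere.
SCHEMA = """
CREATE TABLE websites (
    id         INTEGER PRIMARY KEY,
    domain     TEXT UNIQUE NOT NULL,   -- unique domain identifier
    name       TEXT,                   -- extracted from page title
    summary    TEXT,                   -- LLM-generated description
    category   TEXT,                   -- LLM-generated industry category
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT
);
CREATE TABLE webpages (
    id         INTEGER PRIMARY KEY,
    website_id INTEGER NOT NULL REFERENCES websites(id),
    url        TEXT UNIQUE NOT NULL,   -- unique URL identifier
    title      TEXT,
    content    TEXT,                   -- text from paragraphs
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO websites (domain, name) VALUES (?, ?)",
             ("example.com", "Example"))
conn.execute("INSERT INTO webpages (website_id, url, title) VALUES (1, ?, ?)",
             ("https://example.com/", "Example Domain"))
```

The one-to-many relationship (one website, many crawled pages) is carried by the `website_id` foreign key.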

Usage

  • POST /analyze with { "url": "https://example.com" } to submit a website for analysis.
  • GET /result?url=https://example.com to retrieve the classification result.
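Assuming a local dev server at `http://localhost:8000` (the default `uvicorn` address; host and port are assumptions), the two calls can be built like this. The sketch only constructs the requests so it runs without a live server:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumed local uvicorn address

# POST /analyze — submit a website for analysis
analyze_req = urllib.request.Request(
    f"{BASE}/analyze",
    data=json.dumps({"url": "https://example.com"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# GET /result — retrieve the classification result (URL must be query-encoded)
query = urllib.parse.urlencode({"url": "https://example.com"})
result_url = f"{BASE}/result?{query}"

# Against a running server:
#   urllib.request.urlopen(analyze_req)
#   urllib.request.urlopen(result_url)
```

Any HTTP client (curl, httpx, requests) works the same way; the only detail worth noting is that the submitted URL goes in a JSON body for POST but must be percent-encoded for the GET query string.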

Benefits of Pure Airflow Approach

  • Simplified Architecture: Single orchestration platform instead of Scrapy + Airflow
  • Reduced Dependencies: No need for Scrapy framework and its dependencies
  • Better Integration: Direct database access from Airflow tasks
  • Easier Debugging: All logic in Python functions within Airflow
  • Consistent Error Handling: Airflow's built-in retry and monitoring capabilities
