This project provides an API for analyzing and classifying websites. It uses FastAPI to expose endpoints for submitting websites and retrieving classification results. The analysis pipeline is orchestrated by Apache Airflow and uses pure Python web scraping (httpx + BeautifulSoup).
- API Endpoints: Submit a website URL for analysis and retrieve classification results.
- Airflow Orchestration: Website analysis and classification are managed as Airflow DAGs.
- Pure Python Web Scraping: Uses httpx and BeautifulSoup for website crawling (no Scrapy dependency).
- LLM Analysis: OpenAI integration for website summarization and categorization.
- Git Integration: Project and pipeline are version controlled with Git.
The system uses a pure Airflow approach:
- Website Record Creation: Creates/updates website records in PostgreSQL
- Web Crawling: Pure Python crawler using httpx and BeautifulSoup
- Content Processing: Optional post-processing of crawled content
- LLM Analysis: OpenAI-powered website summarization and categorization
- Install dependencies: `pip install -r requirements.txt`
- Set up the database: `python migrations/001_create_complete_schema.py`
- Set environment variables: `export OPENAI_API_KEY="your-openai-api-key"`
- Initialize Airflow (see the Airflow docs for setup).
- Run the FastAPI app: `uvicorn webapp.main:app --reload`
The application uses PostgreSQL with two main tables.

The `websites` table:
- id: Primary key
- domain: Unique domain identifier
- name: Website name (extracted from the page title)
- summary: LLM-generated website description
- category: LLM-generated industry category
- created_at: Timestamp when the record was created
- updated_at: Timestamp when the summary/category was last updated

The crawled-pages table:
- id: Primary key
- website_id: Foreign key to the `websites` table
- url: Unique URL identifier
- title: Page title
- content: Page content (text from paragraphs)
- created_at: Timestamp when the record was created
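Given the fields above, the migration might create tables along these lines. This is a hypothetical sketch: the column types and the second table's name (`pages`) are assumptions, and the real schema is created by `migrations/001_create_complete_schema.py`:

```python
# Hypothetical DDL matching the fields described above; not the
# project's actual migration.
WEBSITES_DDL = """
CREATE TABLE IF NOT EXISTS websites (
    id SERIAL PRIMARY KEY,
    domain TEXT UNIQUE NOT NULL,
    name TEXT,
    summary TEXT,
    category TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
"""

PAGES_DDL = """
CREATE TABLE IF NOT EXISTS pages (
    id SERIAL PRIMARY KEY,
    website_id INTEGER REFERENCES websites(id),
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
"""
```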
- POST `/analyze` with `{ "url": "https://example.com" }` to submit a website for analysis.
- GET `/result?url=https://example.com` to retrieve the classification result.
- Simplified Architecture: Single orchestration platform instead of Scrapy + Airflow
- Reduced Dependencies: No need for Scrapy framework and its dependencies
- Better Integration: Direct database access from Airflow tasks
- Easier Debugging: All logic in Python functions within Airflow
- Consistent Error Handling: Airflow's built-in retry and monitoring capabilities