A collection of practical LLM projects for learning and experimentation.
We're surrounded by information, but often lack the time to digest it all. Whether you're researching competitors, staying on top of industry news, evaluating tools, or simply trying to understand what a company does before a meeting - manually reading through websites is time-consuming.
This project explores how LLMs can help us quickly extract insights from web content while keeping everything local and private.
- Research & Discovery - Quickly understand what a company or product does
- Competitive Analysis - Get the gist of competitor websites without reading every page
- Meeting Prep - Summarize a client's or partner's website before a call
- Content Curation - Evaluate if an article is worth a deep read
- Learning - Understand how to build LLM-powered tools from scratch
- Don't scrape sites that prohibit it - Check robots.txt and Terms of Service
- Don't use for mass scraping - This is for occasional, personal use
- Don't rely on it for critical decisions - LLMs can miss nuance or hallucinate
- Don't scrape login-protected content - Respect access controls
- Don't use commercially without consideration - Website content belongs to its owners
A note on responsibility: Web scraping exists in a gray area. This tool is meant for personal productivity and learning. Be respectful of website owners, don't hammer servers with requests, and always consider whether your use case is ethical and legal.
A web scraper + LLM-powered summarizer with a clean Gradio UI. Paste any URL and get an AI-generated summary.
Features:
- Smart scraping with BeautifulSoup + Playwright fallback (handles JS-heavy sites)
- Streaming responses for real-time output
- Two LLM providers:
  - OpenAI (paid) - GPT-4o, GPT-4o-mini, etc.
  - Ollama (free) - Llama 3.2, Mistral, Gemma, and more (runs 100% locally)
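Gradio renders streaming output by consuming a generator that yields the text accumulated so far. A minimal sketch of that pattern (the function name `stream_accumulated` is illustrative, not from this repo):

```python
from typing import Iterable, Iterator

def stream_accumulated(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the running text after each chunk arrives; Gradio
    re-renders the output box with each yielded value."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text
```

Wiring a generator like this into a Gradio event handler is what makes the summary appear token by token instead of all at once.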
Learning Path:
1. Start with `summarizer_tutorial.ipynb` - understand the fundamentals
2. Explore `scraper.py` - see how web scraping works
3. Check out `app.py` - learn how to build a UI with Gradio
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/LLM_Projects_Notes.git
cd LLM_Projects_Notes

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -e .
```

```bash
# Copy the example env file
cp .env.example .env

# Edit .env and add your API keys
# For OpenAI: Get key at https://platform.openai.com/api-keys
```

```bash
# Install browser binaries
playwright install chromium
```

```bash
# Activate the virtual environment (if using uv)
source .venv/bin/activate   # Linux/macOS
# or
.venv\Scripts\activate      # Windows

# Run the Gradio app
python 1_llm_website_summarizer/app.py
```

Open http://127.0.0.1:7860 in your browser.
Ollama lets you run LLMs entirely on your machine - no API keys, no costs, complete privacy.
1. Install Ollama: https://ollama.com/download

2. Pull a model:

   ```bash
   # Recommended for most machines (4.7GB)
   ollama pull llama3.2

   # Lighter alternative (2.0GB)
   ollama pull phi3

   # More powerful if you have the RAM (8GB+)
   ollama pull llama3.1
   ```

3. Start the Ollama server:

   ```bash
   ollama serve
   ```

4. In the app, select "Ollama" as the provider and choose your model.
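Once `ollama serve` is running, the app can reach Ollama's local REST API (by default at `http://localhost:11434`). A rough sketch of a non-streaming chat call using only the standard library (the helper names are hypothetical; the app itself may use a dedicated client library instead):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response, not chunks
    }

def ollama_chat(model: str, prompt: str) -> str:
    """Send a chat request to a locally running Ollama server."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling `ollama_chat("llama3.2", "Summarize: ...")` requires the server to be running; nothing here leaves your machine.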
```
LLM_Projects_Notes/
├── 1_llm_website_summarizer/
│   ├── app.py                     # Gradio web UI (the final product)
│   ├── scraper.py                 # Web scraping utilities
│   └── summarizer_tutorial.ipynb  # Tutorial notebook (start here!)
├── .env.example                   # Example environment variables
├── .gitignore                     # Git ignore rules
├── pyproject.toml                 # Project dependencies
└── README.md                      # This file
```
| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | For OpenAI | Your OpenAI API key |
| `ANTHROPIC_API_KEY` | Optional | For future Anthropic support |
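These variables are typically loaded at startup (commonly via `python-dotenv`). For illustration only, here is a minimal, hypothetical parser for the `KEY=VALUE` format a `.env` file uses:

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return env
```

In practice, prefer a maintained loader such as `python-dotenv`; this sketch just shows what the format amounts to.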
In `app.py`, you can adjust:
- `max_chars`: Maximum characters to scrape (default: 5000)
- Playwright fallback is automatic for JS-heavy sites
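To see what `max_chars` truncation involves, here is a hypothetical word-boundary version of the trim (the real scraper may simply slice the string):

```python
def truncate_text(text: str, max_chars: int = 5000) -> str:
    """Trim scraped text to max_chars, cutting at the last space
    so a word isn't split mid-token."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    return cut[:cut.rfind(" ")] if " " in cut else cut
```

Keeping the limit modest keeps prompts cheap and within the model's context window.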
- Add a new function in `app.py` following the pattern of `summarize_with_openai()` or `summarize_with_ollama()`
- Add the provider to the dropdown choices
- Update the `summarize_website()` function to handle the new provider
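Those three steps amount to a dispatch table keyed by provider name. A sketch with stand-in summarizer functions (all bodies here are illustrative placeholders, not the repo's actual code):

```python
def summarize_with_openai(text: str) -> str:
    """Stand-in for the real OpenAI-backed summarizer."""
    return f"[openai] {text[:20]}"

def summarize_with_ollama(text: str) -> str:
    """Stand-in for the real Ollama-backed summarizer."""
    return f"[ollama] {text[:20]}"

# Map the dropdown label to the implementation
PROVIDERS = {
    "OpenAI": summarize_with_openai,
    "Ollama": summarize_with_ollama,
}

def summarize_website(text: str, provider: str) -> str:
    """Dispatch to the chosen provider's summarizer."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider}")
    return PROVIDERS[provider](text)
```

With this shape, adding a provider is one new function plus one dictionary entry.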
Edit the `SYSTEM_PROMPT` constant in `app.py` to change how summaries are generated.
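For context, the system prompt is paired with the scraped page text in the messages list sent to the model. A sketch of that pairing (the prompt wording and helper name below are invented for illustration):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that summarizes website content. "
    "Respond in concise markdown."
)  # illustrative wording; the real constant lives in app.py

def build_messages(page_text: str) -> list[dict]:
    """Combine the system prompt with the scraped page text."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Summarize this website:\n\n{page_text}"},
    ]
```

Changing the system prompt changes tone, length, and format of every summary without touching any other code.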
```python
from scraper import fetch_website_contents, fetch_website_links

# Get website text content
content = fetch_website_contents("https://example.com", max_chars=5000)

# Get all links on a page
links = fetch_website_links("https://example.com")
```

The scraper includes a `fetch_website_links()` function that extracts all links from a page. Here are some ideas to build on this project:
Crawl an entire website by following links and summarize each page. Useful for getting a complete picture of a company or product.
```python
from scraper import fetch_website_contents, fetch_website_links

# Get all links from homepage
links = fetch_website_links("https://example.com")

# Filter to same-domain links, then summarize each
for link in links[:10]:  # Limit to avoid hammering the server
    summary = summarize_website(link)
    print(f"## {link}\n{summary}\n")
```

Find broken links or analyze where a website links to (external dependencies, partners, etc.).
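The "filter to same-domain links" step can be done with `urllib.parse` from the standard library (the helper name here is illustrative):

```python
from urllib.parse import urlparse

def same_domain(links: list[str], base_url: str) -> list[str]:
    """Keep only links whose host matches base_url's host."""
    base = urlparse(base_url).netloc
    return [link for link in links if urlparse(link).netloc == base]
```

This keeps a crawl from wandering off to social-media widgets and ad domains.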
Given a topic, scrape multiple sources and generate a consolidated summary with citations.
Periodically scrape a page and use an LLM to identify what changed since last time.
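One way to sketch this: fingerprint the page text to detect any change cheaply, then hand a unified diff to the LLM to describe what changed. Both helpers below are illustrative and standard-library-only:

```python
import difflib
import hashlib

def page_fingerprint(text: str) -> str:
    """Cheap change check: hash the page text."""
    return hashlib.sha256(text.encode()).hexdigest()

def describe_change(old: str, new: str, context: int = 1) -> str:
    """Unified diff to feed the LLM as raw 'what changed' material."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=context
    )
    return "\n".join(diff)
```

Only invoke the LLM when the fingerprints differ; the diff keeps the prompt focused on the changed lines rather than the whole page.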
- Make sure you created the `.env` file (copy from `.env.example`)
- Check that `OPENAI_API_KEY` is set correctly
- Restart the app after changing `.env`
- Make sure Ollama is installed: https://ollama.com/download
- Start the server: `ollama serve`
- Check it's running: `ollama list`
- Pull the model first: `ollama pull llama3.2`
- Check available models: `ollama list`
- Some sites are heavily JavaScript-based
- Playwright fallback should handle most cases
- Very dynamic sites (React SPAs) may still be challenging
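A common heuristic for deciding when to fall back to Playwright is to check whether static scraping produced a reasonable amount of text; the actual condition in `scraper.py` may differ:

```python
def should_use_playwright(extracted_text: str, min_chars: int = 200) -> bool:
    """If static HTML parsing yielded almost no text, the page is
    probably rendered client-side and needs a real browser."""
    return len(extracted_text.strip()) < min_chars
```

The threshold is a judgment call: too low and JS-heavy pages slip through with empty summaries; too high and plain pages pay the cost of launching a browser.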
- API Keys: Never commit `.env` files. The `.gitignore` is configured to exclude them.
- Local Only: The Gradio app runs on `127.0.0.1` only, not exposed to your network.
- No Cloud: This app is designed for local use. No data is sent anywhere except to the LLM provider you choose.
- Private URLs Blocked: The scraper blocks requests to localhost and private IP ranges.
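Blocking private destinations typically means resolving the host and checking the resulting address with the `ipaddress` module. An illustrative sketch (not necessarily the scraper's exact logic):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_private_url(url: str) -> bool:
    """Return True if the URL points at localhost or a private/loopback IP."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # fail closed on unresolvable hosts
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

This guards against SSRF-style mistakes, e.g. pasting an internal dashboard URL into the summarizer.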
MIT License - feel free to use, modify, and distribute.
Contributions welcome! Please open an issue or PR.