LLM Projects & Notes

A collection of practical LLM projects for learning and experimentation.


Why This Project?

We're surrounded by information, but often lack the time to digest it all. Whether you're researching competitors, staying on top of industry news, evaluating tools, or simply trying to understand what a company does before a meeting - manually reading through websites is time-consuming.

This project explores how LLMs can help us quickly extract insights from web content while keeping everything local and private.

When This Is Useful

  • Research & Discovery - Quickly understand what a company or product does
  • Competitive Analysis - Get the gist of competitor websites without reading every page
  • Meeting Prep - Summarize a client's or partner's website before a call
  • Content Curation - Evaluate if an article is worth a deep read
  • Learning - Understand how to build LLM-powered tools from scratch

When NOT to Use This

  • Don't scrape sites that prohibit it - Check robots.txt and Terms of Service
  • Don't use for mass scraping - This is for occasional, personal use
  • Don't rely on it for critical decisions - LLMs can miss nuance or hallucinate
  • Don't scrape login-protected content - Respect access controls
  • Don't use commercially without consideration - Website content belongs to its owners

A note on responsibility: Web scraping exists in a gray area. This tool is meant for personal productivity and learning. Be respectful of website owners, don't hammer servers with requests, and always consider whether your use case is ethical and legal.


Projects

1. Website Summarizer (1_llm_website_summarizer/)

A web scraper + LLM-powered summarizer with a clean Gradio UI. Paste any URL and get an AI-generated summary.

Features:

  • Smart scraping with BeautifulSoup + Playwright fallback (handles JS-heavy sites)
  • Streaming responses for real-time output
  • Two LLM providers:
    • OpenAI (paid) - GPT-4o, GPT-4o-mini, etc.
    • Ollama (free) - Llama 3.2, Mistral, Gemma, and more (runs 100% locally)
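At its core, the scraping step boils down to stripping markup and truncating the result. A stdlib-only sketch of that idea — scraper.py itself uses BeautifulSoup, with a Playwright fallback for JS-heavy pages:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents —
    the kind of cleanup scraper.py does (a sketch, not the actual code)."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str, max_chars: int = 5000) -> str:
    """Visible text of an HTML document, truncated to max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)[:max_chars]
```

The truncation matters: LLM context is finite, so the scraper caps how much page text is sent to the model.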

Learning Path:

  1. Start with summarizer_tutorial.ipynb - understand the fundamentals
  2. Explore scraper.py - see how web scraping works
  3. Check out app.py - learn how to build a UI with Gradio

Quick Start

Prerequisites

  • Python 3.11+
  • uv (recommended) or pip
  • Ollama (optional, for free local models)

Installation

# Clone the repository
git clone https://github.com/YOUR_USERNAME/LLM_Projects_Notes.git
cd LLM_Projects_Notes

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -e .

Setup Environment Variables

# Copy the example env file
cp .env.example .env

# Edit .env and add your API keys
# For OpenAI: Get key at https://platform.openai.com/api-keys

Install Playwright (for JS-heavy sites)

# Install browser binaries
playwright install chromium

Run the App

# Activate the virtual environment (if using uv)
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Run the Gradio app
python 1_llm_website_summarizer/app.py

Open http://127.0.0.1:7860 in your browser.


Using Ollama (Free, Local)

Ollama lets you run LLMs entirely on your machine - no API keys, no costs, complete privacy.

Setup Ollama

  1. Install Ollama: https://ollama.com/download

  2. Pull a model:

    # Recommended for most machines (4.7GB)
    ollama pull llama3.2
    
    # Lighter alternative (2.0GB)
    ollama pull phi3
    
    # More powerful if you have the RAM (8GB+)
    ollama pull llama3.1
  3. Start Ollama server:

    ollama serve
  4. In the app, select "Ollama" as the provider and choose your model.
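Under the hood, the app talks to Ollama's local HTTP API. A minimal stdlib-only sketch of that call, assuming the default endpoint (the actual request code lives in app.py):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (stream=False for a one-shot reply)."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server and return its reply text."""
    body = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running, `ask_ollama("llama3.2", "Say hi")` returns the model's text — no API key involved.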


Project Structure

LLM_Projects_Notes/
├── 1_llm_website_summarizer/
│   ├── app.py                   # Gradio web UI (the final product)
│   ├── scraper.py               # Web scraping utilities
│   └── summarizer_tutorial.ipynb # Tutorial notebook (start here!)
├── .env.example                 # Example environment variables
├── .gitignore                   # Git ignore rules
├── pyproject.toml               # Project dependencies
└── README.md                    # This file

Configuration

Environment Variables

Variable            Required     Description
OPENAI_API_KEY      For OpenAI   Your OpenAI API key
ANTHROPIC_API_KEY   Optional     For future Anthropic support

Scraper Settings

In app.py, you can adjust:

  • max_chars: Maximum characters to scrape (default: 5000)
  • Playwright fallback is automatic for JS-heavy sites

Extending This Project

Adding New LLM Providers

  1. Add a new function in app.py following the pattern of summarize_with_openai() or summarize_with_ollama()
  2. Add the provider to the dropdown choices
  3. Update the summarize_website() function to handle the new provider
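The dispatch can be as simple as a dict mapping the dropdown choice to a provider function. A sketch with a stand-in "Echo" provider — in the real app the entries would be summarize_with_openai() and summarize_with_ollama():

```python
def summarize_with_echo(text: str, model: str = "echo-1") -> str:
    """Stand-in provider — a real one would call its API here."""
    return f"[{model}] summary of {len(text)} chars"


# Map dropdown choice -> provider function
PROVIDERS = {
    "Echo": summarize_with_echo,
    # "OpenAI": summarize_with_openai,
    # "Ollama": summarize_with_ollama,
}


def summarize_text(provider: str, text: str) -> str:
    """What summarize_website() does after scraping: route to the chosen provider."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider}")
    return PROVIDERS[provider](text)
```

With this shape, adding a provider is one function plus one dict entry, and the dropdown choices can be built from `list(PROVIDERS)`.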

Customizing the Prompt

Edit the SYSTEM_PROMPT constant in app.py to change how summaries are generated.
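For example, a variant that asks for bullet-point takeaways (hypothetical wording — the actual constant is defined in app.py):

```python
# A possible SYSTEM_PROMPT variant — swap this in to change the summary style.
SYSTEM_PROMPT = (
    "You are an assistant that analyzes the contents of a website "
    "and produces a short summary, ignoring navigation-related text. "
    "Respond in markdown and end with three bullet-point takeaways."
)
```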

Using in Your Own Code

from scraper import fetch_website_contents, fetch_website_links

# Get website text content
content = fetch_website_contents("https://example.com", max_chars=5000)

# Get all links on a page
links = fetch_website_links("https://example.com")

Ideas for Enhancement

The scraper includes a fetch_website_links() function that extracts all links from a page. Here are some ideas to build on this project:

Multi-Page Summarizer

Crawl an entire website by following links and summarize each page. Useful for getting a complete picture of a company or product.

from urllib.parse import urlparse

from scraper import fetch_website_links

# Get all links from the homepage
base = "https://example.com"
links = fetch_website_links(base)

# Filter to same-domain links, then summarize each
domain = urlparse(base).netloc
same_domain = [link for link in links if urlparse(link).netloc == domain]

for link in same_domain[:10]:  # Limit to avoid hammering the server
    summary = summarize_website(link)  # from app.py
    print(f"## {link}\n{summary}\n")

Link Auditor

Find broken links or analyze where a website links to (external dependencies, partners, etc.).
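A sketch of the checking half, pairing fetch_website_links() with a HEAD request per link (stdlib only; a real auditor would also add rate limiting and retries):

```python
import urllib.error
import urllib.request


def link_status(url: str, timeout: float = 10.0):
    """HTTP status code for url, or None if the request failed outright."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-auditor"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # server answered, but with an error code
    except (urllib.error.URLError, TimeoutError):
        return None  # DNS failure, refused connection, or timeout


def is_broken(status) -> bool:
    """Treat network failures and 4xx/5xx responses as broken."""
    return status is None or status >= 400
```

Run `link_status()` over the output of `fetch_website_links()` and report every URL where `is_broken()` is true.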

Research Assistant

Given a topic, scrape multiple sources and generate a consolidated summary with citations.

Content Change Tracker

Periodically scrape a page and use an LLM to identify what changed since last time.
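A stdlib sketch of the non-LLM half: fingerprint each snapshot to detect a change cheaply, and when the hash differs, build a diff to hand to the model:

```python
import difflib
import hashlib


def page_fingerprint(text: str) -> str:
    """Stable hash of a scraped snapshot — compare hashes to detect any change."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def page_diff(old: str, new: str) -> str:
    """Unified diff of two snapshots; feed this to an LLM and ask it to
    describe what changed in plain language."""
    return "\n".join(
        difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="before", tofile="after", lineterm="",
        )
    )
```

Store the last fingerprint between runs; only when it changes do you pay for an LLM call to summarize the diff.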


Troubleshooting

"OpenAI API key not found"

  • Make sure you created the .env file (copy it from .env.example)
  • Check that OPENAI_API_KEY is set correctly
  • Restart the app after changing .env

"Could not connect to Ollama"

  • Make sure the Ollama server is running: ollama serve
  • Ollama listens on http://localhost:11434 by default - check that the app points there

"Model not found" (Ollama)

  • Pull the model first: ollama pull llama3.2
  • Check available models: ollama list

Scraper returns very little content

  • Some sites are heavily JavaScript-based
  • Playwright fallback should handle most cases
  • Very dynamic sites (React SPAs) may still be challenging

Security Notes

  • API Keys: Never commit .env files. The .gitignore is configured to exclude them.
  • Local Only: The Gradio app runs on 127.0.0.1 only, not exposed to your network.
  • No Cloud: This app is designed for local use. No data is sent anywhere except to the LLM provider you choose.
  • Private URLs Blocked: The scraper blocks requests to localhost and private IP ranges.
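That last check amounts to resolving the host and rejecting private ranges. A sketch of the idea — this mirrors the scraper's intent, not its exact code:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_private_url(url: str) -> bool:
    """True if url points at localhost or a private/loopback/link-local address."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # fail closed: refuse hosts we cannot resolve
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

Refusing these URLs stops a pasted link from being used to probe services on your own machine or network (a basic SSRF guard).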

License

MIT License - feel free to use, modify, and distribute.


Contributing

Contributions welcome! Please open an issue or PR.
