AragoniteCrawler

An aggressive multithreaded web crawler designed to extract valuable information from the clearnet and Tor hidden services, with a focus on digital footprints and subtle organizational change.

Features

  • Multi-threaded crawling
  • Tor network support (.onion sites)
  • Evasion of DDoS/bot protection pages, with content change detection
  • Metadata extraction
  • Domain filtering
  • Progress saving and resumption
  • Rate limiting and error handling

Installation

# Clone the repository
git clone https://github.com/du82/Crawler.git
cd Crawler

# Build the project
cargo build --release

Usage

Single Domain/URL Crawling

Basic usage:

cargo run <domain_or_url> [max_pages] [max_depth] [max_threads] [batch_size]

Examples:

# Domain only - starts crawling from the root "/"
cargo run example.com 10000 2 4

# Full URL with path - starts crawling from the specified path
cargo run https://example.com/sitemap 10000 2 4

# Start from a specific API endpoint
cargo run https://api.github.com/repos 5000 1 2

# Crawl a news site starting from latest articles
cargo run https://news.ycombinator.com/newest 1000 1 3

# Start from a blog's archive page
cargo run https://blog.example.com/archive 2000 2 4

# Use HTTP instead of HTTPS for specific sites
cargo run http://old-site.com/documents 500 1 1

Parameters

  • domain_or_url: Target website to crawl (required)
    • Domain (e.g., example.com): Starts crawling from the root path /
    • Full URL (e.g., https://example.com/sitemap): Starts crawling from the specified path
    • Supports both HTTP and HTTPS protocols
    • Automatically strips www. prefix
  • max_pages: Maximum number of pages to crawl (default: 10000)
  • max_depth: Maximum crawl depth (default: 2)
  • max_threads: Maximum number of concurrent threads (default: 4, auto-optimized)
  • batch_size: Batch size for processing (default: 20, auto-optimized)
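The positional arguments above map naturally onto a small parsing routine. As a rough illustration using the documented defaults (not the project's actual argument handling), a Rust sketch might look like this:

use std::env;

fn main() {
    // Positional arguments with the documented defaults; names are illustrative.
    let args: Vec<String> = env::args().collect();
    let target = args.get(1).expect("domain_or_url is required");
    let max_pages: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(10_000);
    let max_depth: usize = args.get(3).and_then(|s| s.parse().ok()).unwrap_or(2);
    let max_threads: usize = args.get(4).and_then(|s| s.parse().ok()).unwrap_or(4);
    let batch_size: usize = args.get(5).and_then(|s| s.parse().ok()).unwrap_or(20);

    // A bare domain defaults to the root path "/" and drops a leading "www.";
    // a full URL keeps its scheme and path.
    let start_url = if target.starts_with("http://") || target.starts_with("https://") {
        target.clone()
    } else {
        format!("https://{}/", target.trim_start_matches("www."))
    };

    println!(
        "Crawling {start_url} (pages: {max_pages}, depth: {max_depth}, threads: {max_threads}, batch: {batch_size})"
    );
}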

Tor Support

The crawler automatically detects .onion domains and routes their requests through the Tor SOCKS proxy (port 9050).

# Tor site with domain - starts from root
cargo run opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion 1000 1 2

# Tor site with full URL and path
cargo run http://opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion/search 1000 1 2

# Regular clearnet site
cargo run example.com 10000 2 4

# Regular clearnet site with specific starting path
cargo run https://example.com/api/docs 5000 2 4

Requirements: a local Tor daemon running with its SOCKS listener enabled on port 9050 (see the torrc snippet under Configuration).
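As a rough sketch of how .onion routing can be wired up (assuming the reqwest crate with its socks feature; the crawler's real HTTP stack may differ), requests go through the local SOCKS proxy only when the host is an .onion address:

use reqwest::{Client, Proxy};

// Route .onion hosts through Tor's local SOCKS proxy; clearnet hosts connect directly.
// Requires reqwest built with the "socks" feature; illustrative only.
fn client_for(host: &str) -> reqwest::Result<Client> {
    let mut builder = Client::builder();
    if host.ends_with(".onion") {
        // "socks5h" resolves hostnames through the proxy, which .onion addresses require.
        builder = builder.proxy(Proxy::all("socks5h://127.0.0.1:9050")?);
    }
    builder.build()
}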

System Resource Optimization

The crawler automatically detects your system resources and optimizes thread count and batch sizes:

  • CPU Detection: Detects available CPU cores
  • Memory Analysis: Calculates available system memory
  • Automatic Tuning: Adjusts thread count and batch size based on system capabilities
  • Tor Optimization: Uses different optimization profiles for .onion sites (more conservative due to network constraints)

System detection output example:

System detected: 8 CPU cores, 15872MB memory
Standard resources: 8 CPU cores, 15872MB memory, threads: 6, batch size: 30
🎯 Starting crawl from: https://example.com/

For Tor sites:

🧅 Tor site detected - using Tor-optimized settings
Tor-optimized resources: 8 CPU cores, 15872MB memory, threads: 4, batch size: 15
🧅 Pre-populated 17 additional common Tor paths for better crawling
🎯 Starting crawl from: http://example.onion/
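The exact tuning heuristics live in the source; a minimal sketch of the general idea, using only the standard library and assumed (not actual) ratios, could look like this:

use std::thread;

// Illustrative heuristic only; the real crawler's tuning logic and ratios may differ.
fn tuned_settings(is_tor: bool) -> (usize, usize) {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Leave some CPU headroom on clearnet crawls; be more conservative over Tor,
    // where circuit latency rather than core count is usually the bottleneck.
    let threads = if is_tor { (cores / 2).max(1) } else { (cores * 3 / 4).max(1) };
    let batch_size = if is_tor { 15 } else { 30 };
    (threads, batch_size)
}

fn main() {
    let (threads, batch_size) = tuned_settings(false);
    println!("threads: {threads}, batch size: {batch_size}");
}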

Bulk Processing

To crawl multiple domains in one run, create an input.txt file with one domain or URL per line:

example.com
https://news.ycombinator.com/newest
test123.onion
https://github.com/explore
reddit.com/r/programming

Then run without arguments:

cargo run

The crawler will:

  • Process domains concurrently with system-optimized thread limits
  • Automatically handle both domains and full URLs
  • Apply appropriate settings for clearnet vs Tor sites
  • Show progress for each domain being processed

Example output:

Processing 5 domains with 4 concurrent scrapers
Starting domain example.com, active scrapers: 1/4
Starting domain https://news.ycombinator.com/newest, active scrapers: 2/4
...
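A minimal sketch of the bulk input handling (illustrative; the actual concurrency management is in the source) is just reading input.txt line by line and skipping blanks:

use std::fs;

fn main() -> std::io::Result<()> {
    // One domain or full URL per line; blank lines are ignored.
    let targets: Vec<String> = fs::read_to_string("input.txt")?
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    println!("Processing {} domains", targets.len());
    for target in &targets {
        // The real crawler caps the number of concurrent scrapers here.
        println!("Starting domain {target}");
    }
    Ok(())
}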

DDoS + Bot Page Evasion

  • Detects and evades DDoS and bot protection pages on Tor
  • Content hashing to detect real page changes (sketched below)
  • Waits for actual content updates, not just changes to dynamic elements
  • Linear backoff with a 30s timeout and up to 6 retry attempts
  • Tor-specific handling for circuit timeouts
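The core of the evasion logic is the combination of hashing and waiting. A minimal sketch of that idea, assuming the sha2 crate and a hypothetical fetch closure (not the project's actual code):

use sha2::{Digest, Sha256};
use std::{thread, time::Duration};

// Hash the page body so real content changes can be told apart from a
// protection page that only re-renders dynamic elements.
fn content_hash(body: &str) -> sha2::digest::Output<Sha256> {
    Sha256::digest(body.as_bytes())
}

// Linear backoff: up to 6 attempts with a growing wait capped at 30 seconds,
// retrying until the fetched content no longer matches the protection page.
fn wait_for_real_content(fetch: impl Fn() -> String, protection_page: &str) -> Option<String> {
    let blocked = content_hash(protection_page);
    for attempt in 1..=6u64 {
        let body = fetch();
        if content_hash(&body) != blocked {
            return Some(body);
        }
        thread::sleep(Duration::from_secs((5 * attempt).min(30)));
    }
    None
}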

Metadata Extraction

Extracts metadata from each page:

  • Title and description
  • Author information
  • Publication and update years
  • Organization details
  • Jurisdiction information
  • Content categorization
  • Tags and keywords
  • Internal and external links
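A minimal extraction sketch for a few of these fields, assuming the scraper crate (the crawler's real extractor covers the full list above):

use scraper::{Html, Selector};

// Pull the title, meta description, and author from an HTML document, if present.
fn extract_basic_metadata(html: &str) -> (Option<String>, Option<String>, Option<String>) {
    let doc = Html::parse_document(html);
    let title_sel = Selector::parse("title").unwrap();
    let desc_sel = Selector::parse(r#"meta[name="description"]"#).unwrap();
    let author_sel = Selector::parse(r#"meta[name="author"]"#).unwrap();

    let title = doc
        .select(&title_sel)
        .next()
        .map(|el| el.text().collect::<String>());
    let description = doc
        .select(&desc_sel)
        .next()
        .and_then(|el| el.value().attr("content"))
        .map(String::from);
    let author = doc
        .select(&author_sel)
        .next()
        .and_then(|el| el.value().attr("content"))
        .map(String::from);
    (title, description, author)
}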

Output

Data is saved in TOML format, organized by domain and category:

["https://gwern.net/banner"]
title = "Banner Ads Considered Harmful  - Gwern.net"
description = "9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing. "
author = "Gwern Branwen"
year_published = 2017
year_updated = 2020
clearnet = "https://gwern.net/banner"
status = "in progress"
category = "research"
language = "en"
tags = ["ai", "analysis", "bayesian", "blog", "data", "economics", "experiment", "history"]
external-references = ["news.ycombinator.com", "x.com", "arxiv.org", "slatestarcodex.com", "nber.org", "davidreiley.com", "citeseerx.ist.psu.edu", "freakonomics.com", "adage.com", "zinkov.com", "wsj.com", "web.stanford.edu", "web.archive.org", "washingtonpost.com", "uea.ac.uk", "thecorrespondent.com", "tech.okcupid.com", "takimag.com", "storage.googleapis.com", "science.org", "reutersinstitute.politics.ox.ac.uk", "research.mozilla.org", "reddit.com", "radhakrishna.typepad.com", "pdfs.semanticscholar.org"]
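The record shape above maps naturally onto a serde-style struct. A sketch of how such records could be serialized with the toml crate (field names inferred from the example; the real types may differ):

use serde::Serialize;
use std::collections::BTreeMap;

// Field names inferred from the sample record above; illustrative only.
#[derive(Serialize)]
struct PageRecord {
    title: String,
    description: String,
    author: String,
    year_published: u32,
    year_updated: u32,
    clearnet: String,
    status: String,
    category: String,
    language: String,
    tags: Vec<String>,
    #[serde(rename = "external-references")]
    external_references: Vec<String>,
}

// Each page URL becomes a quoted table header, as in the sample output.
fn to_toml(pages: &BTreeMap<String, PageRecord>) -> Result<String, toml::ser::Error> {
    toml::to_string_pretty(pages)
}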

Domain Filtering

The crawler filters out social media sites, CDNs, and academic repositories.
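The filter amounts to a deny list checked against each discovered domain. An illustrative sketch (the entries here are examples, not the crawler's actual lists):

// Example entries only; the crawler ships its own, much longer lists.
const FILTERED_DOMAINS: &[&str] = &["facebook.com", "twitter.com", "cloudflare.com", "arxiv.org"];

fn is_filtered(domain: &str) -> bool {
    FILTERED_DOMAINS
        .iter()
        .any(|f| domain == *f || domain.ends_with(&format!(".{f}")))
}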

Error Handling

  • Retry mechanism with a consecutive-error threshold
  • Progress saved every 10 pages
  • Resumable sessions
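A sketch of how the threshold and checkpointing might fit together (the threshold value here is hypothetical; the 10-page save interval matches the documentation):

// Illustrative only: stop a domain after too many consecutive errors and
// checkpoint every 10 pages so an interrupted session can be resumed.
const MAX_CONSECUTIVE_ERRORS: usize = 5; // hypothetical threshold

struct CrawlState {
    pages_crawled: usize,
    consecutive_errors: usize,
}

impl CrawlState {
    // Returns false once the consecutive-error threshold is reached.
    fn record_result(&mut self, ok: bool) -> bool {
        if ok {
            self.consecutive_errors = 0;
            self.pages_crawled += 1;
            if self.pages_crawled % 10 == 0 {
                // persist visited URLs and the pending queue here
            }
        } else {
            self.consecutive_errors += 1;
        }
        self.consecutive_errors < MAX_CONSECUTIVE_ERRORS
    }
}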

Configuration

Set RUST_LOG=debug for detailed logging.

Tor config (torrc):

SOCKSPort 9050

License

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to https://unlicense.org/
