An aggressive multithreaded web crawler designed to extract valuable information from the clearnet and Tor hidden services, with a focus on digital footprints and subtle organizational change.
- Multi-threaded crawling
- Tor network support (.onion sites)
- DDoS and bot protection handling with content change detection
- Metadata extraction
- Domain filtering
- Progress saving and resumption
- Rate limiting and error handling
# Clone the repository
git clone [the repo]
# Build the project
cargo build --release
Basic usage:
cargo run <domain_or_url> [max_pages] [max_depth] [max_threads] [batch_size]
Examples:
# domain - starts crawling from root "/"
cargo run example.com 10000 2 4
# Full URL with path - starts crawling from the specified path
cargo run https://example.com/sitemap 10000 2 4
# Start from a specific API endpoint
cargo run https://api.github.com/repos 5000 1 2
# Crawl a news site starting from latest articles
cargo run https://news.ycombinator.com/newest 1000 1 3
# Start from a blog's archive page
cargo run https://blog.example.com/archive 2000 2 4
# Use HTTP instead of HTTPS for specific sites
cargo run http://old-site.com/documents 500 1 1
Parameters:
- domain_or_url: Target website to crawl (required)
  - Domain (e.g., example.com): Starts crawling from the root path /
  - Full URL (e.g., https://example.com/sitemap): Starts crawling from the specified path
  - Supports both HTTP and HTTPS protocols
  - Automatically strips the www. prefix
- max_pages: Maximum number of pages to crawl (default: 10000)
- max_depth: Maximum crawl depth (default: 2)
- max_threads: Maximum number of concurrent threads (default: 4, auto-optimized)
- batch_size: Batch size for processing (default: 20, auto-optimized)
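As a rough illustration of how the positional arguments and their defaults could be handled, here is a minimal sketch using only the standard library. The names `CrawlArgs`, `normalize_target`, and `parse_args` are hypothetical and not taken from the project. The choice of https for bare clearnet domains and http for .onion hosts is an assumption based on the sample output in this README.

```rust
/// Crawl settings, filled with the documented defaults when an argument is missing.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct CrawlArgs {
    start_url: String,
    max_pages: usize,
    max_depth: usize,
    max_threads: usize,
    batch_size: usize,
}

/// Turn a bare domain or a full URL into a start URL.
/// Assumption: bare domains default to https, bare .onion hosts to http.
fn normalize_target(input: &str) -> String {
    let input = input.trim();
    // Split off an explicit scheme if one is present.
    let (scheme, rest) = match input.split_once("://") {
        Some((s, r)) => (Some(s.to_ascii_lowercase()), r),
        None => (None, input),
    };
    // Strip a leading "www." from the host part.
    let rest = rest.strip_prefix("www.").unwrap_or(rest);
    // Separate host from path; a bare domain starts at the root path "/".
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    let scheme = scheme.unwrap_or_else(|| {
        if host.ends_with(".onion") {
            "http".to_string()
        } else {
            "https".to_string()
        }
    });
    format!("{}://{}{}", scheme, host, path)
}

/// Parse `<domain_or_url> [max_pages] [max_depth] [max_threads] [batch_size]`.
/// Returns None when the required first argument is missing.
fn parse_args(args: &[String]) -> Option<CrawlArgs> {
    let target = args.first()?;
    // Read the positional argument at `idx`, falling back to `default`
    // when it is absent or not a number.
    let num = |idx: usize, default: usize| -> usize {
        args.get(idx).and_then(|s| s.parse().ok()).unwrap_or(default)
    };
    Some(CrawlArgs {
        start_url: normalize_target(target),
        max_pages: num(1, 10_000),
        max_depth: num(2, 2),
        max_threads: num(3, 4),
        batch_size: num(4, 20),
    })
}
```

In a binary, the argument slice would typically come from `std::env::args().skip(1).collect::<Vec<String>>()`.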
The crawler automatically detects .onion domains and routes them through the Tor SOCKS proxy on port 9050.
# Tor site with domain - starts from root
cargo run opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion 1000 1 2
# Tor site with full URL and path
cargo run http://opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion/search 1000 1 2
# Regular clearnet site
cargo run example.com 10000 2 4
# Regular clearnet site with specific starting path
cargo run https://example.com/api/docs 5000 2 4
Requirements: Tor running on port 9050 with SOCKS enabled.
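The .onion detection described above can be sketched as a host check that decides whether a request goes through the proxy. The function names are hypothetical, and the proxy address assumes Tor is listening on localhost. The `socks5h` scheme is the common convention (used by curl and several HTTP clients) for asking the proxy to resolve hostnames, which .onion addresses require.

```rust
/// Extract the host from a bare domain or a URL (scheme, path, and port removed).
fn host_of(target: &str) -> &str {
    let rest = target.split_once("://").map_or(target, |(_, r)| r);
    let host = rest.split('/').next().unwrap_or(rest);
    host.split(':').next().unwrap_or(host)
}

/// True when the target's host is a Tor hidden service.
fn is_onion(target: &str) -> bool {
    host_of(target).to_ascii_lowercase().ends_with(".onion")
}

/// Proxy URL to use for a target, or None for a direct connection.
/// Assumption: Tor's SOCKS listener is at 127.0.0.1:9050.
fn proxy_for(target: &str) -> Option<&'static str> {
    if is_onion(target) {
        Some("socks5h://127.0.0.1:9050")
    } else {
        None
    }
}
```

Checking the host rather than the whole string avoids treating a clearnet URL such as `https://example.com/a.onion` as a Tor site.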
The crawler automatically detects your system resources and optimizes thread count and batch sizes:
- CPU Detection: Detects available CPU cores
- Memory Analysis: Calculates available system memory
- Automatic Tuning: Adjusts thread count and batch size based on system capabilities
- Tor Optimization: Uses different optimization profiles for .onion sites (more conservative due to network constraints)
System detection output example:
System detected: 8 CPU cores, 15872MB memory
Standard resources: 8 CPU cores, 15872MB memory, threads: 6, batch size: 30
🎯 Starting crawl from: https://example.com/
For Tor sites:
🧅 Tor site detected - using Tor-optimized settings
Tor-optimized resources: 8 CPU cores, 15872MB memory, threads: 4, batch size: 15
🧅 Pre-populated 17 additional common Tor paths for better crawling
🎯 Starting crawl from: http://example.onion/
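The project's actual tuning formula is not documented here, so the following is only a sketch of the idea. `detect_cores` uses the standard library's `available_parallelism`. The ratios in `tune` and the low-memory guard are invented for illustration, chosen so that an 8-core machine reproduces the thread and batch numbers in the sample output above. Memory detection is left to the caller because the standard library has no portable API for it.

```rust
use std::thread;

/// Thread count and batch size chosen for a crawl. (Hypothetical type.)
#[derive(Debug, PartialEq)]
struct Tuning {
    threads: usize,
    batch_size: usize,
}

/// Number of CPU cores available to this process, or 1 if it cannot be determined.
fn detect_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Pick settings from detected resources.
/// All ratios below are assumptions, not the project's real formula.
fn tune(cores: usize, memory_mb: u64, is_tor: bool) -> Tuning {
    // Tor circuits are slow, so the Tor profile uses fewer threads.
    let mut threads = if is_tor { cores / 2 } else { cores * 3 / 4 };
    threads = threads.max(1);
    // Hypothetical guard: halve the thread count on low-memory machines.
    if memory_mb < 2048 {
        threads = (threads / 2).max(1);
    }
    // Fixed small batch for Tor, five pages per thread otherwise.
    let batch_size = if is_tor { 15 } else { threads * 5 };
    Tuning { threads, batch_size }
}
```

A caller would combine the two, for example `tune(detect_cores(), memory_mb, is_onion_target)`, where `memory_mb` comes from a platform-specific source.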
For crawling multiple domains, create an input.txt file with one domain/URL per line:
example.com
https://news.ycombinator.com/newest
test123.onion
https://github.com/explore
reddit.com/r/programming
Then run without arguments:
cargo run
The crawler will:
- Process domains concurrently with system-optimized thread limits
- Automatically handle both domains and full URLs
- Apply appropriate settings for clearnet vs Tor sites
- Show progress for each domain being processed
Example output:
Processing 5 domains with 4 concurrent scrapers
Starting domain example.com, active scrapers: 1/4
Starting domain https://news.ycombinator.com/newest, active scrapers: 2/4
...
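One way to process a list of targets with a fixed cap on concurrent scrapers is a shared work queue drained by a bounded number of threads. The sketch below uses only the standard library. `parse_targets` and `process_all` are hypothetical names, and skipping blank lines is an assumption about how input.txt is read.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Read targets from the text of input.txt: one per line, blank lines skipped.
fn parse_targets(text: &str) -> Vec<String> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Run `work` on every target using at most `max_workers` threads.
/// Results are returned in completion order, not input order.
fn process_all<F>(targets: Vec<String>, max_workers: usize, work: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    let queue = Mutex::new(VecDeque::from(targets));
    let results = Mutex::new(Vec::new());
    // Scoped threads may borrow `queue`, `results`, and `work` directly;
    // the scope waits for every worker before returning.
    thread::scope(|s| {
        for _ in 0..max_workers.max(1) {
            s.spawn(|| loop {
                // Take the next target; the lock is released at the end of
                // this statement, before the (slow) work runs.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(target) => {
                        let out = work(&target);
                        results.lock().unwrap().push(out);
                    }
                    None => break,
                }
            });
        }
    });
    results.into_inner().unwrap()
}
```

With a queue, a worker that finishes a small site immediately picks up the next target, so the number of active scrapers stays at the cap until the list runs out.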
- Detects and evades DDoS and bot protection on Tor
- Content hashing to detect real page changes
- Waits for content updates, not just dynamic elements
- Linear backoff with 30s timeout, 6 retry attempts
- Tor-specific handling for circuit timeouts
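The content-hash check and linear backoff listed above can be sketched as follows. This is an illustration under assumptions, not the project's code: the function names are hypothetical, and since the list does not say how the 30s timeout interacts with the backoff, it is treated here as a cap on any single delay. The `fetch` closure stands in for the real HTTP request.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;
use std::time::Duration;

/// Hash a page body so two fetches can be compared cheaply.
fn content_hash(body: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    body.hash(&mut hasher);
    hasher.finish()
}

/// Delay before retry number `attempt` (1-based): grows linearly, capped at 30s.
fn backoff_delay(attempt: u32, step: Duration) -> Duration {
    (step * attempt).min(Duration::from_secs(30))
}

/// Re-fetch until the body differs from the protection page or attempts run out.
/// Returns the changed body, or None if the content never changed.
fn wait_for_change<F>(
    challenge_body: &str,
    max_attempts: u32,
    step: Duration,
    mut fetch: F,
) -> Option<String>
where
    F: FnMut() -> Option<String>,
{
    let baseline = content_hash(challenge_body);
    for attempt in 1..=max_attempts {
        thread::sleep(backoff_delay(attempt, step));
        // A failed fetch (None) simply consumes one attempt.
        if let Some(body) = fetch() {
            if content_hash(&body) != baseline {
                return Some(body);
            }
        }
    }
    None
}
```

With the documented 6 retry attempts, a call would look like `wait_for_change(&first_body, 6, Duration::from_secs(5), fetch)`, where the 5-second step is a made-up value. `DefaultHasher` is suitable for comparisons within one run; hashes that are saved to disk and compared across runs would need a hash function with a stable, specified output.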
Extracts metadata from each page:
- Title and description
- Author information
- Publication and update years
- Organization details
- Jurisdiction information
- Content categorization
- Tags and keywords
- Internal and external links
Data is saved in TOML format, organized by domain and category:
["https://gwern.net/banner"]
title = "Banner Ads Considered Harmful - Gwern.net"
description = "9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing. "
author = "Gwern Branwen"
year_published = 2017
year_updated = 2020
clearnet = "https://gwern.net/banner"
status = "in progress"
category = "research"
language = "en"
tags = ["ai", "analysis", "bayesian", "blog", "data", "economics", "experiment", "history"]
external-references = ["news.ycombinator.com", "x.com", "arxiv.org", "slatestarcodex.com", "nber.org", "davidreiley.com", "citeseerx.ist.psu.edu", "freakonomics.com", "adage.com", "zinkov.com", "wsj.com", "web.stanford.edu", "web.archive.org", "washingtonpost.com", "uea.ac.uk", "thecorrespondent.com", "tech.okcupid.com", "takimag.com", "storage.googleapis.com", "science.org", "reutersinstitute.politics.ox.ac.uk", "research.mozilla.org", "reddit.com", "radhakrishna.typepad.com", "pdfs.semanticscholar.org"]
Filters out social media, CDNs, and academic repositories.
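A domain filter of this kind usually comes down to a suffix match against a blocklist. The sketch below shows the matching logic only. The three entries in `BLOCKED` are placeholders picked for illustration (one social network, one CDN, one academic repository) and are not the project's actual list.

```rust
/// Placeholder blocklist; the project's real list is not shown in this README.
const BLOCKED: &[&str] = &["facebook.com", "cdnjs.cloudflare.com", "jstor.org"];

/// True when `host` is a blocked domain or a subdomain of one.
fn is_blocked(host: &str) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    BLOCKED.iter().any(|blocked| {
        // Match the domain itself, or any host ending in ".<domain>".
        // Requiring the leading dot keeps "notfacebook.com" from matching.
        host == *blocked || host.ends_with(&format!(".{}", blocked))
    })
}
```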
- Retry mechanism with consecutive error threshold
- Progress saving every 10 pages
- Resumable sessions
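The bookkeeping behind these three points can be sketched as a small state machine that counts pages and consecutive failures. The save interval of 10 pages comes from the list above. The type names and the error threshold of 5 are assumptions made for the example.

```rust
/// Save progress after this many successfully crawled pages (from the list above).
const SAVE_EVERY: usize = 10;
/// Hypothetical threshold: give up after this many failures in a row.
const MAX_CONSECUTIVE_ERRORS: usize = 5;

/// What the crawl loop should do after recording a page result.
#[derive(Debug, PartialEq)]
enum Step {
    Continue,
    SaveProgress,
    Abort,
}

/// Counters that would be written to disk so a session can be resumed.
struct CrawlState {
    pages_done: usize,
    consecutive_errors: usize,
}

impl CrawlState {
    /// Record the outcome of one page fetch and decide the next step.
    fn record(&mut self, ok: bool) -> Step {
        if ok {
            // Any success resets the consecutive error counter.
            self.consecutive_errors = 0;
            self.pages_done += 1;
            if self.pages_done % SAVE_EVERY == 0 {
                Step::SaveProgress
            } else {
                Step::Continue
            }
        } else {
            self.consecutive_errors += 1;
            if self.consecutive_errors >= MAX_CONSECUTIVE_ERRORS {
                Step::Abort
            } else {
                Step::Continue
            }
        }
    }
}
```

Counting consecutive rather than total errors lets a long crawl tolerate scattered failures while still stopping quickly when a site goes down or starts blocking requests.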
Set RUST_LOG=debug for detailed logging.
Tor config (torrc):
SOCKSPort 9050
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to https://unlicense.org/