AragoniteCrawler

An aggressive multithreaded web crawler designed to extract valuable information from the clearnet and Tor hidden services, with a focus on digital footprints and subtle organizational change.

Features

  • Multi-threaded crawling
  • Tor network support (.onion sites)
  • Evasion of DDoS/bot protection pages, with content change detection
  • Metadata extraction
  • Domain filtering
  • Progress saving and resumption
  • Rate limiting and error handling

Installation

# Clone the repository
git clone https://github.com/du82/Crawler.git
cd Crawler

# Build the project
cargo build --release

Usage

Single Domain/URL Crawling

Basic usage:

cargo run <domain_or_url> [max_pages] [max_depth] [max_threads] [batch_size]

Examples:

# Domain only - starts crawling from the root "/"
cargo run example.com 10000 2 4

# Full URL with path - starts crawling from the specified path
cargo run https://example.com/sitemap 10000 2 4

# Start from a specific API endpoint
cargo run https://api.github.com/repos 5000 1 2

# Crawl a news site starting from latest articles
cargo run https://news.ycombinator.com/newest 1000 1 3

# Start from a blog's archive page
cargo run https://blog.example.com/archive 2000 2 4

# Use HTTP instead of HTTPS for specific sites
cargo run http://old-site.com/documents 500 1 1

Parameters

  • domain_or_url: Target website to crawl (required)
    • Domain (e.g., example.com): Starts crawling from the root path /
    • Full URL (e.g., https://example.com/sitemap): Starts crawling from the specified path
    • Supports both HTTP and HTTPS protocols
    • Automatically strips www. prefix
  • max_pages: Maximum number of pages to crawl (default: 10000)
  • max_depth: Maximum crawl depth (default: 2)
  • max_threads: Maximum number of concurrent threads (default: 4, auto-optimized)
  • batch_size: Batch size for processing (default: 20, auto-optimized)
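The positional arguments above map naturally onto a small parsing routine. As a rough illustration using the documented defaults (not the project's actual argument handling), a Rust sketch might look like this:

use std::env;

fn main() {
    // Positional arguments with the documented defaults; names are illustrative.
    let args: Vec<String> = env::args().collect();
    let target = args.get(1).expect("domain_or_url is required");
    let max_pages: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(10_000);
    let max_depth: usize = args.get(3).and_then(|s| s.parse().ok()).unwrap_or(2);
    let max_threads: usize = args.get(4).and_then(|s| s.parse().ok()).unwrap_or(4);
    let batch_size: usize = args.get(5).and_then(|s| s.parse().ok()).unwrap_or(20);

    // A bare domain defaults to the root path "/" and drops a leading "www.";
    // a full URL keeps its scheme and path.
    let start_url = if target.starts_with("http://") || target.starts_with("https://") {
        target.clone()
    } else {
        format!("https://{}/", target.trim_start_matches("www."))
    };

    println!(
        "Crawling {start_url} (pages: {max_pages}, depth: {max_depth}, threads: {max_threads}, batch: {batch_size})"
    );
}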

Tor Support

The crawler automatically detects .onion domains and routes their requests through the Tor SOCKS proxy (port 9050).

# Tor site with domain - starts from root
cargo run opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion 1000 1 2

# Tor site with full URL and path
cargo run http://opbible7nans45sg33cbyeiwqmlp5fu7lklu6jd6f3mivrjeqadco5yd.onion/search 1000 1 2

# Regular clearnet site
cargo run example.com 10000 2 4

# Regular clearnet site with specific starting path
cargo run https://example.com/api/docs 5000 2 4

Requirements: a local Tor daemon running with its SOCKS listener enabled on port 9050 (see the torrc snippet under Configuration).
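As a rough sketch of how .onion routing can be wired up (assuming the reqwest crate with its socks feature; the crawler's real HTTP stack may differ), requests go through the local SOCKS proxy only when the host is an .onion address:

use reqwest::{Client, Proxy};

// Route .onion hosts through Tor's local SOCKS proxy; clearnet hosts connect directly.
// Requires reqwest built with the "socks" feature; illustrative only.
fn client_for(host: &str) -> reqwest::Result<Client> {
    let mut builder = Client::builder();
    if host.ends_with(".onion") {
        // "socks5h" resolves hostnames through the proxy, which .onion addresses require.
        builder = builder.proxy(Proxy::all("socks5h://127.0.0.1:9050")?);
    }
    builder.build()
}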

System Resource Optimization

The crawler automatically detects your system resources and optimizes thread count and batch sizes:

  • CPU Detection: Detects available CPU cores
  • Memory Analysis: Calculates available system memory
  • Automatic Tuning: Adjusts thread count and batch size based on system capabilities
  • Tor Optimization: Uses different optimization profiles for .onion sites (more conservative due to network constraints)

System detection output example:

System detected: 8 CPU cores, 15872MB memory
Standard resources: 8 CPU cores, 15872MB memory, threads: 6, batch size: 30
🎯 Starting crawl from: https://example.com/

For Tor sites:

🧅 Tor site detected - using Tor-optimized settings
Tor-optimized resources: 8 CPU cores, 15872MB memory, threads: 4, batch size: 15
🧅 Pre-populated 17 additional common Tor paths for better crawling
🎯 Starting crawl from: http://example.onion/
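The exact tuning heuristics live in the source; a minimal sketch of the general idea, using only the standard library and assumed (not actual) ratios, could look like this:

use std::thread;

// Illustrative heuristic only; the real crawler's tuning logic and ratios may differ.
fn tuned_settings(is_tor: bool) -> (usize, usize) {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Leave some CPU headroom on clearnet crawls; be more conservative over Tor,
    // where circuit latency rather than core count is usually the bottleneck.
    let threads = if is_tor { (cores / 2).max(1) } else { (cores * 3 / 4).max(1) };
    let batch_size = if is_tor { 15 } else { 30 };
    (threads, batch_size)
}

fn main() {
    let (threads, batch_size) = tuned_settings(false);
    println!("threads: {threads}, batch size: {batch_size}");
}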

Bulk Processing

To crawl multiple domains in one run, create an input.txt file with one domain or URL per line:

example.com
https://news.ycombinator.com/newest
test123.onion
https://github.com/explore
reddit.com/r/programming

Then run without arguments:

cargo run

The crawler will:

  • Process domains concurrently with system-optimized thread limits
  • Automatically handle both domains and full URLs
  • Apply appropriate settings for clearnet vs Tor sites
  • Show progress for each domain being processed

Example output:

Processing 5 domains with 4 concurrent scrapers
Starting domain example.com, active scrapers: 1/4
Starting domain https://news.ycombinator.com/newest, active scrapers: 2/4
...
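A minimal sketch of the bulk input handling (illustrative; the actual concurrency management is in the source) is just reading input.txt line by line and skipping blanks:

use std::fs;

fn main() -> std::io::Result<()> {
    // One domain or full URL per line; blank lines are ignored.
    let targets: Vec<String> = fs::read_to_string("input.txt")?
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    println!("Processing {} domains", targets.len());
    for target in &targets {
        // The real crawler caps the number of concurrent scrapers here.
        println!("Starting domain {target}");
    }
    Ok(())
}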

DDoS + Bot Page Evasion

  • Detects and evades DDoS and bot protection pages on Tor
  • Content hashing to detect real page changes (sketched below)
  • Waits for actual content updates, not just changes to dynamic elements
  • Linear backoff with a 30s timeout and up to 6 retry attempts
  • Tor-specific handling for circuit timeouts
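The core of the evasion logic is the combination of hashing and waiting. A minimal sketch of that idea, assuming the sha2 crate and a hypothetical fetch closure (not the project's actual code):

use sha2::{Digest, Sha256};
use std::{thread, time::Duration};

// Hash the page body so real content changes can be told apart from a
// protection page that only re-renders dynamic elements.
fn content_hash(body: &str) -> sha2::digest::Output<Sha256> {
    Sha256::digest(body.as_bytes())
}

// Linear backoff: up to 6 attempts with a growing wait capped at 30 seconds,
// retrying until the fetched content no longer matches the protection page.
fn wait_for_real_content(fetch: impl Fn() -> String, protection_page: &str) -> Option<String> {
    let blocked = content_hash(protection_page);
    for attempt in 1..=6u64 {
        let body = fetch();
        if content_hash(&body) != blocked {
            return Some(body);
        }
        thread::sleep(Duration::from_secs((5 * attempt).min(30)));
    }
    None
}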

Metadata Extraction

Extracts metadata from each page:

  • Title and description
  • Author information
  • Publication and update years
  • Organization details
  • Jurisdiction information
  • Content categorization
  • Tags and keywords
  • Internal and external links
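A minimal extraction sketch for a few of these fields, assuming the scraper crate (the crawler's real extractor covers the full list above):

use scraper::{Html, Selector};

// Pull the title, meta description, and author from an HTML document, if present.
fn extract_basic_metadata(html: &str) -> (Option<String>, Option<String>, Option<String>) {
    let doc = Html::parse_document(html);
    let title_sel = Selector::parse("title").unwrap();
    let desc_sel = Selector::parse(r#"meta[name="description"]"#).unwrap();
    let author_sel = Selector::parse(r#"meta[name="author"]"#).unwrap();

    let title = doc
        .select(&title_sel)
        .next()
        .map(|el| el.text().collect::<String>());
    let description = doc
        .select(&desc_sel)
        .next()
        .and_then(|el| el.value().attr("content"))
        .map(String::from);
    let author = doc
        .select(&author_sel)
        .next()
        .and_then(|el| el.value().attr("content"))
        .map(String::from);
    (title, description, author)
}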

Output

Data is saved in TOML format, organized by domain and category:

["https://gwern.net/banner"]
title = "Banner Ads Considered Harmful  - Gwern.net"
description = "9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing. "
author = "Gwern Branwen"
year_published = 2017
year_updated = 2020
clearnet = "https://gwern.net/banner"
status = "in progress"
category = "research"
language = "en"
tags = ["ai", "analysis", "bayesian", "blog", "data", "economics", "experiment", "history"]
external-references = ["news.ycombinator.com", "x.com", "arxiv.org", "slatestarcodex.com", "nber.org", "davidreiley.com", "citeseerx.ist.psu.edu", "freakonomics.com", "adage.com", "zinkov.com", "wsj.com", "web.stanford.edu", "web.archive.org", "washingtonpost.com", "uea.ac.uk", "thecorrespondent.com", "tech.okcupid.com", "takimag.com", "storage.googleapis.com", "science.org", "reutersinstitute.politics.ox.ac.uk", "research.mozilla.org", "reddit.com", "radhakrishna.typepad.com", "pdfs.semanticscholar.org"]
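The record shape above maps naturally onto a serde-style struct. A sketch of how such records could be serialized with the toml crate (field names inferred from the example; the real types may differ):

use serde::Serialize;
use std::collections::BTreeMap;

// Field names inferred from the sample record above; illustrative only.
#[derive(Serialize)]
struct PageRecord {
    title: String,
    description: String,
    author: String,
    year_published: u32,
    year_updated: u32,
    clearnet: String,
    status: String,
    category: String,
    language: String,
    tags: Vec<String>,
    #[serde(rename = "external-references")]
    external_references: Vec<String>,
}

// Each page URL becomes a quoted table header, as in the sample output.
fn to_toml(pages: &BTreeMap<String, PageRecord>) -> Result<String, toml::ser::Error> {
    toml::to_string_pretty(pages)
}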

Domain Filtering

The crawler filters out social media sites, CDNs, and academic repositories.
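The filter amounts to a deny list checked against each discovered domain. An illustrative sketch (the entries here are examples, not the crawler's actual lists):

// Example entries only; the crawler ships its own, much longer lists.
const FILTERED_DOMAINS: &[&str] = &["facebook.com", "twitter.com", "cloudflare.com", "arxiv.org"];

fn is_filtered(domain: &str) -> bool {
    FILTERED_DOMAINS
        .iter()
        .any(|f| domain == *f || domain.ends_with(&format!(".{f}")))
}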

Error Handling

  • Retry mechanism with a consecutive-error threshold
  • Progress saved every 10 pages
  • Resumable sessions
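A sketch of how the threshold and checkpointing might fit together (the threshold value here is hypothetical; the 10-page save interval matches the documentation):

// Illustrative only: stop a domain after too many consecutive errors and
// checkpoint every 10 pages so an interrupted session can be resumed.
const MAX_CONSECUTIVE_ERRORS: usize = 5; // hypothetical threshold

struct CrawlState {
    pages_crawled: usize,
    consecutive_errors: usize,
}

impl CrawlState {
    // Returns false once the consecutive-error threshold is reached.
    fn record_result(&mut self, ok: bool) -> bool {
        if ok {
            self.consecutive_errors = 0;
            self.pages_crawled += 1;
            if self.pages_crawled % 10 == 0 {
                // persist visited URLs and the pending queue here
            }
        } else {
            self.consecutive_errors += 1;
        }
        self.consecutive_errors < MAX_CONSECUTIVE_ERRORS
    }
}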

Configuration

Set RUST_LOG=debug for detailed logging.

Tor config (torrc):

SOCKSPort 9050

License

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to https://unlicense.org/
