# gocrawler

A production-ready web crawler written in Go with concurrent workers, robots.txt support, and offline-friendly link rewriting.

## Features
- Concurrent crawling with configurable worker pool
- Respects robots.txt
- Rate limiting to avoid hammering servers
- Configurable max crawl depth
- Rewrites URLs to relative paths for offline viewing
- Downloads HTML pages, CSS, JavaScript, and images
## Installation

Requires Go 1.21+.

```bash
git clone https://github.com/codingdash/gocrawler.git
cd gocrawler
go mod tidy
go build -o gocrawler .
```

## Usage

```bash
gocrawler -url <target-url> [options]
```

| Flag | Default | Description |
|---|---|---|
| `-url` | (required) | Target URL to crawl |
| `-depth` | `5` | Maximum crawl depth |
| `-workers` | `4` | Number of concurrent workers |
| `-output` | `.` | Output directory |
| `-user-agent` | `gocrawler/1.0` | User-Agent string |
| `-rate` | `200ms` | Delay between requests |
## Examples

```bash
# Crawl example.com with default settings
gocrawler -url https://example.com

# Crawl with custom depth and output directory
gocrawler -url https://example.com -depth 3 -output ./downloaded

# Crawl with more workers and a shorter delay between requests
gocrawler -url https://example.com -workers 8 -rate 100ms
```

## Output Structure

Downloaded files are organized by host and path:
```
./example.com/
├── index.html
├── about/
│   └── index.html
├── style.css
├── app.js
└── images/
    └── logo.png
```
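The layout above follows a simple rule: output directory, then host, then URL path, with directory-like URLs getting an `index.html`. A minimal sketch of that mapping using only the standard library (`localPath` is a hypothetical helper, not the crawler's actual function):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// localPath maps a crawled URL to a file path under the output directory,
// mirroring the host-then-path layout shown above. URLs ending in "/" (or
// with an empty path) are stored as index.html in that directory.
func localPath(outputDir, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Path
	if p == "" || p[len(p)-1] == '/' {
		p += "index.html"
	}
	return path.Join(outputDir, u.Host, p), nil
}

func main() {
	for _, raw := range []string{
		"https://example.com/",
		"https://example.com/about/",
		"https://example.com/style.css",
	} {
		p, _ := localPath(".", raw)
		fmt.Println(p)
	}
}
```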
## Testing

```bash
go test ./crawler/ -v
```