gocrawler

A production-ready web crawler written in Go with concurrent workers, robots.txt support, and offline-friendly link rewriting.

Features

  • Concurrent crawling with configurable worker pool
  • Respects robots.txt
  • Rate limiting to avoid hammering servers
  • Configurable max crawl depth
  • Rewrites URLs to relative paths for offline viewing
  • Downloads HTML pages, CSS, JavaScript, and images

Setup

Requires Go 1.21+.

git clone https://github.com/codingdash/gocrawler.git
cd gocrawler
go mod tidy
go build -o gocrawler .

Usage

gocrawler -url <target-url> [options]

Flags

| Flag | Default | Description |
|------|---------|-------------|
| `-url` | (required) | Target URL to crawl |
| `-depth` | 5 | Maximum crawl depth |
| `-workers` | 4 | Number of concurrent workers |
| `-output` | `.` | Output directory |
| `-user-agent` | `gocrawler/1.0` | User-Agent string |
| `-rate` | 200ms | Delay between requests |

Examples

# Crawl example.com with default settings
gocrawler -url https://example.com

# Crawl with custom depth and output directory
gocrawler -url https://example.com -depth 3 -output ./downloaded

# Crawl with more workers and faster rate
gocrawler -url https://example.com -workers 8 -rate 100ms

Output

Downloaded files are organized by host and path:

./example.com/
├── index.html
├── about/
│   └── index.html
├── style.css
├── app.js
└── images/
    └── logo.png

Testing

go test ./crawler/ -v
