forum-crawler

A simple crawler for phpBB forums

Usage

For now the project is just a collection of classes. A typical session looks like this:

url = "<your-forum-url>"
foo = BaseCrawler::Forum.new url
foo.crawl_all
tree = foo.root_tree # => :Tree object

The attribute BaseCrawler::Crawler::root_tree contains a tree with all forum and topic links (only the first pages for now).

How it works

The core logic lives in the BaseCrawler::Crawler class. Every crawler should inherit from this class and provide a get_data method, which takes a Nokogiri node (Nokogiri::XML::Node) as input and extracts the data from it. The data must be a Hash that meets the following requirements:

If you extract children, put an array in data[:children] holding the data needed to create each child (for now, each child_data must be a Hash with at least the key :url).
If you add new data (in the form of a Hash) to the current node in the tree, put it in data[:new_data].
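
The two requirements above can be expressed as a small shape check. The sketch below is only an illustration: `valid_crawler_data?` is a hypothetical helper that is not part of forum-crawler, and the URLs and `:title` keys in the sample Hash are made up. It assumes that both `:children` and `:new_data` are optional, which the text above suggests but does not state outright.

```ruby
# Hypothetical helper, not part of forum-crawler: checks that a Hash
# follows the shape described above for the return value of get_data.
def valid_crawler_data?(data)
  return false unless data.is_a?(Hash)

  # :children is treated as optional (an assumption); when present it
  # must be an Array of Hashes, each with at least a :url key.
  children = data.fetch(:children, [])
  return false unless children.is_a?(Array)
  return false unless children.all? { |child| child.is_a?(Hash) && child.key?(:url) }

  # :new_data is treated as optional (an assumption); when present it
  # must be a Hash.
  data.fetch(:new_data, {}).is_a?(Hash)
end

# What a get_data implementation might return for a forum page.
# All values here are invented for illustration.
sample_data = {
  children: [
    { url: "viewtopic.php?t=1", title: "First topic" },
    { url: "viewtopic.php?t=2", title: "Second topic" }
  ],
  new_data: { title: "General discussion" }
}

# A Hash whose only child is missing the required :url key.
bad_data = { children: [{ title: "no url" }] }

puts valid_crawler_data?(sample_data)
puts valid_crawler_data?(bad_data)
```

A check like this could be run on the result of get_data while developing a new crawler subclass, so that a malformed Hash is caught before it reaches the tree-building code.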

TODO

~~Crawl several forum pages~~
~~Crawl threads~~
~~Crawl multipage-threads~~
Stop-Resume working
Better tree structure
Polish code
Better documentation
