A simple crawler for phpBB forums
For now, the project is simply a set of classes. A typical session looks like this:
url = "<your-forum-url>"
foo = BaseCrawler::Forum.new url
foo.crawl_all
tree = foo.root_tree # => :Tree object

The attribute BaseCrawler::Crawler::root_tree contains a tree with all forum and topic links (only the first pages for now).
The base logic lives in the BaseCrawler::Crawler class. Any kind of crawler should inherit from this class and provide a get_data method, which takes a Nokogiri::XML::Node as input and extracts the data. The data should be a Hash that meets the following requirements:
- If you extract children, create an array in data[:children] containing the data required to create each child (the requirement right now is for each child_data to be a Hash with at least the key :url).
- If you add new data (in the form of a Hash) to the current node in the tree, put the new data in data[:new_data].
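
The requirements above can be illustrated with a minimal sketch of a get_data method. Everything project-specific here is an assumption: the class name ExampleForumCrawler, the CSS selector (chosen to match the link classes of phpBB's default prosilver theme), and the :title and :child_count keys are hypothetical and do not come from the project. In real use the class would inherit from BaseCrawler::Crawler; the inheritance is left out so the snippet runs on its own.

```ruby
# Hypothetical crawler sketch. In the real project this would be declared as
# `class ExampleForumCrawler < BaseCrawler::Crawler`.
class ExampleForumCrawler
  # Assumed selector for forum and topic links (phpBB prosilver theme).
  LINK_SELECTOR = 'a.forumtitle, a.topictitle'

  # node: a Nokogiri node (any object responding to #css works here).
  # Returns a Hash that follows the conventions listed above.
  def get_data(node)
    links = node.css(LINK_SELECTOR)

    # Each child is a Hash with at least the key :url.
    # :title is an extra, illustrative key.
    children = links.map do |link|
      { url: link['href'], title: link.text.strip }
    end

    {
      children: children,                        # data for child creation
      new_data: { child_count: children.size }   # data added to the current node
    }
  end
end
```

With a parsed forum index page passed as node, data[:children] would hold one Hash per link found, and data[:new_data] would hold the extra fields to attach to the current node in the tree.
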

- Crawl several forum pages
- Crawl threads
- Crawl multipage-threads
- Stop-Resume working
- Better tree structure
- Polish code
- Better documentation