hantaniold/GAMEFAQS

This is still a work in progress, although it works quite nicely. Please note that the paths are all hardcoded for the machine I work on. This means you'll have to edit a few pathnames in ClusterCrawler.java, gfCrawlMR.py, and find500s.py: basically, anything that works together with Hadoop.

Yes, this means you'll have to recompile the Java files (only ClusterCrawler, though). To do that, just use Eclipse and point it at your Hadoop jars.
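
For example, the kind of hardcoded value you're hunting for looks something like the lines below. The variable names and paths here are made up for illustration; check the scripts for the real ones.

    # hypothetical names and paths -- the real variables in gfCrawlMR.py differ;
    # the point is that values like these must match your machine, not mine
    HADOOP_BIN = "/home/me/hadoop/bin/hadoop"   # edit to your Hadoop install
    OUTPUT_DIR = "/home/me/gamefaqs/output"     # edit to your output location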

Requirements:

For running gfCrawlMR.py: Beautiful Soup 3.2.0 and urllib2 (a quick import check follows this list)
If you want to download in parallel: access to some sort of cluster to run Hadoop on
If you run on a cluster, I recommend having some sort of cleanup tool for logfiles; they aren't very useful in this case, and they end up taking obnoxious amounts of space if you run 100,000s of reduce tasks.
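
To check the Python dependencies are importable, a snippet like this should run cleanly (note that BeautifulSoup 3.x imports as a top-level module, unlike the later bs4):

    # quick dependency sanity check (Python 2)
    import urllib2
    from BeautifulSoup import BeautifulSoup  # BS3-style import
    print "dependencies look OK"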

------


This is code for a Python/Java-based web crawler of the gamefaqs.com message boards.

Use boardLinkGen.py to get the links to the boards. You must manually check how many pages of games there are for each system.

chunkify.py then looks at all of these links and creates a link for every page within. This is needed for MapReduce to work, since each MapReduce task downloads one page. It also helps with fault tolerance: recovering from a task that dies after downloading one page of topics is far cheaper than from one that downloads 499 pages and dies on the last.
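
Conceptually, the expansion chunkify.py performs amounts to the sketch below. The input format and the pagination parameter are assumptions for illustration; the real script consumes whatever boardLinkGen.py actually emits.

    # Sketch of the chunkify idea: emit one output line per topic-list page,
    # so each map task downloads exactly one page.
    # Assumes input lines of the form "<board_url> <num_pages>"; the real
    # format (and GameFAQs' page parameter) may differ.
    import sys

    for line in sys.stdin:
        board_url, num_pages = line.split()
        for page in range(int(num_pages)):
            print "%s?page=%d" % (board_url, page)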

You can just run clusterwebmapper.jar for the MapReduce job, with the input links (from chunkify.py), an output directory, and the number of reducers (I don't recommend more than 5 per machine, YMMV). I'll add the source for clusterwebmapper.jar later, because right now it won't work unless you have the exact path I hardcoded into it.
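
The invocation is the standard Hadoop one; something along these lines, with the placeholders filled in for your setup:

    hadoop jar clusterwebmapper.jar <links file from chunkify.py> <output dir> <number of reducers>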

gfCrawlMR.py does the downloading. You can test with it manually or let MapReduce use it. The output from chunkify.py is exactly the input gfCrawlMR.py expects; this is important.
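
The heart of the download step is just urllib2 plus BeautifulSoup, roughly as in the sketch below (one page URL per stdin line, Hadoop-streaming style). This is a simplified stand-in, not the script itself; the real gfCrawlMR.py does more (parsing topics, output formatting, etc.).

    # Conceptual sketch of the per-page download step (Python 2, BeautifulSoup 3).
    import sys
    import urllib2
    from BeautifulSoup import BeautifulSoup

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        try:
            html = urllib2.urlopen(url, timeout=30).read()
        except urllib2.URLError:
            # note the failure; these are the links recovery has to re-queue
            print >> sys.stderr, "FAILED\t" + url
            continue
        soup = BeautifulSoup(html)
        title = soup.title.string if soup.title else ""
        # tab-separated key/value output, as Hadoop streaming expects
        print "%s\t%s" % (url, title)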

recoverFromFailure.py can be used when, inevitably, you are downloading thousands of links and the job dies near the end. Given the right inputs, it gets you back a list of the links that still need to be explored.
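
The recovery step boils down to a set difference between the links you queued and the links that actually finished. A minimal sketch, assuming one URL per line in both files (the real script's input formats may differ):

    # Sketch of the recovery idea: queued links minus finished links.
    import sys

    queued   = set(line.strip() for line in open(sys.argv[1]))  # from chunkify.py
    finished = set(line.strip() for line in open(sys.argv[2]))  # from the job output

    for url in sorted(queued - finished):
        print url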

There's about a 1% data loss rate through all of this: mysterious losses in the HDFS reads/writes, etc.

