hantaniold/GAMEFAQS

This is still a work in progress, although it works quite nicely. Please note that the paths are all hardcoded for the machine I work on. This means you'll have to edit a few pathnames in ClusterCrawler.java, gfCrawlMR.py, and find500s.py: basically, anything that works together with Hadoop.

Yes, this means you'll have to recompile the Java files (only ClusterCrawler, though). To do that, just use Eclipse and point it at your Hadoop jars.
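
For example, the kind of hardcoded value you're hunting for looks something like the lines below. The variable names and paths here are made up for illustration; check the scripts for the real ones.

    # hypothetical names and paths -- the real variables in gfCrawlMR.py differ;
    # the point is that values like these must match your machine, not mine
    HADOOP_BIN = "/home/me/hadoop/bin/hadoop"   # edit to your Hadoop install
    OUTPUT_DIR = "/home/me/gamefaqs/output"     # edit to your output location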

Requirements:

For running gfCrawlMR.py: Beautiful Soup 3.2.0 and urllib2 (a quick import check follows this list)
If you want to download in parallel: access to some sort of cluster to run Hadoop on
If you run on a cluster, I recommend having some sort of cleanup tool for logfiles; they aren't very useful in this case, and they end up taking obnoxious amounts of space if you run 100,000s of reduce tasks.
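
To check the Python dependencies are importable, a snippet like this should run cleanly (note that BeautifulSoup 3.x imports as a top-level module, unlike the later bs4):

    # quick dependency sanity check (Python 2)
    import urllib2
    from BeautifulSoup import BeautifulSoup  # BS3-style import
    print "dependencies look OK"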

------


This is code for a Python/Java-based web crawler of the gamefaqs.com message boards.

Use boardLinkGen.py to get the links to the boards. You must manually check how many pages of games there are for each system.

chunkify.py then looks at all of these links and creates a link for every page within. This is needed for MapReduce to work, since each MapReduce task downloads one page. It also helps with fault tolerance: recovering from a task that dies after downloading one page of topics is far cheaper than from one that downloads 499 pages and dies on the last.
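
Conceptually, the expansion chunkify.py performs amounts to the sketch below. The input format and the pagination parameter are assumptions for illustration; the real script consumes whatever boardLinkGen.py actually emits.

    # Sketch of the chunkify idea: emit one output line per topic-list page,
    # so each map task downloads exactly one page.
    # Assumes input lines of the form "<board_url> <num_pages>"; the real
    # format (and GameFAQs' page parameter) may differ.
    import sys

    for line in sys.stdin:
        board_url, num_pages = line.split()
        for page in range(int(num_pages)):
            print "%s?page=%d" % (board_url, page)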

You can just run clusterwebmapper.jar for the MapReduce job, with the input links (from chunkify.py), an output directory, and the number of reducers (I don't recommend more than 5 per machine, YMMV). I'll add the source for clusterwebmapper.jar later, because right now it won't work unless you have the exact path I hardcoded into it.
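
The invocation is the standard Hadoop one; something along these lines, with the placeholders filled in for your setup:

    hadoop jar clusterwebmapper.jar <links file from chunkify.py> <output dir> <number of reducers>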

gfCrawlMR.py does the downloading. You can test with it manually or let MapReduce use it. The output from chunkify.py is exactly the input gfCrawlMR.py expects; this is important.
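
The heart of the download step is just urllib2 plus BeautifulSoup, roughly as in the sketch below (one page URL per stdin line, Hadoop-streaming style). This is a simplified stand-in, not the script itself; the real gfCrawlMR.py does more (parsing topics, output formatting, etc.).

    # Conceptual sketch of the per-page download step (Python 2, BeautifulSoup 3).
    import sys
    import urllib2
    from BeautifulSoup import BeautifulSoup

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        try:
            html = urllib2.urlopen(url, timeout=30).read()
        except urllib2.URLError:
            # note the failure; these are the links recovery has to re-queue
            print >> sys.stderr, "FAILED\t" + url
            continue
        soup = BeautifulSoup(html)
        title = soup.title.string if soup.title else ""
        # tab-separated key/value output, as Hadoop streaming expects
        print "%s\t%s" % (url, title)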

recoverFromFailure.py can be used when, inevitably, you are downloading thousands of links and the job dies near the end. Given the right inputs, it gets you back a list of the links that still need to be explored.
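
The recovery step boils down to a set difference between the links you queued and the links that actually finished. A minimal sketch, assuming one URL per line in both files (the real script's input formats may differ):

    # Sketch of the recovery idea: queued links minus finished links.
    import sys

    queued   = set(line.strip() for line in open(sys.argv[1]))  # from chunkify.py
    finished = set(line.strip() for line in open(sys.argv[2]))  # from the job output

    for url in sorted(queued - finished):
        print url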

There's about a 1% data loss rate through all of this: mysterious losses in the HDFS reads/writes, etc.

