New set of web search crawlers and infrastructure
author Magnus Hagander <magnus@hagander.net>
Sat, 14 Jan 2012 17:57:48 +0000 (18:57 +0100)
committer Magnus Hagander <magnus@hagander.net>
Sat, 21 Jan 2012 14:27:06 +0000 (15:27 +0100)
commit b8a2015be2fc9d2353d7035f22af4813d2e52f5f
tree bb5bb20e9204129ee33332841730dad79f4f06ac
parent 62983855babf30fa4417a75d0b5fe327eec510f6

Replaces the old search code with something that is less of a spaghetti
tangle (i.e. it has not evolved piecemeal over too much time) and more
stable (actual error handling instead of random crashes).

Crawlers are now also multithreaded to deal with higher latency to some
sites.
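
A minimal sketch of the idea, not the code from this commit: worker threads
pull URLs from a shared queue so that one slow site does not stall the rest,
and a failure on one URL is recorded instead of crashing the run. The names
crawl_all, fetch and num_threads are hypothetical and are not taken from
lib/threadwrapper.py or lib/basecrawler.py; the sketch assumes Python 3.

```python
import queue
import threading


def crawl_all(urls, fetch, num_threads=4):
    """Fetch every URL using a small pool of worker threads.

    `fetch` is any callable that takes a URL and returns its content
    (hypothetical interface, injected so it can be swapped or faked).
    Returns (results, errors), two dicts keyed by URL.
    """
    work = queue.Queue()
    for url in urls:
        work.put(url)

    results, errors = {}, {}
    lock = threading.Lock()  # guards both dicts

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                content = fetch(url)
            except Exception as exc:
                # Record the failure and keep going with the next URL.
                with lock:
                    errors[url] = exc
            else:
                with lock:
                    results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

With a real fetch function (for example one built on urllib.request), slow
responses only occupy one worker each, so raising num_threads trades more
concurrent connections for less time spent waiting on high-latency sites.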
20 files changed:
tools/search/crawler/.gitignore [new file with mode: 0644]
tools/search/crawler/lib/__init__.py [new file with mode: 0644]
tools/search/crawler/lib/archives.py [new file with mode: 0644]
tools/search/crawler/lib/basecrawler.py [new file with mode: 0644]
tools/search/crawler/lib/genericsite.py [new file with mode: 0644]
tools/search/crawler/lib/log.py [new file with mode: 0644]
tools/search/crawler/lib/parsers.py [new file with mode: 0644]
tools/search/crawler/lib/sitemapsite.py [new file with mode: 0644]
tools/search/crawler/lib/threadwrapper.py [new file with mode: 0644]
tools/search/crawler/listcrawler.py [new file with mode: 0755]
tools/search/crawler/listsync.py [new file with mode: 0755]
tools/search/crawler/search.ini.sample [new file with mode: 0644]
tools/search/crawler/webcrawler.py [new file with mode: 0755]
tools/search/sql/README [new file with mode: 0644]
tools/search/sql/data.sql [new file with mode: 0644]
tools/search/sql/functions.sql [new file with mode: 0644]
tools/search/sql/indexes.sql [new file with mode: 0644]
tools/search/sql/pg_dict.syn [new file with mode: 0644]
tools/search/sql/schema.sql [new file with mode: 0644]
tools/search/sql/tsearch.sql [new file with mode: 0644]