# Crawler for Sphinx search engine
## What's this?
This is a web crawler that can crawl through any website, download its content, and save it in Sphinx's real-time index.
## Download
https://github.com/olegserkov/crawler/raw/master/crawler-0.1-alpha.zip
## Features
- crawl a website within the specified domain
- extract links from fetched pages
- store pages to disk
- compress page content with the deflate algorithm
- work through a proxy
- retry on errors (HTTP 502-504)
- intelligent behavior on various response codes
- configurable fetch rate, maximum number of retries, maximum number of sequential redirects, maximum number of consecutive failed requests (HTTP 502, 503, 504) before a host is considered unavailable, and maximum document size (chunked transfer encoding is supported)
- revisit pages after a specified time
- configurable User-Agent header and a list of robots.txt user-agents whose rules apply (for example, obey only the rules for "*" and "google")
- filter out duplicate pages
- robots.txt compliance
- `<meta name="robots" ...>` compliance
- save documents to Sphinx's real-time index and keep it updated
- written in Java
## Archive structure
- config - the configuration file is stored here
- data - the default directory for fetched content; it can be changed with the "directory" directive. Fetched pages are stored as `<directory>/<hostname>/<first_char_of_md5(url)>/<second_char_of_md5(url)>/<md5(url)>.html`, for example `data/en.wikipedia.org/c/d/cd6c8f619fe02d9ea5d283cea1dfdefc.html` (see the sketch after this list)
- db - the directory for the crawler's database
- lib - libraries used by the crawler
- crawler.jar - the main executable jar
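
The on-disk layout above can be reproduced from a URL with a few lines of Java. This is only an illustrative sketch, not the crawler's actual code; the class name is hypothetical, and it assumes the MD5 is taken over the raw URL string (whether the crawler normalizes the URL first is not specified).

```java
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class PageLocator {
    // Builds <directory>/<hostname>/<c1>/<c2>/<md5>.html for a given URL,
    // where c1 and c2 are the first two hex characters of the URL's MD5 hash.
    static Path pathFor(String dataDir, String url) throws Exception {
        String host = new URL(url).getHost();
        byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String hash = hex.toString();
        return Paths.get(dataDir, host, hash.substring(0, 1), hash.substring(1, 2), hash + ".html");
    }

    public static void main(String[] args) throws Exception {
        // Prints something like data/en.wikipedia.org/<c1>/<c2>/<md5>.html
        System.out.println(pathFor("data", "http://en.wikipedia.org/wiki/Main_Page"));
    }
}
```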
## Config
A template config file is included in the archive and contains detailed comments for each directive.
## Requirements
Java 7 or higher.
## Usage
Settings for a simple website.
### 1. Unpack the archive
### 2. Set up Sphinx's index:
```
index simple_website
{
    dict                = keywords
    type                = rt
    path                = /var/lib/sphinxsearch/data/simple_website
    rt_field            = title
    rt_field            = content
    rt_field            = url
    rt_field            = host
    rt_attr_string      = url
    rt_attr_string      = host
    rt_attr_string      = file
    rt_attr_timestamp   = created_at
    rt_attr_timestamp   = fetched_at
    docinfo             = extern
    charset_type        = utf-8
    morphology          = libstemmer_en, libstemmer_ru
}
```
Don't forget to restart Sphinx.
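
The crawler writes documents into this index over SphinxQL, Sphinx's MySQL-protocol endpoint (port 9306 below). As an illustration of the schema only, not the crawler's actual code, a single document insert could look like the sketch below; it assumes a MySQL JDBC driver (e.g. Connector/J) is on the classpath and that document ids are assigned by the client.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SphinxInsertSketch {
    public static void main(String[] args) throws Exception {
        // SphinxQL speaks the MySQL wire protocol, so a regular MySQL driver can connect to it.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306", "", "")) {
            String sql = "REPLACE INTO simple_website "
                    + "(id, title, content, url, host, file, created_at, fetched_at) "
                    + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
            try (PreparedStatement st = conn.prepareStatement(sql)) {
                long now = System.currentTimeMillis() / 1000L;
                st.setLong(1, 1L);                               // document id, chosen by the client
                st.setString(2, "Example page title");
                st.setString(3, "Extracted page text ...");
                st.setString(4, "http://simple-website.com/");
                st.setString(5, "simple-website.com");
                st.setString(6, "data/simple-website.com/e/x/example.html"); // placeholder on-disk path
                st.setLong(7, now);                              // created_at
                st.setLong(8, now);                              // fetched_at
                st.executeUpdate();
            }
        }
    }
}
```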
### 3. Set up the crawler (config/config.conf)
```
[simple]
# Enable or disable this section
enabled = true
module = sphinx
sphinxIndex = simple_website
sphinxHost = 127.0.0.1
sphinxPort = 9306
userAgent = "Simple website crawler/1.0"
# Initial URL list; this directive may appear multiple times
initialUrl = http://simple-website.com/
```
### 4. Run `java -jar crawler.jar`
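
Once the crawler is running, you can verify that documents are arriving by querying the RT index over the same SphinxQL port. A minimal sketch, again assuming a MySQL JDBC driver (the plain mysql command-line client pointed at port 9306 works just as well):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SphinxSearchSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306", "", "");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, url, host FROM simple_website WHERE MATCH('some words') LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + "  " + rs.getString("url"));
            }
        }
    }
}
```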
## If you are not using Sphinx
If you are not using Sphinx and just want to fetch a website and save it to disk, simply omit the "module", "sphinxIndex", "sphinxHost", and "sphinxPort" directives, and the crawler will only store pages on disk.
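For example, a Sphinx-free section could look like this (the same directives as above, minus the Sphinx ones):

```
[simple]
enabled = true
userAgent = "Simple website crawler/1.0"
initialUrl = http://simple-website.com/
```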
## Known issues
- For now the crawler only visits pages within the exact domain specified. For example, if initialUrl = http://domain.org, all links to www.domain.org are ignored. Other scoping policies are to be implemented.
- Sometimes the crawler takes a long time to exit on Ctrl-C. This is caused by the embedded database engine, which runs an optimization pass on shutdown; it is recommended to wait for it to finish.
- The minimum fetch rate is 1 page per second. This can be a problem for very slow websites, but it is suitable for most others.
## What about the code?
I will publish it after some refactoring; right now it is too messy.