This repository was archived by the owner on Nov 22, 2017. It is now read-only.

tjake/stormscraper


Storm Scraper

TL;DR: Storm Scraper is an example Storm program. Please do not treat it as production-ready.

Storm Scraper is a simple Storm topology that lets you crawl a website n levels deep. It reads the list of sites to scrape from Cassandra and stores each page's HTML, incoming links, outgoing links, and text.
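To illustrate what "n levels deep" means, here is a minimal sketch of a depth-limited, breadth-first crawl in plain Java. It is not the code from this repository: the class name DepthLimitedCrawl, the crawl method, and the in-memory link map are all hypothetical stand-ins for the real topology, which fetches pages over HTTP and uses Cassandra for storage.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepthLimitedCrawl {

    /**
     * Hypothetical helper: returns every page reachable from the seed within
     * maxDepth link hops, mapped to the depth at which it was first seen.
     * outgoingLinks stands in for fetching a page and extracting its links.
     */
    public static Map<String, Integer> crawl(String seed, int maxDepth,
                                             Map<String, List<String>> outgoingLinks) {
        Map<String, Integer> depthByUrl = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthByUrl.put(seed, 0);
        queue.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthByUrl.get(url);
            if (depth >= maxDepth) {
                continue; // at the depth limit: record the page but do not follow its links
            }
            for (String next : outgoingLinks.getOrDefault(url, List.of())) {
                if (!depthByUrl.containsKey(next)) { // skip pages already seen
                    depthByUrl.put(next, depth + 1);
                    queue.add(next);
                }
            }
        }
        return depthByUrl;
    }

    public static void main(String[] args) {
        // Assumed toy site: "/" links to "/a" and "/b", and so on.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c"),
                "/b", List.of("/"),
                "/c", List.of("/d"));
        System.out.println(crawl("/", 2, site)); // prints {/=0, /a=1, /b=1, /c=2}
    }
}
```

With a limit of 2, the page "/d" is never visited because it sits three hops from the seed, and the back-link from "/b" to "/" is ignored because "/" has already been seen.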

I've only tested this locally.

Settings are in src/main/resources/scraper.properties

To Run:

  • Run Cassandra

  • Create schema

    cqlsh < stormscraper.cql

  • Run storm topology locally

    MAVEN_OPTS=-Xmx1g mvn compile exec:java

Brought to you by @tjake

About

A Storm based web crawler with Cassandra backend
