Crawling data from the web and determining the most frequently occurring words down to the 3rd level

Brief documentation

I used the following packages/libraries/APIs:
• Spring Boot framework (mainly for dependency injection) to keep the application simple and readable
• For Elasticsearch: spring-data-elasticsearch
• For logging: the Logback framework
• For testing: spring-boot-starter-test and JUnit
• For multithreading: ThreadPoolTaskExecutor (Spring's wrapper around ThreadPoolExecutor)
• mvn clean verify package to build and deploy the application to a staging or production environment
• The jar file can run on multiple servers behind a load balancer; in this way the application is scalable
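As an illustration of the multithreading bullet above, here is a minimal pool sketch using plain java.util.concurrent (Spring's ThreadPoolTaskExecutor wraps a similar executor). The pool sizes and the echo task are illustrative assumptions, not the actual crawler code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    // Submits one fetch task per URL and collects the results in order.
    public static List<String> runAll(List<String> urls) {
        // Bounded pool: 4 core / 8 max threads, queue of 100 pending pages
        // (sizes are assumptions; tune for the target host)
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));
        List<Future<String>> futures = new ArrayList<>();
        for (String url : urls) {
            // A real task would download and parse the page; here it just echoes.
            futures.add(executor.submit(() -> "fetched " + url));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                results.add(f.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        executor.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of("http://example.com"))); // → [fetched http://example.com]
    }
}
```

A bounded queue like this also gives natural back-pressure when pages are discovered faster than they can be fetched.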
Issue

I ran into a compatibility problem that took a long time to resolve by reading blogs and forums: the latest Elasticsearch (2.3.0) is not compatible with the Spring Boot version we used (1.3.5). I decided to downgrade Elasticsearch, and it worked.

Improvements

Needs to be improved to collect the HTTP headers and links at each depth level.
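The improvement above (collecting the headers and links at each depth) could be carried by a small per-page record like the sketch below; the class and field names are assumptions, not part of the submitted solution:

```java
import java.util.List;
import java.util.Map;

// Holds what the improvement asks for at each depth: the page URL,
// its crawl depth, the HTTP response headers, and the outgoing links.
public class PageRecord {
    final String url;
    final int depth;
    final Map<String, List<String>> headers; // e.g. from HttpURLConnection.getHeaderFields()
    final List<String> links;

    PageRecord(String url, int depth,
               Map<String, List<String>> headers, List<String> links) {
        this.url = url;
        this.depth = depth;
        this.headers = headers;
        this.links = links;
    }
}
```

A record like this maps naturally onto a single Elasticsearch document per crawled page.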
Task_______
Web Crawler in Java
Create a simple web crawler that, starting from a given web page URL, extracts links that relate to some given search keywords. Crawling should be kept inside the top-level domain of the starting URL, up to a given depth level.
Input: Domain name, depth, keywords to match
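The scoping rule (stay inside the starting URL's domain, up to a given depth) can be sketched as a small filter. Equating "top-level domain" with the starting host is one simple interpretation, not the reference solution; a real crawler might compare registrable domains instead:

```java
import java.net.URI;

public class ScopeFilter {
    // True if the candidate link stays on the starting host and within
    // the depth limit. Host equality is a simplifying assumption for
    // "inside the top level domain of the starting URL".
    public static boolean inScope(String startUrl, String candidate,
                                  int depth, int maxDepth) {
        if (depth > maxDepth) return false;
        String startHost = URI.create(startUrl).getHost();
        String candHost = URI.create(candidate).getHost();
        return startHost != null && startHost.equalsIgnoreCase(candHost);
    }
}
```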
Requirements:
- Make use of Object oriented principles and design patterns
- Application has to be testable, multi-threaded and scalable
- The results are saved into Elasticsearch: at a minimum they have to contain the page URL, the links themselves, the matched search term, and the HTTP headers
- Application has to be in java
- Application will have some kind of logging functionality
- The crawler is easily configurable
- A GUI is nice but not required; a console application is enough
- Along with the solution, we would like a small document with bullet points (don't spend a lot of time here) describing the main challenges you encountered and the main improvement points you would consider for the solution.
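One possible sketch of the keyword-matching step from the requirements above (not the reference solution; a real crawler would use a proper HTML parser such as jsoup rather than a regex):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkMatcher {
    // Tiny regex-based anchor extractor; keeps links whose href or anchor
    // text contains any search keyword, case-insensitively.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]+href=\"([^\"]+)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> matchingLinks(String html, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            String text = m.group(2);
            for (String kw : keywords) {
                if (href.toLowerCase().contains(kw.toLowerCase())
                        || text.toLowerCase().contains(kw.toLowerCase())) {
                    hits.add(href);
                    break; // one match is enough to keep the link
                }
            }
        }
        return hits;
    }
}
```

The matched keyword (and not just the link) would also need to be recorded to satisfy the "matched search term" field in Elasticsearch.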