BlankCheng/TinySearchEngine

TinySearchEngine

A tiny search engine for Wikipedia.

  • based on ~0.5M pages
  • covering 41 main topics
  • including >400k sub-categories

Supports

  • 5 different ranking methods
  • field-specific and category-specific search
  • tolerance search and wildcard search
  • display of the category structure
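To illustrate what wildcard and tolerance search mean at the vocabulary level, here is a minimal sketch. It is not the code used in this repository: the function names, the sample vocabulary, and the choice of `fnmatch` and `difflib` from the standard library are all assumptions made for illustration.

```python
import difflib
import fnmatch

# Hypothetical vocabulary; a real index would hold every indexed term.
VOCAB = ["computer", "computing", "compiler", "search", "engine"]


def wildcard_match(pattern, vocabulary):
    """Return the terms matching a shell-style pattern such as 'comput*'."""
    return sorted(t for t in vocabulary if fnmatch.fnmatchcase(t, pattern))


def tolerant_match(term, vocabulary, max_results=3, cutoff=0.8):
    """Return the terms most similar to a possibly misspelled query term."""
    return difflib.get_close_matches(term, vocabulary, n=max_results, cutoff=cutoff)
```

Each query term can be expanded this way into a set of index terms before the postings are looked up. The `cutoff` value here is an arbitrary illustrative choice.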

Data

Wikipedia pages are extracted from the Wikipedia dump. Please download the following data first.

  • The categorylinks table enwiki-latest-categorylinks.sql.gz from link
  • The page table enwiki-latest-page.sql.gz from link
  • The XML file of Wikipedia pages (we chose this one)

Note: it may take ~2 days to load the two SQL files above into a MySQL server.

Usage

Data preprocessing

Before data preprocessing, please update your SQL configuration in tree/mysql_config.json.

Construct category tree structure

python ./tree/parse_tree.py --index-folder=/folder/to/save/results
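As a rough picture of what building a category tree involves, the sketch below turns (parent, child) category pairs into a tree by breadth-first traversal from a root topic. It does not reproduce parse_tree.py: the function name, the edge format, and the `max_depth` parameter are assumptions, and the sample edges are made up.

```python
from collections import defaultdict, deque


def build_category_tree(edges, root, max_depth=3):
    """Build {category: [child categories]} by BFS from root.

    edges is an iterable of (parent, child) pairs, for example as they
    might be derived from the categorylinks and page tables (assumed format).
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    tree = {root: []}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] >= max_depth:
            continue  # do not expand below the depth limit
        for child in children[node]:
            if child in depth:
                continue  # already placed; this also breaks cycles
            depth[child] = depth[node] + 1
            tree[node].append(child)
            tree[child] = []
            queue.append(child)
    return tree
```

Each category is attached to the first parent that reaches it, so the result is a tree even if the input graph contains cycles or categories with several parents.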

Index

(Reference: https://github.com/DhavalTaunk08/Wiki-Search-Engine)

python ./search/english_indexer.py path_to_xml_dump
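For readers unfamiliar with indexing, the following sketch shows the core idea of an inverted index: a map from each term to the pages that contain it, with term frequencies. It is a simplified illustration, not the logic of english_indexer.py; the tokenizer, the function names, and the in-memory page dictionary are assumptions.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_inverted_index(pages):
    """Map each term to {doc_id: term frequency}.

    pages is a {doc_id: text} dictionary (assumed input format).
    """
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return dict(index)
```

An index over a full dump would not fit in a single in-memory dictionary like this; the sketch only shows the term-to-postings structure.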

Search

python ./search/english_search.py --filename queries.txt --num_results 15

The --filename and --num_results options are both optional. By default, --num_results is 10. If you omit --filename, the program prompts you to enter a query on the command line.
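The option handling described above could be sketched with `argparse` as follows. This is an illustration, not the code of english_search.py: the helper names `parse_args` and `read_queries`, the prompt text, and the one-query-per-line file format are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional command-line options."""
    parser = argparse.ArgumentParser(description="Query the index")
    parser.add_argument("--filename", default=None,
                        help="file with queries (assumed: one per line)")
    parser.add_argument("--num_results", type=int, default=10,
                        help="number of results to return per query")
    return parser.parse_args(argv)


def read_queries(args, prompt=input):
    """Read queries from the file if given, else ask for one interactively."""
    if args.filename:
        with open(args.filename, encoding="utf-8") as fh:
            return [line.strip() for line in fh if line.strip()]
    return [prompt("Enter query: ")]
```

Passing `prompt` as a parameter keeps the interactive path easy to exercise without a terminal.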

Web demo

python ./server/main.py

Below are some screenshots of the web demo. See demo.md for more.

[Screenshot: index-page]

[Screenshot: search-main]
