A search engine over the entire Wikipedia corpus, built as a minor project for the Information Retrieval and Extraction course (semester 3) at IIIT Hyderabad.
- WikiIndexer.py: parses the Wikipedia dump and builds the inverted index
- merge.py: merges the intermediate index files and splits them into smaller chunks
- search.py: the main query program; returns results in under 5 seconds
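For a sense of how the merge step can work, here is a minimal sketch of a k-way merge over sorted intermediate index files. The line format (`term:doc1-freq1,doc2-freq2`, one term per line, sorted by term) is an assumed layout for illustration, not necessarily what merge.py writes, and the splitting into smaller chunks is omitted.

```python
import heapq

def merge_index_files(paths, out_path):
    """k-way merge of sorted intermediate index files.

    Assumed (hypothetical) line format: 'term:doc1-f1,doc2-f2',
    with lines sorted by term within each file.
    """
    files = [open(p) for p in paths]
    # heapq.merge streams all files in globally sorted order
    # without loading any single file fully into memory.
    merged = heapq.merge(*files, key=lambda line: line.split(':', 1)[0])
    with open(out_path, 'w') as out:
        current_term, postings = None, []
        for line in merged:
            term, plist = line.rstrip('\n').split(':', 1)
            if term != current_term:
                if current_term is not None:
                    out.write(current_term + ':' + ','.join(postings) + '\n')
                current_term, postings = term, []
            postings.append(plist)
        if current_term is not None:
            out.write(current_term + ':' + ','.join(postings) + '\n')
    for f in files:
        f.close()
```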
The data is the entire Wikipedia corpus, which is fed to the indexer; queries are then answered from the indexed results (link).
- Clone the repository with `git clone {url of the page}`, or click 'Clone or Download' at the top right of the repository page.
- Download the data from here.
- Create a folder named Temp in the WikiPedia Search Engine folder.
- Run the following commands:
python WikiIndexer.py <absolute path of data>
python search.py <absolute path of folder where indexed files are stored>
- Fire queries to get search results.
Queries
Normal search: enter any words, phrases, or sentences.
Advanced search
Supports various field queries. Use the following syntax:
- t: title
- b: body words
- i: infobox
- r: references
- e: category
Example: `t: titlename` searches for articles containing 'titlename' in the title.
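For illustration, a few hypothetical queries (the search terms are made up):

```
sachin tendulkar world cup
t: up i: pixar
b: india e: politics
```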
- Basic stages, in order (a minimal sketch of the pipeline follows this list):
- XML parsing: SAX parser used
- Data preprocessing: NLTK used
- Tokenization
- Case folding
- Stop words removal
- Stemming
- Posting List / Inverted Index Creation
- Optimization
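As a rough guide to these stages, here is a minimal sketch of SAX-based parsing plus NLTK preprocessing feeding an in-memory inverted index. The element names, token pattern, and posting structure are simplified assumptions for illustration, not the exact behaviour of WikiIndexer.py.

```python
import re
import xml.sax
from collections import defaultdict

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
TOKEN_RE = re.compile(r'[a-z0-9]+')

def preprocess(text):
    """Tokenization, case folding, stop-word removal, stemming."""
    tokens = TOKEN_RE.findall(text.lower())   # tokenize + case fold
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

class WikiHandler(xml.sax.ContentHandler):
    """SAX handler building an in-memory inverted index:
    term -> {doc_id: frequency}."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(dict)
        self.doc_id = 0
        self.tag = ''
        self.buffer = []

    def startElement(self, name, attrs):
        self.tag = name
        if name == 'page':
            self.doc_id += 1
            self.buffer = []

    def characters(self, content):
        if self.tag in ('title', 'text'):
            self.buffer.append(content)

    def endElement(self, name):
        self.tag = ''
        if name == 'page':
            for term in preprocess(' '.join(self.buffer)):
                postings = self.index[term]
                postings[self.doc_id] = postings.get(self.doc_id, 0) + 1

# Usage with a (hypothetical) dump file:
# handler = WikiHandler()
# xml.sax.parse('enwiki-latest-pages-articles.xml', handler)
```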
- Support for field queries. Fields include the Title, Infobox, Body, Category, Links, and References of a Wikipedia page. This helps when a user searching for the movie 'Up' wants the page containing the word 'Up' in the title and the word 'Pixar' in the infobox. The field type can be stored along with the word at indexing time, as sketched below.
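One simple way to keep that field information (the encoding below is an assumption for illustration) is to store per-field counts inside each posting:

```python
from collections import defaultdict

# Hypothetical single-letter field tags, mirroring the query syntax above:
# t = title, b = body, i = infobox, r = references, e = category
def add_posting(index, term, doc_id, field):
    """Record one occurrence of `term` in `field` of document `doc_id`.
    After indexing, index['pixar'][42] might look like {'i': 2, 'b': 1}."""
    index.setdefault(term, {}).setdefault(doc_id, defaultdict(int))[field] += 1

index = {}
add_posting(index, 'up', 1, 't')      # 'Up' occurs in the title of doc 1
add_posting(index, 'pixar', 1, 'i')   # 'Pixar' occurs in the infobox of doc 1
```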
- Index size should be less than 1⁄4 of dump size.
- Scalable index construction
- Search Functionality
- Index creation time: less than 60 seconds for the Java and C++ implementations; for Python, less than 150 seconds.
- Inverted index size: under 1/4 of the entire Wikipedia corpus.
- Advanced search as mentioned above.