A toy search engine that searches the web inside your terminal :p
- Implemented in C++14.
- Crawls webpages progressively starting from seed URL(s).
- Parses the documents and the query, trying to generate more appropriate results.
- Builds an index (hash map) for the parsed documents.
- The crawled documents and index are refreshed periodically.
- Autocompletes the query using a trie, based on the most recently asked queries.
- Maintains two threads, so that refreshing the index and querying can happen simultaneously.
- Returns the most relevant results first, ranked by the harmonic mean of the PageRank (to capture the importance of a webpage) and Okapi BM25 (to capture relevance to the query) ranks.
- Provides query suggestions (only when the input query does not generate any results), based on lists of common incorrect and correct words. Ranks them using an n-gram algorithm and an edit-distance DP that compares two strings.
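As an illustration of two pieces named in the list above, here is a minimal C++14 sketch of a harmonic-mean score combiner and a two-row edit-distance DP. The function names `harmonic_mean` and `edit_distance` are hypothetical and not taken from Wunner's source. Whether the project combines raw scores or rank positions, and how it weights edit operations, is also an assumption here (unit costs, plain Levenshtein distance).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: harmonic mean of two non-negative scores
// (e.g. a PageRank value and a BM25 value). Returns 0 if either score
// is not positive, which also avoids dividing by zero.
double harmonic_mean(double a, double b) {
    if (a <= 0.0 || b <= 0.0) return 0.0;
    return 2.0 * a * b / (a + b);
}

// Hypothetical helper: Levenshtein edit distance between s and t, with
// unit cost for insertion, deletion and substitution (an assumption).
// Only two rows of the DP table are kept in memory.
std::size_t edit_distance(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1), cur(t.size() + 1);

    // Distance from the empty prefix of s to each prefix of t.
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= s.size(); ++i) {
        cur[0] = i;  // delete all i characters of s's prefix
        for (std::size_t j = 1; j <= t.size(); ++j) {
            std::size_t subst = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            std::size_t del = prev[j] + 1;     // drop s[i-1]
            std::size_t ins = cur[j - 1] + 1;  // insert t[j-1]
            cur[j] = std::min({subst, del, ins});
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}
```

Under these assumptions, `edit_distance("kitten", "sitting")` evaluates to 3 and `harmonic_mean(2.0, 6.0)` to 3.0. The harmonic mean favours pages that do well on both measures, since a low value on either one pulls the combined score down.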
Command to run: `wunner_search` (make sure your PWD is the project's root directory)
Add the option `-f` or `--fresh`, as in `wunner_search -f`, to start the search engine afresh (i.e., crawling and indexing again)
- After indexing gets completed, simply type your query and hit Enter to start searching
- To use autocomplete, press Ctrl+G while typing the query, then type the number of the desired result to complete the query (autocomplete is of limited relevance until a web UI is developed)
- Clone (`git clone https://github.com/Anishka0107/Wunner.git`) or download this repository
- `cd Wunner` from where it was cloned/downloaded
- Requirements : GCC (5.0 & above) / Clang (3.4 & above), Boost, Wget
- Two options:
  - Requires `ar`:
    - Run `chmod +x wunner_build.sh`
    - Run `./wunner_build.sh` (note that this defaults to the g++ compiler; append a compiler name to use another, e.g. `./wunner_build.sh clang++`)
  - Requires `cmake` and `make`:
    - Run `mkdir -p build && cd build && cmake .. && make -j$(nproc)`
- Ultimately run `wunner_search` (either directly as `./build/bin/wunner_search`, or after running `export PATH=$PATH:${PWD}/build/bin`)
- Set up Docker on your system (Docker commands need root privileges)
- Build the image using `docker build -t wunner .`
- Run using `docker run -v ${PWD}:/tmp wunner wunner_search` (append `wunner_search` options if required)
- Add simple main() tests for each module
- For terminal-based use, show appropriate output at each step
- Add colours to beautify the output
- Command line options for `res` files
- Add support for complete matching queries
- Add support for relative URLs on webpage
- Implement interaction with robots.txt in crawler
- Build a web UI
- Database instead of files to store objects
- Dynamic linking in build
- Crawler Seed URLs ->
- Erroneous Words ->
- List of Stop Words -> https://www.webconfs.com/stop-words.php