A Python implementation of the BM25 ranking function for file retrieval.
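Given a query $Q$ containing keywords $q_1, \dots, q_n$, the BM25 score of a document $D$ is

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i, D)$ is the number of times $q_i$ occurs in $D$, $|D|$ is the length of $D$ in words, $\text{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are free parameters. The IDF weight of the query term $q_i$ is, in the standard formulation, computed as:

$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)$$

where $N$ is the total number of documents and $n(q_i)$ is the number of documents containing $q_i$.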
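As an illustration of the scoring formula, here is a minimal sketch in plain Python. It assumes the classic IDF form ln((N - n + 0.5) / (n + 0.5)) and the commonly used defaults k1 = 1.5 and b = 0.75. The function names `idf` and `bm25_score` and their parameters are hypothetical and are not taken from this repository.

```python
import math


def idf(n_docs, doc_freq):
    """Classic BM25 IDF (assumed variant).

    Can be negative for terms that occur in more than half of the documents.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, term_freqs, doc_len, avg_doc_len,
               n_docs, doc_freqs, k1=1.5, b=0.75):
    """Score one document against a query.

    term_freqs: {term: occurrences of the term in this document}
    doc_freqs:  {term: number of documents containing the term}
    """
    # Length normalisation is the same for every term of this document.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for term in query_terms:
        f = term_freqs.get(term, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        score += idf(n_docs, doc_freqs[term]) * f * (k1 + 1) / (f + norm)
    return score


# One query term occurring once in a document of average length.
example = bm25_score(["cat"], {"cat": 1}, doc_len=100, avg_doc_len=100,
                     n_docs=10, doc_freqs={"cat": 1})
```

When the document has exactly the average length and the term occurs once, the fraction reduces to 1, so the score equals the term's IDF.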
There are two main modules:
QueryParser parses the query to produce a list of query terms.
BuildIndex builds an inverted index and computes the scores of the documents according to the BM25 ranking function. Its functions are:
- process_files: processes corpus files to produce a dictionary
- index_one_file & regular_index: map each word to its positions in the corresponding document
- inverted_index: returns a dictionary keyed by word, where each value is another dictionary mapping a filename to the word's positions in that file
- inverse_df: returns a dictionary mapping each word to its IDF
- docLen and avgdocl: calculate the length of each document and the average document length in the collection, respectively
- BM25scores: returns the BM25 scores of the documents