Roadmap
Steps to replicate ESA as described by Gabrilovich et al.:
- keep only articles in the main namespace (that is, discard categories, Wikipedia:, Help:, File:, etc.)
- discard articles in month_year (e.g. January 2002) format
- discard articles in year_in… (e.g. 2002 in literature, 1996 in the Olympics) format
- discard articles whose titles consist only of digits (e.g. 1996, 819382, 42)
- discard articles in list format (e.g. List of … )
- discard articles belonging to a stop category list (provided with source)
- discard articles with inlinks < 5 or outlinks < 5
- discard articles with fewer than 100 unique non-stop words
- use these characters to tokenize (consider these as whitespace for splitting):
String strTokenSplit = " \t\n\r`~!@#$%^&*()_=+|[;]{},./?<>:’\\\"";
- use TITLE_WEIGHT = 4. To apply this, you can append 4 copies of the article title to the article text you will be indexing.
- add anchor text to target articles
- run Porter stemmer 3 times, instead of just once
- apply normalization on TF-IDF scores
- prefer more general articles slightly by using: log(log(TFIDF))
- use WINDOW_SIZE = 100, WINDOW_THRES = 0.005
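
A few of the steps above (the title-based filters, the link and word-count thresholds, the tokenizer delimiters, and the title weight) can be sketched in Java roughly as follows. This is only an illustration under assumptions, not the project's actual code: the class name, method names, and regular expressions are hypothetical, and the namespace and stop-category filters are left out because they depend on data not shown here. The delimiter string is the one listed above, with the ’ character written as the escape \u2019.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from the project source.
class EsaPreprocessSketch {
    // Delimiter characters from the roadmap (\u2019 is the ’ character).
    static final String TOKEN_SPLIT = " \t\n\r`~!@#$%^&*()_=+|[;]{},./?<>:\u2019\\\"";
    static final int TITLE_WEIGHT = 4;

    // Assumed patterns for the title formats named in the roadmap.
    static final Pattern MONTH_YEAR = Pattern.compile(
        "(January|February|March|April|May|June|July|August"
        + "|September|October|November|December) \\d{1,4}");
    static final Pattern YEAR_IN = Pattern.compile("\\d{1,4} in .+");
    static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");
    static final Pattern LIST_OF = Pattern.compile("List of .+");

    // True if the title matches one of the discarded title formats.
    static boolean isDiscardedTitle(String title) {
        return MONTH_YEAR.matcher(title).matches()
            || YEAR_IN.matcher(title).matches()
            || DIGITS_ONLY.matcher(title).matches()
            || LIST_OF.matcher(title).matches();
    }

    // Keep an article only if it has at least 5 inlinks, at least 5 outlinks,
    // and at least 100 unique non-stop words.
    static boolean passesSizeFilters(int inlinks, int outlinks, int uniqueNonStopWords) {
        return inlinks >= 5 && outlinks >= 5 && uniqueNonStopWords >= 100;
    }

    // Split text on every delimiter character; empty tokens are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, TOKEN_SPLIT);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Prepend TITLE_WEIGHT copies of the title to the text to be indexed.
    static String withTitleWeight(String title, String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < TITLE_WEIGHT; i++) {
            sb.append(title).append('\n');
        }
        return sb.append(text).toString();
    }

    public static void main(String[] args) {
        System.out.println(isDiscardedTitle("2002 in literature"));
        System.out.println(passesSizeFilters(12, 3, 250));
        System.out.println(tokenize("Explicit Semantic Analysis (ESA), in short."));
        System.out.println(withTitleWeight("Semantics", "Article body text"));
    }
}
```

Note that the hyphen and the ASCII apostrophe are not in the delimiter string, so under this sketch they stay inside tokens.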