Search engine for tilde-based websites
Responsible for:
- Discovering + temp storing users
- Discovering + temp storing public websites
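
A minimal sketch of how user discovery might look, assuming user pages live at paths of the form /~username/ and are linked from a server's index page. The names USER_LINK and discover_users are hypothetical and not taken from this repo.

```python
import re

# Assumption: each user's site sits under /~username/ and is linked via href.
USER_LINK = re.compile(r'href=["\'][^"\']*?/~([A-Za-z0-9_-]+)')


def discover_users(html):
    """Return the sorted, de-duplicated usernames linked from an HTML page."""
    return sorted(set(USER_LINK.findall(html)))


if __name__ == "__main__":
    page = '<a href="/~alice/">alice</a> <a href="/~bob/index.html">bob</a>'
    print(discover_users(page))
```

The regex only looks inside href attributes, so a stray "~" elsewhere in the page text is ignored.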
Not creepy at all. Responsible for:
- Downloading and creating per-word document-frequency dictionary for tf-idf
- Storing which websites have been tagged with timestamp and hash of content
- Pulling keywords and tagging websites into general tag dictionary
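
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this repo: the function names, the use of SHA-256 for the content hash, and the plain tf * log(N / df) weighting are all guesses at one reasonable way to do it.

```python
import hashlib
import math
import time
from collections import Counter


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a word counts once per document
    return df


def top_keywords(tokens, df, n_docs, k=3):
    """Rank one page's words by tf-idf and return the k highest-scoring."""
    tf = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in tf.items()
        if df.get(word)  # skip words missing from the frequency dictionary
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def tag_record(text, now=None):
    """Timestamp plus content hash, so an unchanged page can be skipped."""
    return {
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "tagged_at": time.time() if now is None else now,
    }
```

Comparing a freshly computed hash against the stored one is enough to tell whether a site needs re-tagging.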
Content explanation
- tokenize_corpus and Porter files - are responsible for cleaning corpus data into stemmed tokens. Needs stopwords.txt file in same dir
- data file - interfaces with numerous text and json files for easy data management
- parse_url file - handles html, including requests and parsing text and metadata
- init_freq_dir file - creates and/or updates document frequency dictionary
- crawl file - goes thru urls and gathers tags + metadata for dictionaries
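
A rough sketch of the cleaning step (lowercase, split into words, drop stopwords, stem). It assumes stopwords.txt holds one word per line, and it takes the stemmer as a parameter instead of reimplementing Porter; load_stopwords and tokenize are hypothetical names, not this repo's API.

```python
import re


def load_stopwords(path="stopwords.txt"):
    """Read one stopword per line (assumed file layout) into a set."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def tokenize(text, stopwords, stem=lambda word: word):
    """Lowercase, keep alphabetic runs, drop stopwords, then stem.

    `stem` defaults to the identity function; a Porter stemmer would be
    passed in here.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in stopwords]
```

Passing the stemmer in keeps the tokenizer testable on its own and leaves the choice of stemming algorithm to the caller.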
This document last updated: Jul 20 2020