This project explores whether the wording of student comments on ratemyprofessor.com (RMP), with pronouns removed, can be used to determine a professor's gender. The classification algorithms used are Naive Bayes, the Rocchio algorithm, and k-nearest neighbors.
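As a rough illustration of the pronoun-removal idea mentioned above, the sketch below drops gendered pronouns from a comment so that a classifier cannot simply read the gender off them. The pronoun list and the function name `remove_pronouns` are assumptions made for this example; the README does not say which words the project actually strips.

```python
import re

# Assumed pronoun list for illustration only; the project's real list is not
# documented in this README.
GENDERED_PRONOUNS = {
    "he", "him", "his", "himself", "he's",
    "she", "her", "hers", "herself", "she's",
}


def remove_pronouns(comment: str) -> str:
    """Lowercase the comment and drop any token found in GENDERED_PRONOUNS."""
    # Keep apostrophes inside tokens so contractions such as "she's" stay whole.
    tokens = re.findall(r"[a-z']+", comment.lower())
    return " ".join(t for t in tokens if t not in GENDERED_PRONOUNS)


print(remove_pronouns("She explains things clearly and her exams are fair."))
# prints: explains things clearly and exams are fair
```

Punctuation is discarded here as a side effect of the simple regular-expression tokenizer, which may or may not match what the project does.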
This project consists of the following programs and data files:
- a web crawler that crawls RMP pages for 21 universities and writes the results to a single file
- a text parser (parseDataIntoFiles.py) that converts the raw data file into data files of comments for individual professors and produces allData
- a text processor that tokenizes the files in allData, removes stopwords, stems the remaining tokens, and produces preprocessedData
- a program that predicts the gender of professors using Naive Bayes. It uses the leave-one-out strategy: each professor's file is held out in turn, and the classifier is trained on the remaining preprocessed data files
- a program that predicts the gender of professors using Rocchio. It uses the same leave-one-out strategy, training on the remaining preprocessed data files
- a program used to extract top adjectives used by students to describe male and female professors
- a program that preprocesses comment crawler data for the new format, which adds regional CS professors
- Data file containing all the raw text from commentCrawler.py
- Folder of data files containing the comments on each professor's RMP page, parsed by parseDataIntoFiles.py, without preprocessing
- Folder of data files from allData that have been preprocessed by preprocessAllFiles.py
- Folder of data files of additional male professors (removed to balance the number of male and female professors)
- Folder of data files of data from CS professors across NESW regions before processing
- Folder of data files of data from CS professors across NESW regions before processing
- Results of classifications using Naive Bayes from preprocessedData
- Results of different words used for each professor group
- Results of nearest neighbor after boosting word frequencies
- Results of nearest neighbor after boosting word frequencies in Excel
- Results of nearest neighbor without boosting word frequencies in Excel (additional)
- Results of professor ratings from CS departments across NESW regions (for two-sample t-test)
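The leave-one-out strategy described for the Naive Bayes program can be sketched as follows. This is a minimal, standard-library illustration and not the project's own code: it assumes a multinomial Naive Bayes model with add-one (Laplace) smoothing, and the function names, the labels `"F"` and `"M"`, and the placeholder tokens are all made up for the example.

```python
import math
from collections import Counter


def train(docs):
    """Count documents per class and word occurrences per class.

    docs is a list of (tokens, label) pairs.
    """
    class_docs = Counter()
    word_counts = {}
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    return class_docs, word_counts


def predict(tokens, class_docs, word_counts):
    """Return the label with the highest log posterior for the tokens."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(class_docs[label] / total_docs)  # class prior
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        for token in tokens:
            if token in vocab:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(docs):
    """Hold out each document in turn, train on the rest, and score it."""
    correct = 0
    for i, (tokens, label) in enumerate(docs):
        training = docs[:i] + docs[i + 1:]
        class_docs, word_counts = train(training)
        if predict(tokens, class_docs, word_counts) == label:
            correct += 1
    return correct / len(docs)


# Toy data with placeholder tokens; these are not real RMP comments.
toy_docs = [
    (["alpha", "alpha", "beta"], "F"),
    (["alpha", "gamma"], "F"),
    (["alpha", "alpha"], "F"),
    (["delta", "delta", "beta"], "M"),
    (["delta", "gamma"], "M"),
    (["delta", "delta"], "M"),
]

print(leave_one_out_accuracy(toy_docs))
```

In the project, each document would presumably be the stemmed token list from one file in preprocessedData, so every professor is classified by a model that never saw that professor's comments.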
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
A computer with Python 3.7 and the following packages installed:
- pip
- nltk
- selenium
- BeautifulSoup
- Python 3 Virtual Environment (optional)
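One quick way to confirm that the prerequisites above are in place is to check the interpreter version and whether each package can be found. The snippet below is only a convenience sketch; the helper name `missing_modules` is invented here. Note that BeautifulSoup is imported under the module name `bs4`.

```python
import importlib.util
import sys

# Import names for the packages listed above; BeautifulSoup is imported as "bs4".
REQUIRED_MODULES = ["pip", "nltk", "selenium", "bs4"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 7):
    print("Python 3.7 or newer is expected; found", sys.version.split()[0])

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```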
This project is licensed under the MIT License; see the LICENSE.md file for details.