Rate My Professor Gender Classifier

This project explores whether the wording of student comments on ratemyprofessor.com (RMP), with pronouns removed, can be used to predict a professor's gender. The classification algorithms used are Naive Bayes, the Rocchio algorithm, and K-Nearest Neighbor.
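As a rough illustration of the pronoun-removal step, the sketch below lowercases a comment, splits it into words, and drops gendered pronouns. The word list and the function name `remove_gendered_pronouns` are assumptions made for this example; the README does not say which words the project actually strips or where in the pipeline it does so.

```python
import re
from typing import List

# Assumed list of gendered pronouns; the project's real list is not given here.
GENDERED_PRONOUNS = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}

_WORD = re.compile(r"[a-z]+")


def remove_gendered_pronouns(comment: str) -> List[str]:
    """Lowercase a comment, split it into words, and drop gendered pronouns."""
    tokens = _WORD.findall(comment.lower())
    return [tok for tok in tokens if tok not in GENDERED_PRONOUNS]


print(remove_gendered_pronouns("She explains her slides clearly and he is helpful."))
# ['explains', 'slides', 'clearly', 'and', 'is', 'helpful']
```

Removing these words keeps a classifier from simply reading the label off the pronouns, so any remaining signal has to come from the rest of the vocabulary.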

This project consists of the following programs and data files:

Programs for acquiring data, processing, and classification

A web crawler that crawls RMP pages for 21 universities and writes the output to a single file
- commentCrawler.py
A text parser that converts the raw data file into data files of comments for individual professors
- parseDataIntoFiles.py
- produces allData
A text processor that tokenizes, removes stopwords from, and stems the files in allData, producing preprocessedData
- preprocessAllFiles.py
A program that predicts the gender of professors using Naive Bayes. It uses the 'leave one out' strategy, holding out one file and training on the remaining preprocessed data files
- naiveBayes.py
A program that predicts the gender of professors using Rocchio. It uses the 'leave one out' strategy, holding out one file and training on the remaining preprocessed data files
- rocchio.py
A program that extracts the top adjectives students use to describe male and female professors
- AdjectiveFreq.py
A program that preprocesses comment crawler data in the new format, which includes regional CS professors
- preprocessCommentCrawler.py
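
The 'leave one out' strategy used by the classifiers above can be sketched as follows. This is a minimal, standard-library-only illustration, not the code in naiveBayes.py: it assumes a multinomial Naive Bayes with add-one smoothing, and the function names and the toy token lists are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Doc = Tuple[str, List[str]]  # (gender label, preprocessed tokens for one professor)
Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(docs: List[Doc]) -> Model:
    """Count documents per label and word occurrences per label."""
    label_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = {}
    vocab: Set[str] = set()
    for label, tokens in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_nb(model: Model, tokens: List[str]) -> str:
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(label_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(label_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(docs: List[Doc]) -> float:
    """Hold out each document in turn, train on the rest, and score the guess."""
    correct = 0
    for i, (label, tokens) in enumerate(docs):
        model = train_nb(docs[:i] + docs[i + 1:])
        if predict_nb(model, tokens) == label:
            correct += 1
    return correct / len(docs)


# Toy, made-up data purely to exercise the functions; not project results.
toy_docs = [
    ("F", ["caring", "helpful", "clear"]),
    ("F", ["caring", "organized", "helpful"]),
    ("M", ["funny", "smart", "hard"]),
    ("M", ["funny", "tough", "smart"]),
]
print(leave_one_out_accuracy(toy_docs))
```

Because every professor is tested exactly once against a model that never saw that professor's comments, the resulting accuracy uses all of the data without training and testing on the same file.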

Data and Output Files

Data file containing all the raw text from commentCrawler.py
- commentCrawler.output
Folder of data files containing comments on each professor's RMP page parsed from parseDataIntoFiles.py without preprocessing
- allData
Folder of data files from allData that have been preprocessed by preprocessAllFiles.py
- preprocessedData
Folder of data files of additional male professors (removed to balance the number of male and female professors)
- extraMaleData
Folder of data files from CS professors across the NESW regions, before preprocessing
- commentCrawlerOutput
Folder of data files from CS professors across the NESW regions, after preprocessing
- commentCrawlerOutputPreprocessed
Results of classifications using Naive Bayes from preprocessedData
- naivebayes.output
Results of different words used for each professor group
- difference_analysis
Results of nearest neighbor after boosting word frequencies
- nearestNeighbour.boosted.output
Results of nearest neighbor after boosting word frequencies in Excel
- nearestNeighbour.boosted.output.excel
Results of nearest neighbor without boosting word frequencies in Excel (additional)
- nearestNeighbour.output.excel
Results of professor ratings from CS departments across the NESW regions (for a two-sample t-test)
- prof_ratings.csv
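
For the two-sample t-test mentioned for prof_ratings.csv, a minimal sketch is shown below. The README does not say which variant of the test was used or how the CSV is laid out, so this example assumes Welch's unequal-variance form and takes two plain lists of ratings; the function name `welch_t` and the sample numbers are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up ratings for two groups, only to show the call; not project data.
group_a = [4.5, 3.8, 4.1, 3.9, 4.7]
group_b = [3.9, 4.2, 3.5, 4.0, 3.6]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The t statistic and degrees of freedom would then be compared against a t distribution to get a p-value, for example with a statistics package.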

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

A computer with Python 3.7 and the following packages installed:

pip
nltk
selenium
BeautifulSoup
Python 3 Virtual Environment (optional)

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

Porter Stemmer
