alansaid/SOParser

SOParser

SOParser is a parser and analyzer of the StackOverflow data.

The package contains a bash script for downloading and extracting the data into a manageable format. In addition, the repository contains topic modeling code in Python.

Execute ./downloadAndPrepareData.sh to download and prepare the data. NB: You will need about 100 GB of free disk space to run the script.
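
Because the download needs roughly 100 GB, it can be worth checking free space before starting. The following is a minimal sketch and is not part of the repository; the function names and the default threshold are assumptions based on the figure above.

```python
import shutil

# Approximate requirement quoted in this README (assumed threshold, not enforced by the script).
REQUIRED_GB = 100


def free_gb(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GiB are free at `path`."""
    return free_gb(path) >= required_gb
```

Calling has_enough_space() from the directory where the data will be stored gives a quick yes/no answer before running the bash script.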

Files and explanations

  1. downloadAndPrepareData.sh - a bash script that downloads and prepares the data. The script creates one file per month (Jan. 2013 - Dec. 2014); each file contains the questions and answers posted in that month.
  2. SOParser.py - a Python script that 1) extracts all users that will be used in the analysis (users with at least 50 posts over 2013-2014), and 2) extracts the questions and answers (title, text excluding code snippets, and tags) written by those users and saves them in data files used in later stages of the analysis. The output is one TSV file per month.
  3. TextProcessor.py - performs tokenization, stemming, TF-IDF, and month-by-month LDA on the files generated by SOParser.py.
  4. TopicComparator.py - Compares topics month-by-month, e.g. compares the topics generated for 2013-05 with the topics generated for 2013-06 and 2013-07, etc.
  5. UserComparator.py - Compares topics month-by-month in terms of users.
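
The filtering step described for SOParser.py can be sketched as follows. This is an illustration under stated assumptions, not the repository's actual code: the post dictionaries and their keys (owner_id, title, body, tags) are hypothetical, and it assumes that code snippets appear in `<pre>` blocks in the post body HTML.

```python
import re
from collections import Counter

MIN_POSTS = 50  # threshold stated in the README

# Assumption: code snippets are wrapped in <pre>...</pre> in the post body HTML.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def active_users(posts, min_posts=MIN_POSTS):
    """Return the set of user ids with at least `min_posts` posts."""
    counts = Counter(post["owner_id"] for post in posts)
    return {user for user, n in counts.items() if n >= min_posts}


def strip_code(body_html):
    """Drop <pre> blocks, remove remaining HTML tags, and collapse whitespace."""
    without_code = PRE_BLOCK.sub(" ", body_html)
    text = TAG.sub(" ", without_code)
    return " ".join(text.split())


def to_tsv_rows(posts, min_posts=MIN_POSTS):
    """Build one tab-separated row per post written by an active user."""
    keep = active_users(posts, min_posts)
    rows = []
    for post in posts:
        if post["owner_id"] in keep:
            fields = [
                str(post["owner_id"]),
                post.get("title", ""),
                strip_code(post["body"]),
                post.get("tags", ""),
            ]
            rows.append("\t".join(fields))
    return rows
```

Writing the rows returned for each month to a separate file would give the one-TSV-per-month layout described above; the column order here is only an example.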

You might need to run nltk.download() to download stopwords.
