Skip to content

Latest commit

 

History

History
93 lines (62 loc) · 2.71 KB

getting_started.md

File metadata and controls

93 lines (62 loc) · 2.71 KB

Getting started with using the MSR2021Replication

Install Mallet

To install the Mallet tool, first is necessary to have the Apache ant build tool installed. Install the binary from https://ant.apache.org/ and follow the manual instructions to configure it.

With ant installed and configured, open the Mallet 2.0.8 folder in the MSR2021Replication repository at mallet/mallet-2.0.8 and run the following command:

$ ant

The Mallet tool will be available to use at mallet/mallet-2.0.8/bin/mallet.

Run with jupyter notebook

The jupyter notebook can be used for StackOverflow datasets. To run the jupyter notebook run the following command on the repository root.

$ jupyter-notebook SO_dataset_analysis.ipynb

Follow the notebook instructions to import the correct dataset and run the scripts.

Run with bash

Install python libraries

Run the following command to install the libraries in the scripts:

$ pip install -r notebook/requirements.txt

Open a Python3 console with the command:

$ python3

Inside the console download the nltk packages by running the following code:

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('word_tokenize')
nltk.download('tokenize')
nltk.download('stem')

Export the variables

Use the following commands to export the variables so scripts can use the correct path to the dataset and output folder.

# Export path to the raw dataset
$ export DATASET_PATH=./tcc/so_questions.csv

# Export the output path
$ export OUTPUT_PATH=./output

# Export the number of topics division
$ export TOPICS_NUM=15

Prepare dataset

To run the MSR2021Replication, it is necessary to run the following script to parse the .csv dataset, clean it and create documents to be used by the mallet tool.

$ python3 prepare_dataset.py

Run the Mallet tool

Run mallet instructions:

$ mallet/mallet-2.0.8/bin/mallet import-dir --input $OUTPUT_PATH/so_data/ --output $OUTPUT_PATH/so.mallet --keep-sequence --remove-stopwords --extra-stopwords extra_stopwords/so.txt
$ mallet/mallet-2.0.8/bin/mallet train-topics --random-seed 100 --input $OUTPUT_PATH/so.mallet --num-topics 15 --optimize-interval 20 --output-state $OUTPUT_PATH/so-topic-state.gz --output-topic-keys $OUTPUT_PATH/so_keys.txt --output-doc-topics $OUTPUT_PATH/so_composition.txt --diagnostics-file $OUTPUT_PATH/so_results/so_diagnostics.xml

Parse mallet results

After running the mallet tool, run the following script.

$ python3 manage_results.py

This script will create document files for each topic containg all the questions related to this topic.