This repo contains the code and experimental results for the D2T Datasets Content Type Profiling.

Note: This repository complements the submitted paper. It will be deleted after the conference's anonymity period is over.
- Train a Multi-Label Content Type classifier with and without Active Learning
  - Without AL: `CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class`
  - With AL: `CUDA_VISIBLE_DEVICES=0 python3 src/al_main.py -dataset mlb -a_class -e_class -do_al -qs qbc -tk 25`
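The `-qs qbc` flag selects query-by-committee as the query strategy. Purely as an illustration (this is a toy sketch, not the repo's implementation), a committee's disagreement on each unlabelled sample can be scored with vote entropy, and the top-k most disputed samples sent for annotation:

```python
import math

def vote_entropy(committee_votes, n_labels):
    """Disagreement for one sample: entropy of the committee's label votes."""
    counts = [0] * n_labels
    for vote in committee_votes:
        counts[vote] += 1
    total = len(committee_votes)
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy

def top_k_disagreement(pool_votes, n_labels, k):
    """Rank unlabelled samples by disagreement; return indices of the top k."""
    return sorted(range(len(pool_votes)),
                  key=lambda i: vote_entropy(pool_votes[i], n_labels),
                  reverse=True)[:k]

# Toy pool: each inner list is one sample's predicted label per committee member.
pool = [[0, 0, 0],   # full agreement -> entropy 0
        [0, 1, 2],   # maximal disagreement
        [1, 1, 2]]   # partial disagreement
print(top_k_disagreement(pool, n_labels=3, k=2))  # -> [1, 2]
```

With `-tk 25`, the 25 highest-entropy samples would be written out for labelling.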
- Plot Content Type Distribution graphs for different datasets: `CUDA_VISIBLE_DEVICES=0 python3 src/plot_res.py -dataset mlb -a_class -e_class -type gold_ns`
- Evaluate the performance of NLG systems' output texts on different metrics: `sh run_eval.sh mlb acc 0`
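The metrics themselves are defined in `eval/` and `run_eval.sh`. Purely to illustrate multi-label scoring (not necessarily what `run_eval.sh` computes for `acc`), here are two standard accuracy variants for multi-label predictions, sketched in plain Python with made-up label sets:

```python
def subset_accuracy(golds, preds):
    """Fraction of samples whose full label set is predicted exactly."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def hamming_accuracy(golds, preds, n_labels):
    """Per-label accuracy, averaged over all samples and labels."""
    correct = 0
    for g, p in zip(golds, preds):
        for lbl in range(n_labels):
            correct += (lbl in g) == (lbl in p)
    return correct / (len(golds) * n_labels)

# Toy gold/predicted content-type label sets for three sentences.
golds = [{0, 2}, {1}, {0, 1, 2}]
preds = [{0, 2}, {1, 2}, {0, 1}]
print(subset_accuracy(golds, preds))        # 1 of 3 exact matches
print(hamming_accuracy(golds, preds, 3))    # 7 of 9 label decisions correct
```

Subset accuracy is the stricter of the two: one wrong label fails the whole sample, while the Hamming-style score still credits the labels that were right.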
- Label Content Type classifier data, and accuracy errors in NLG systems' output texts
  - Use Label-Studio to label the data; its config is saved in `labdata` (Docker needed): ``docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest``
- `sportsett/`: everything used for the SportSett data experiments
  - `sportsett/data/`: contains data/annotations for building the Content-Type classifier
  - `sportsett/data/initial`: contains data for training generation systems
  - `sportsett/eval/`: contains data/annotations from human evaluation of system-generated summaries
- `mlb/`: everything used for the mlb data experiments
- `sumtime/`: everything used for the sumtime data experiments
- `obituary/`: everything used for the obituary data experiments
- `labdata/`: folder to store the Docker data for the labelling app (database and settings)
- `eval/`: contains code for calculating evaluation results
- `src/`: contains the source code
  - `al_utils.py`: contains the functions for active learning
  - `clf_utils.py`: contains the functions for the classifier
  - `bert_utils.py`: contains a plain BERT classifier (fine-tuned on this data)
  - `merge_annotated.py`: merges the annotated JSON file with the already-annotated samples in the `train.tsv` file
  - `al_main.py`: contains the main code for the classifier and active learning
  - `abs_sent.py`: contains the functions for sentence abstraction (using PoS/NER tags)
  - `plot_res.py`: code for plotting different across-dataset graphs
  - `rw_plots.py`: code for plotting graphs specific to RotoWire and SportSett
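The sentence abstraction in `abs_sent.py` replaces surface tokens with category tags. As a toy stand-in (the real code presumably runs an actual PoS/NER tagger, whereas the gazetteer and regex below are hypothetical), the idea looks like this:

```python
import re

# Hypothetical gazetteer standing in for real NER output.
ENTITIES = {"Lakers": "TEAM", "LeBron James": "PLAYER"}

def abstract_sentence(sent):
    """Mask known entities and all numbers with placeholder tags."""
    for name, tag in ENTITIES.items():
        sent = sent.replace(name, tag)
    return re.sub(r"\d+", "NUM", sent)

print(abstract_sentence("LeBron James scored 35 points as the Lakers won 110-102."))
# -> "PLAYER scored NUM points as the TEAM won NUM-NUM."
```

Abstracting away names and numbers lets the classifier focus on the sentence's content type rather than its specific facts.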
- `run_first.sh`: script to run the first time, to create the `top_{k}_unlabelled.txt` file
- `run_active_learning.sh`: script to run after `run_first.sh` has been executed once and the `top_{k}_unlabelled.txt` file has been created
- `plots.sh`: script to plot performance change with change in data
Download the trained models from GDrive and save them in the respective dataset's folder.
1. Annotate some data and create the `train.tsv`/`valid.tsv` files in the `{dataset_name}/data/tsvs` folder.
2. Run the following to create the `top_{k}_unlabelled.txt` file: `python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class`
3. Take the `top_{k}_unlabelled.txt` file from the `{dataset_name}/data/txts` folder and annotate it.
4. Save the annotations in JSON format in the `{dataset_name}/data/jsons` folder, with the name `annotations.json`.
5. Run the following to merge the new annotations with the existing ones in the `{dataset_name}/data/tsvs/train.tsv` file: `python3 src/merge_annotated.py -dataset mlb -not_first_run`
6. Run `src/al_main.py` again to retrain the models on the extended data and create a new `top_{k}_unlabelled.txt` file: `python3 src/al_main.py -qs qbc -tk 25 -dataset mlb -do_al -a_class`
7. Repeat steps 3 to 6 as needed.
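The merge in step 5 can be pictured as follows. This is a minimal sketch that assumes a simplified `train.tsv` schema (`text`/`labels` columns) and a simplified JSON export format; the real `merge_annotated.py` and Label-Studio's actual export format may differ:

```python
import csv
import io
import json

def merge_annotations(train_tsv_text, annotations_json_text):
    """Append newly annotated samples to the existing train.tsv content.

    Assumed (simplified) formats: train.tsv has 'text' and 'labels' columns;
    the JSON export is a list of {"text": ..., "labels": [...]} records.
    """
    rows = list(csv.DictReader(io.StringIO(train_tsv_text), delimiter="\t"))
    seen = {r["text"] for r in rows}
    for rec in json.loads(annotations_json_text):
        if rec["text"] not in seen:  # skip samples annotated in an earlier round
            rows.append({"text": rec["text"], "labels": ",".join(rec["labels"])})
            seen.add(rec["text"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["text", "labels"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

train = "text\tlabels\nThe hosts won.\tB\n"
new = json.dumps([{"text": "He scored 30 points.", "labels": ["W", "B"]},
                  {"text": "The hosts won.", "labels": ["B"]}])
merged = merge_annotations(train, new)
print(merged)
```

Deduplicating on the sentence text keeps already-merged samples from being appended twice when the script is re-run across AL rounds.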
In terms of what files to run, and in what order:

1. `sh run_first.sh`
2. Label the samples from the unlabelled pool using the Label-Studio app (specifically, label the samples in the `data/txts/top_{k}_unlabelled.txt` file, where k is the `TOP_K` in `src/main.py`)
3. `sh run_active_learning.sh`
4. Repeat 2 & 3 until you have labelled all the samples or reached the desired performance

Make sure to run `pip install -r requirements.txt` before running the scripts.
1. Run `run_first.sh`. This will first train models on the test data and then rank the samples from the unlabelled pool based on uncertainty.
   - This will create models in `models/` and ftrs in `ftrs/`.
   - In `data/txts`, a new file `top_{k}_unlabelled.txt` will be created with the top {k} samples from the unlabelled pool.
2. Label the samples from the unlabelled pool (`data/txts/top_{k}_unlabelled.txt`).
   - Save the annotated JSON file in `data/json/annotated.json`.
3. Merge the newly annotated and existing annotated samples, and repeat the process from 1-3.
   - This can be done with `run_active_learning.sh`.
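The ranking step inside `run_first.sh` can be pictured with a minimal single-model sketch. The repo actually uses a QBC committee (`-qs qbc`), and the real file contents may differ; here, least-confidence scoring and made-up probabilities stand in for the committee's uncertainty:

```python
def least_confidence(probs):
    """Uncertainty = 1 - probability of the most likely label."""
    return 1.0 - max(probs)

def write_top_k(pool, probs, k, path):
    """Write the k most uncertain pool sentences to `path`, one per line."""
    ranked = sorted(zip(pool, probs),
                    key=lambda x: least_confidence(x[1]), reverse=True)
    top = [text for text, _ in ranked[:k]]
    with open(path, "w") as f:
        f.write("\n".join(top) + "\n")
    return top

# Toy unlabelled pool with stand-in predicted probabilities per sentence.
pool = ["sent a", "sent b", "sent c"]
probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(write_top_k(pool, probs, k=2, path="top_2_unlabelled.txt"))
# -> ['sent b', 'sent c']
```

The sentences the model is least sure about surface first, which is exactly what makes them worth sending to the annotator next.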
We use Label-Studio for labelling the messages. For that, you need Docker to be installed.

1. Install Docker and start the engine.
2. Run the following command to start the app: ``docker run -it -p 8080:8080 -v `pwd`/labdata:/label-studio/data heartexlabs/label-studio:latest``
3. Go to `http://localhost:8080` and log in with the following credentials: Email: `nlg.ct`, Password: `nlg.ct12345`
4. If no data is present, you will need to upload the data. The following screen should be visible:
   - Follow instructions 5-7 if no file is uploaded.
   - If a file is already uploaded, you will need to upload the data again; for that, follow instructions 8-9.
   - The following screen should be visible if data is already uploaded:
5. Upload the unlabelled data (the `./data/txts/top_{k}_unlabelled.txt` file) by following these steps:
   1. Click the Go to import button.
   2. Either click Upload Files or drag and drop the file into the Drop Area.
   3. Select the List of tasks option for the "Treat CSV/TSV as" question.
   4. Now click the Import button in the top-right corner. You will see the following screen:
6. Now you can start labelling the data.
   - Click either the Label All Tasks button or any of the rows.
   - You will see the sentence for labelling, with the possible labels.
   - Select the labels (more than one can be selected) and click the Submit button.
7. After labelling, click the Export button. Select the JSON option and click the Export button.
   - This will download the file to your local machine (at the preferred download location).
   - Save the file as `data/jsons/annotated.json`. Make sure to remove any existing file from the `data/jsons` location.
8. If some data is already uploaded, you will need to delete the existing data and upload the new data again. The following screen should be visible:

   To delete the existing data, follow these steps:
   1. Click the box in front of ID in the top-left. This will select all the rows.
   2. Click the {k} Tasks button above ID, then click the Delete tasks button from the drop-down menu that appears. Here's a screenshot:
9. After deleting the data, you will see a screen similar to the one shown below:
   - Follow instructions 6-7 to start labelling the data.