This is our final project for CS505 (Spring 2021), directed by Professor Derry Wijaya.
Our code is in the "Codes" folder. We perform lexical complexity prediction for single words.
The training and test data come from the CompLex dataset, available at https://github.com/MMU-TDMLab/CompLex
More specifically, we use the "lcp_single_train.tsv" in the "train" folder as the training data, and "lcp_single_test.tsv" in the "test-labels" folder as the test data.
The "bibleWEB" folder and the "CRAFTtxt" folder contain additional training data used by cs505project_toucham.ipynb in the "Codes" folder.
Below is a copy of the task description from that link:
This site holds the data associated with the paper: CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data
available at: https://www.aclweb.org/anthology/2020.readi-1.9/
This data is released under CC-BY: https://creativecommons.org/licenses/by/4.0/
We ran a shared task on Lexical Complexity Prediction via SemEval using this data. The data was released according to the following schedule:
- Trial data available: July 31, 2020
- Training data available: September 4, 2020
- Test data available/Evaluation starts: January 11, 2021
See the task website for further information: https://sites.google.com/view/lcpsharedtask2021
See the CodaLab site to view submissions: https://competitions.codalab.org/competitions/27420
The script evaluate.py is the same evaluation script that was used in the shared task. You can run it by calling "python evaluate.py <results_path>/res/ <reference_path>/ref/", where res and ref are directories containing the system's results and the reference labels, respectively.
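For reference, the sketch below only illustrates the kind of comparison such an evaluation performs; it is not the official evaluate.py. It assumes predictions and gold labels are already aligned by instance id, and uses Pearson correlation and mean absolute error as representative metrics.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative only (not the official evaluate.py): compare predicted complexity
# scores against gold labels, assuming both arrays are aligned by instance id.
def score(predictions: np.ndarray, gold: np.ndarray) -> dict:
    pearson, _ = pearsonr(predictions, gold)            # linear correlation with the gold scores
    mae = float(np.mean(np.abs(predictions - gold)))    # mean absolute error
    return {"pearson": float(pearson), "mae": mae}

# Hypothetical usage with dummy scores in the [0, 1] complexity range.
print(score(np.array([0.10, 0.25, 0.40]), np.array([0.12, 0.20, 0.45])))
```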
The trial, training and test data are available. The trial data comprises 99 MWEs (29 bible, 33 biomed and 37 europarl) and 421 single word instances (143 bible, 135 biomed and 143 europarl).
The training data comprises 1,517 MWEs (505 bible, 514 biomed and 498 europarl) and 7,662 single word instances (2,574 bible, 2,576 biomed and 2,512 europarl).
The test data comprises 184 MWEs (66 bible, 53 biomed and 65 europarl) and 917 single word instances (283 bible, 289 biomed and 345 europarl). It is compressed and encrypted using 7zip as above. The password will be released to those registered for the task. The order of the test data has been randomised to prevent systems from overfitting to the data ordering. All submissions should be made via CodaLab as above.
The trial and training data is arranged by token and sorted by complexity (i.e., instances with the same token appear together in groups, and the groups are sorted by the complexity of the lowest scored item). Tokens that appear in one partition will not appear in another partition. We have deliberately included more than one instance of a token where possible to identify places where the context affects the complexity of a word. Consider, for example, the following two sentences from the trial data:
- We now have a proposal on the table which lays down strict emissions values for the next ten years and which simultaneously creates clarity and incentives for technological innovations.
- Mr President, in coordination with other groups, I would like to table an oral amendment concerning the draft bill in the Duma to ignore certain rulings of the European Court of Human Rights.
When "table" is used as a noun in the first sentence (albeit in an abstract sense), it receives a score of 0.01, indicating that it is very easily understood. However, when it is used as a verb in the second, it is scored at 0.23, indicating a word that is more difficult to comprehend.
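Such context-dependent cases can be inspected programmatically. The sketch below, which assumes the same file path and column names as the loading sketch near the top of this README, groups training instances by token and lists the tokens whose annotated complexity varies most across contexts.

```python
import pandas as pd

# Sketch: find tokens whose annotated complexity varies most across contexts.
# Assumes the same path and columns as the loading sketch above.
train = pd.read_csv("train/lcp_single_train.tsv", sep="\t")

stats = train.groupby("token")["complexity"].agg(["count", "min", "max"])
stats = stats[stats["count"] > 1]              # keep tokens seen in more than one context
stats["spread"] = stats["max"] - stats["min"]  # how much complexity varies with context
print(stats.sort_values("spread", ascending=False).head(10))
```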
The data comes from three sources: biblical text, biomedical articles and proceedings of the European Parliament. These sources were selected as they contain a natural mixture of common language and difficult-to-understand expressions, whilst each contains vastly different domain-specific vocabulary.