This repository contains the code for the paper NounAtlas: Filling the Gap in Nominal Semantic Role Labeling.
The NounAtlas website is out! Check out the latest version of our resource in the Download section.
Our model for joint nominal and verbal SRL will soon be usable through the InVeRo API.
Our dataset for nominal SRL is available on Hugging Face 🤗:
https://huggingface.co/datasets/sapienzanlp/nounatlas_srl_corpus
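If you only need the data, it can be loaded directly with the Hugging Face datasets library. Here is a minimal sketch (the available splits and features may differ, so check the dataset card):

```python
# Minimal sketch: load the NounAtlas SRL corpus from the Hugging Face Hub.
# Assumes the standard `datasets` API; split names and features may differ,
# so check the dataset card before relying on them.
from datasets import load_dataset

dataset = load_dataset("sapienzanlp/nounatlas_srl_corpus")
print(dataset)  # shows the available splits and their features

first_split = next(iter(dataset))
print(dataset[first_split][0])  # peek at one example
```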
This section provides instructions to reproduce the experiments and results described in the paper. Each section below corresponds to a specific part of the paper; refer to it for further details.
- Create a conda environment with Python 3.9 or higher.
- Activate the environment.
- Navigate to the root of the project.
- Install required packages: Open a terminal and run the following command:
pip install -r requirements.txt
WordNet-based synset-to-frame mapping
- Generate link files: Run the following command to generate the "Unambiguous", "Manually-curated", and "Non-existing" link files (they're already present in the repo):
python code_files_nominal_classification/data_preprocessing/build_datasets.py
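The mapping relies on information available directly in WordNet. As a small illustration of the kind of signal involved (not the repository's code), a nominal synset's definition and its derivationally related verbal lemmas can be inspected with NLTK:

```python
# Illustrative only: inspect a nominal synset and its derivationally related
# verbal lemmas with NLTK's WordNet interface (run nltk.download("wordnet")
# once beforehand). The actual link files are produced by build_datasets.py.
from nltk.corpus import wordnet as wn

synset = wn.synset("destruction.n.01")
print(synset.definition())
for lemma in synset.lemmas():
    verbal = [r for r in lemma.derivationally_related_forms()
              if r.synset().pos() == "v"]
    print(lemma.name(), "->", [r.synset().name() for r in verbal])
```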
Ranking frames for unlinked synsets
- Train the Cross-Encoder model: Train the model for the verbal-to-nominal definition grouping task (Section 3.1.3) using the following command:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase train
- Evaluate the model: Evaluate the trained model's performance:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase test --version_name version_0
Replace "version_0" with the version you want to evaluate in the checkpoints_nominal_classification/CrossEncoderClassifier/lightning_logs folder.
- Evaluate top-k accuracy: Evaluate the model's performance on the top-k frame prediction task (Section 3.1.3):
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase predict_test --version_name version_0
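To make the metric concrete, the sketch below shows how top-k accuracy over ranked frame candidates can be computed from a matrix of cross-encoder scores (toy values, not the repository's evaluation code):

```python
# Toy illustration of top-k accuracy over ranked frame candidates.
# scores[i][j] is the cross-encoder score of candidate frame j for predicate i;
# gold[i] is the index of the correct frame. Values below are made up.
import numpy as np

def top_k_accuracy(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    top_k = np.argsort(-scores, axis=1)[:, :k]   # k highest-scoring candidates
    hits = (top_k == gold[:, None]).any(axis=1)  # gold frame among the top k?
    return float(hits.mean())

scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
gold = np.array([1, 1])
print(top_k_accuracy(scores, gold, k=1))  # 0.5
print(top_k_accuracy(scores, gold, k=2))  # 1.0
```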
Manual mapping of unlinked synsets
- Generate predictions for expert annotation: Generate the TSV file with top-5 predictions for manual annotation by the experts:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase predict --version_name version_0
The file is saved to /outputs_nominal_classification/results_CrossEncoderClassifier_{current_timestamp}/results.tsv
- Generate files for expert annotation: Create two files for the annotator in the folder /outputs_nominal_classification/results_linguist/:
- results_for_expert.xlsx, containing, for each predicate, its definition and the top-5 predictions in a human-readable format
- frames_infos.xlsx, containing additional information to aid the annotation process
The command requires the path to the top-5 predictions TSV file generated in the previous step:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase generate --model_results_path <path_to_top5_predictions_tsv_file>
Alternatively, to use our results, omit the --model_results_path parameter:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase generate
Important: running this command will overwrite the repository's "results_for_expert.xlsx" file, which contains the annotations provided by our experts.
- Parse manually annotated results: Parse the annotations from the "results_for_expert.xlsx" file:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase parse
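For a quick look at the annotated spreadsheet before or after parsing, it can be opened with pandas (illustrative only; the parsing logic itself lives in the script above):

```python
# Illustrative only: inspect the annotated spreadsheet with pandas.
# The actual parsing is done by generate_file_for_linguist.py (--pipeline_phase parse).
# Path is relative to the repository root.
import pandas as pd

df = pd.read_excel(
    "outputs_nominal_classification/results_linguist/results_for_expert.xlsx"
)
print(df.columns.tolist())  # see which columns the annotators filled in
print(df.head())
```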
Predicate Nominalization
- Generate data with LLM: To generate the nominalized sentences, you can select an LLM from one of the following three providers:
- OpenAI (set the OPENAI_KEY environment variable in the .env file)
- Google AI Studio (set the GOOGLE_KEY environment variable in the .env file)
- Fireworks AI (set the FIREWORKS_KEY environment variable in the .env file)
Google's Gemini-Pro is the model used in the paper and is selected by default.
python code_files_nominal_srl/llm_nominalization.py --model gemini-pro
Alternatively, you can skip this step by downloading our already processed data: semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True.zip
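If you run the generation yourself, the provider key is read from the .env file. Below is a rough sketch of a single Gemini-Pro call using the google-generativeai SDK; the prompt, few-shot examples, and client used by llm_nominalization.py may differ:

```python
# Rough sketch: one Gemini-Pro call with the GOOGLE_KEY from the .env file.
# Assumes python-dotenv and the google-generativeai SDK; llm_nominalization.py
# may use a different client, prompt format, and generation parameters.
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads the .env file in the current working directory
genai.configure(api_key=os.environ["GOOGLE_KEY"])

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Rewrite the sentence using the noun 'destruction' instead of the verb "
    "'destroy': 'The army destroyed the city.'"
)
print(response.text)
```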
Verbal-to-Nominal Role Propagation
- Run the mapping scripts: Execute the following commands to perform verbal-to-nominal role propagation (Section 4.3):
python code_files_nominal_srl/data_mapping/sentence_mapping_rule.py
python code_files_nominal_srl/data_mapping/sentence_mapping_neural.py
Alternatively, you can skip this step by extracting our already processed files, located in outputs_nominal_srl/mapped_infos/:
- semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True.rar
- semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True_NN.rar
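Conceptually, role propagation transfers each argument of the verbal predicate onto the nominalized paraphrase. The toy sketch below conveys the idea with plain string matching; the rule-based and neural scripts above are considerably more sophisticated:

```python
# Toy illustration of verbal-to-nominal role propagation via string matching.
# Not the repository's algorithm: sentence_mapping_rule.py and
# sentence_mapping_neural.py use far richer alignment strategies.
def propagate_roles(verbal_args: dict[str, str], nominal_sentence: str) -> dict:
    """verbal_args maps a role label to its argument text in the verbal
    sentence; returns the roles whose text also appears verbatim in the
    nominal paraphrase, with their character offsets."""
    propagated = {}
    for role, text in verbal_args.items():
        start = nominal_sentence.lower().find(text.lower())
        if start != -1:
            propagated[role] = (start, start + len(text))
    return propagated

verbal_args = {"ARG0": "the army", "ARG1": "the city"}
nominal = "The destruction of the city by the army was complete."
print(propagate_roles(verbal_args, nominal))
# {'ARG0': (31, 39), 'ARG1': (19, 27)}
```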
- Create the final SRL dataset: Generate the dataset to train an SRL model using the following command:
python code_files_nominal_srl/datasets_smart_splitting.py
Alternatively, you can skip this step and use the final dataset provided in outputs_nominal_srl/dataset/semcorgemini_2_nn/
We employed a pre-trained "roberta-base" model, fine-tuned with the training pipeline from the multi-srl repository; follow its instructions to train and evaluate the model. The process is the same as the one reported for the "Ontonotes" dataset, since our dataset in outputs_nominal_srl/dataset/semcorgemini_2_nn/ follows the same format. To simplify reproducing the results of the paper, we provide several configuration files (in *.yaml format) in the resources/roberta_custom_config directory; using them should yield the same metrics reported in the paper.
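Before launching the multi-srl pipeline, it can be useful to verify that the base encoder is available in your environment. A minimal sketch with the transformers library (training and evaluation themselves are handled by multi-srl):

```python
# Quick sanity check that the "roberta-base" encoder can be loaded and run.
# Training and evaluation are handled by the multi-srl pipeline and the
# provided *.yaml configuration files.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer("The destruction of the city by the army.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```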
This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.