This repository contains the code for the paper NounAtlas: Filling the Gap in Nominal Semantic Role Labeling.
The NounAtlas website is out! Check out the latest version of our resource in the Download section.
Our model for joint nominal and verbal SRL will soon be usable through the InVeRo API.
Our dataset for nominal SRL is available on Hugging Face 🤗:
https://huggingface.co/datasets/sapienzanlp/nounatlas_srl_corpus
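If you only need the data, it can be loaded directly with the Hugging Face datasets library. Here is a minimal sketch (the available splits and features may differ, so check the dataset card):

```python
# Minimal sketch: load the NounAtlas SRL corpus from the Hugging Face Hub.
# Assumes the standard `datasets` API; split names and features may differ,
# so check the dataset card before relying on them.
from datasets import load_dataset

dataset = load_dataset("sapienzanlp/nounatlas_srl_corpus")
print(dataset)  # shows the available splits and their features

first_split = next(iter(dataset))
print(dataset[first_split][0])  # peek at one example
```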
This section provides instructions to reproduce the experiments and results described in the paper. Each section below corresponds to a specific part of the paper; refer to it for further details.
- Create a conda environment with Python 3.9 or higher.
- Activate the environment.
- Navigate to the root of the project.
- Install required packages: Open a terminal and run the following command:
pip install -r requirements.txt
WordNet-based synset-to-frame mapping
- Generate link files: Run the following command to generate the "Unambiguous", "Manually-curated", and "Non-existing" link files (they're already present in the repo):
python code_files_nominal_classification/data_preprocessing/build_datasets.py
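The mapping relies on information available directly in WordNet. As a small illustration of the kind of signal involved (not the repository's code), a nominal synset's definition and its derivationally related verbal lemmas can be inspected with NLTK:

```python
# Illustrative only: inspect a nominal synset and its derivationally related
# verbal lemmas with NLTK's WordNet interface (run nltk.download("wordnet")
# once beforehand). The actual link files are produced by build_datasets.py.
from nltk.corpus import wordnet as wn

synset = wn.synset("destruction.n.01")
print(synset.definition())
for lemma in synset.lemmas():
    verbal = [r for r in lemma.derivationally_related_forms()
              if r.synset().pos() == "v"]
    print(lemma.name(), "->", [r.synset().name() for r in verbal])
```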
Ranking frames for unlinked synsets
- Train the Cross-Encoder model: Train the model for the verbal-to-nominal definition grouping task (Section 3.1.3) using the following command:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase train
- Evaluate the model: Evaluate the trained model's performance:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase test --version_name version_0
Replace "version_0" with the version you want to evaluate in the checkpoints_nominal_classification/CrossEncoderClassifier/lightning_logs folder.
- Evaluate top-k accuracy: Evaluate the model's performance on the top-k frame prediction task (Section 3.1.3):
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase predict_test --version_name version_0
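To make the metric concrete, the sketch below shows how top-k accuracy over ranked frame candidates can be computed from a matrix of cross-encoder scores (toy values, not the repository's evaluation code):

```python
# Toy illustration of top-k accuracy over ranked frame candidates.
# scores[i][j] is the cross-encoder score of candidate frame j for predicate i;
# gold[i] is the index of the correct frame. Values below are made up.
import numpy as np

def top_k_accuracy(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    top_k = np.argsort(-scores, axis=1)[:, :k]   # k highest-scoring candidates
    hits = (top_k == gold[:, None]).any(axis=1)  # gold frame among the top k?
    return float(hits.mean())

scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
gold = np.array([1, 1])
print(top_k_accuracy(scores, gold, k=1))  # 0.5
print(top_k_accuracy(scores, gold, k=2))  # 1.0
```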
Manual mapping of unlinked synsets
- Generate predictions for expert annotation: Generate the TSV file with top-5 predictions for manual annotation by the experts:
python code_files_nominal_classification/crossencoder_main.py --pipeline_phase predict --version_name version_0
The file is saved to /outputs_nominal_classification/results_CrossEncoderClassifier_{current_timestamp}/results.tsv
- Generate files for expert annotation: Create two files for the annotator in the folder /outputs_nominal_classification/results_linguist/:
- results_for_expert.xlsx, containing, for each predicate, its definition and the top-5 predictions in a human-readable format
- frames_infos.xlsx, containing additional information to aid the annotation process
The command requires the path to the top-5 predictions TSV file generated in the previous step:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase generate --model_results_path <path_to_top5_predictions_tsv_file>
Alternatively, to use our results, omit the --model_results_path parameter:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase generate
Important: running this command will overwrite the repository's "results_for_expert.xlsx" file, which contains the annotations provided by our experts.
- Parse manually annotated results: Parse the annotations from the "results_for_expert.xlsx" file:
python code_files_nominal_classification/manual_classification/generate_file_for_linguist.py --pipeline_phase parse
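For a quick look at the annotated spreadsheet before or after parsing, it can be opened with pandas (illustrative only; the parsing logic itself lives in the script above):

```python
# Illustrative only: inspect the annotated spreadsheet with pandas.
# The actual parsing is done by generate_file_for_linguist.py (--pipeline_phase parse).
# Path is relative to the repository root.
import pandas as pd

df = pd.read_excel(
    "outputs_nominal_classification/results_linguist/results_for_expert.xlsx"
)
print(df.columns.tolist())  # see which columns the annotators filled in
print(df.head())
```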
Predicate Nominalization
- Generate data with LLM: To generate the nominalized sentences, you can select an LLM from one of the following three providers:
- OpenAI (set the OPENAI_KEY environment variable in the .env file)
- Google AI Studio (set the GOOGLE_KEY environment variable in the .env file)
- Fireworks AI (set the FIREWORKS_KEY environment variable in the .env file)
Google's Gemini-Pro is the model used in the paper and is selected by default.
python code_files_nominal_srl/llm_nominalization.py --model gemini-pro
Alternatively, you can skip this step by downloading our already processed data: semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True.zip
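If you run the generation yourself, the provider key is read from the .env file. Below is a rough sketch of a single Gemini-Pro call using the google-generativeai SDK; the prompt, few-shot examples, and client used by llm_nominalization.py may differ:

```python
# Rough sketch: one Gemini-Pro call with the GOOGLE_KEY from the .env file.
# Assumes python-dotenv and the google-generativeai SDK; llm_nominalization.py
# may use a different client, prompt format, and generation parameters.
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads the .env file in the current working directory
genai.configure(api_key=os.environ["GOOGLE_KEY"])

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Rewrite the sentence using the noun 'destruction' instead of the verb "
    "'destroy': 'The army destroyed the city.'"
)
print(response.text)
```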
Verbal-to-Nominal Role Propagation
- Run the mapping scripts: Execute the following commands to perform verbal-to-nominal role propagation (Section 4.3):
python code_files_nominal_srl/data_mapping/sentence_mapping_rule.py
python code_files_nominal_srl/data_mapping/sentence_mapping_neural.py
Alternatively, you can skip this step by extracting our already processed files, located in outputs_nominal_srl/mapped_infos/:
- semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True.rar
- semcor_nominalized_sentences_prompt_format_6@model=gemini-pro@prompt=6@few_shots=10@temperature=0.7@system=1@shuffle_examples=True_NN.rar
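Conceptually, role propagation transfers each argument of the verbal predicate onto the nominalized paraphrase. The toy sketch below conveys the idea with plain string matching; the rule-based and neural scripts above are considerably more sophisticated:

```python
# Toy illustration of verbal-to-nominal role propagation via string matching.
# Not the repository's algorithm: sentence_mapping_rule.py and
# sentence_mapping_neural.py use far richer alignment strategies.
def propagate_roles(verbal_args: dict[str, str], nominal_sentence: str) -> dict:
    """verbal_args maps a role label to its argument text in the verbal
    sentence; returns the roles whose text also appears verbatim in the
    nominal paraphrase, with their character offsets."""
    propagated = {}
    for role, text in verbal_args.items():
        start = nominal_sentence.lower().find(text.lower())
        if start != -1:
            propagated[role] = (start, start + len(text))
    return propagated

verbal_args = {"ARG0": "the army", "ARG1": "the city"}
nominal = "The destruction of the city by the army was complete."
print(propagate_roles(verbal_args, nominal))
# {'ARG0': (31, 39), 'ARG1': (19, 27)}
```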
- Create the final SRL dataset: Generate the dataset to train an SRL model using the following command:
python code_files_nominal_srl/datasets_smart_splitting.py
Alternatively, you can skip this step and use the final dataset provided in outputs_nominal_srl/dataset/semcorgemini_2_nn/
We employed a pre-trained "roberta-base" model, fine-tuned with the training pipeline from the multi-srl repository; follow its instructions to train and evaluate the model. The process is the same as the one reported for the "Ontonotes" dataset, since our dataset in outputs_nominal_srl/dataset/semcorgemini_2_nn/ follows the same format. To simplify reproducing the results of the paper, we provide several configuration files (in *.yaml format) in the resources/roberta_custom_config directory; using them should yield the same metrics reported in the paper.
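Before launching the multi-srl pipeline, it can be useful to verify that the base encoder is available in your environment. A minimal sketch with the transformers library (training and evaluation themselves are handled by multi-srl):

```python
# Quick sanity check that the "roberta-base" encoder can be loaded and run.
# Training and evaluation are handled by the multi-srl pipeline and the
# provided *.yaml configuration files.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer("The destruction of the city by the army.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```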
This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.