Skip to content

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

License

Notifications You must be signed in to change notification settings

zjunlp/LREBench

Repository files navigation

LREBench: A low-resource relation extraction benchmark.

This repo is official implementation for the EMNLP2022 (Findings) paper LREBench: Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent PLMs, three schemes are comprehensively investigated to evaluate the performance in low-resource settings: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation technologies and self-training to generate more labeled in-domain data.

intro

Contents

Environment

To install requirements:

>> conda create -n LREBench python=3.9
>> conda activate LREBench
>> pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and need to be placed in the dataset folder. The expected files of one dataset contains rel2id.json, train.json and test.json.

Normal Prompt-based Tuning

prompt

1 Initialize Answer Words

Use the command below to get answer words first.

>> python get_label_word.py --modelpath roberta-large --dataset semeval

The {modelpath}_{dataset}.pt will be saved in the dataset folder, and you need to assign the modelpath and dataset with names of the pre-trained language model and the dataset to be used before.

2 Split Datasets

We provide the sampling code for obtaining 8-shot (sample_8shot.py) , 10% (sample_10.py) datasets and the rest datasets used as unlabeled data for self-training. If there are classes with less than 8 instances, these classes are removed in training and testing sets when sampling 8-shot datasets and new_test.json and new_rel2id.json are obtained.

>> python sample_8shot.py -h
    usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_dir INPUT_DIR, -i INPUT_DIR
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.
>> python sample_10.py -h
    usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_file INPUT_FILE, -i INPUT_FILE
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.

For example:

>> python sample_8.py -i dataset/semeval -o dataset/semeval/8-shot
>> cd dataset/semeval
>> mkdir 8-1
>> cp 8-shot/new_rel2id.json 8-1/rel2id.json
>> cp 8-shot/new_test.json 8-1/test.json
>> cp 8-shot/train_8_1.json 8-1/train.json
>> cp 8-shot/unlabel_8_1.json 8-1/label.json

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KonwPrompt on SemEval, CMeIE and ChemProt with the following commands:

>> bash scripts/semeval.sh  # RoBERTa-large
>> bash scripts/CMeIE.sh    # Chinese RoBERTa-large
>> bash scripts/ChemProt.sh # BioBERT-large

4 Different prompts

prompts

Simply add parameters to the scripts.

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to PTR

Balancing

balance

1 Re-sampling

  • Create the re-sampled training file based on the 10% training set by resample.py.

    >> python resample.py -h
        usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                The path of the training file.
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --rel_file REL_FILE, -r REL_FILE
                                the path of the relation file

    For example,

    >> mkdir dataset/semeval/10sa-1
    >> python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
    >> cd dataset/semeval
    >> cp rel2id.json test.json 10sa-1/
    >> cp sa/sa_1.json 10sa-1/train.json

2 Re-weighting Loss

Simply add the useloss parameter to script for choosing various re-weighting loss.

For exampe: --useloss MultiFocalLoss. (chocies: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss)

Data Augmentation

DA

1 Prepare the environment

>> pip install nlpaug nlpcda

Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).

2 Try different DA methods

We provide many data augmentation methods

  • English (nlpaug): TF-IDF, contextual word embedding (BERT and RoBERTa), and WordNet' Synonym (-lan==en, -d).

  • Chinese (nlpcda): Synonym (-lan==cn)

  • All DA methods can be implemented on contexts, entities and both of them (--locations).

  • Generate augmented data

    >> python DA.py -h
        usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
                      [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
                      [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
                      [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM] [--change_rate CHANGE_RATE]
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                the training set file
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --language {en,cn}, -lan {en,cn}
                                DA for English or Chinese
          --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                                List of positions that you want to manipulate
          --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                                Data augmentation method
          --model_dir MODEL_DIR, -m MODEL_DIR
                                the path of pretrained models used in DA methods
          --model_name MODEL_NAME, -mn MODEL_NAME
                                model from huggingface
          --create_num CREATE_NUM, -cn CREATE_NUM
                                The number of samples augmented from one instance.
          --change_rate CHANGE_RATE, -cr CHANGE_RATE
                                the changing rate of text

    Take context-level DA based on contextual word embedding on ChemProt for example:

    python DA.py \
        -i dataset/ChemProt/10/train10per_1.json \
        -o dataset/ChemProt/aug \
        -d word_embedding_bert \
        -mn dmis-lab/biobert-large-cased-v1.1 \
        -l sent1 sent2 sent3
  • Delete repeated instances and get the final augmented data

    >> python merge_dataset.py -h
    usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]
    
    optional arguments:
      -h, --help            show this help message and exit
      --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
                            List of input files containing datasets to merge
      --output_file OUTPUT_FILE, -o OUTPUT_FILE
                            Output file containing merged dataset

    For example:

    python merge_dataset.py \
        -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
        -o dataset/ChemProt/aug/merge.json

Self-training for Semi-supervised learning

st
  • Train a teacher model on a few labeled data (8-shot or 10%)
  • Place the unlabeled data label.json in the corresponding dataset folder.
  • Assigning pseudo labels using the trained teacher model: add --labeling True to the script and obtain the pseudo-labeled dataset label2.json.
  • Put the gold-labeled data and pseudo-labeled data together. For example:
    >> python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
    >> cd dataset/semeval
    >> cp rel2id.json test.json 10la-1/
  • Train the final student model: add --stutrain True to the script

Standard Fine-tuning Baseline

ft

Fine-tuning

Citation

If you use the code, please cite the following paper:

@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin  and
      Chen, Xiang  and
      Zhang, Ningyu  and
      Xie, Xin  and
      Chen, Xi  and
      Chen, Huajun",
    editor = "Goldberg, Yoav  and
      Kozareva, Zornitsa  and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}