This repo contains code for my undergrad thesis.
- SimCSE_PRO
  - Introduced Dynamic Difficulty
  - Incorporated Soft Prompts
  - Explored Discrete Optimization through Gumbel Softmax
    - Separate repo here
  - Devised Self-Weighted Sentence Embedding
$\hat{\boldsymbol{h}}_i^- = \begin{cases} \boldsymbol{h}_i^- & d^- \leq d^+ \\ \boldsymbol{h}_i + \lambda (\boldsymbol{h}_i^{-}-\boldsymbol{h}_i) & d^- > d^+ \end{cases}, \quad d^- = \lVert\boldsymbol{h}_i^{-}-\boldsymbol{h}_i\rVert, \quad d^+ = \lVert\boldsymbol{h}_i^{+}-\boldsymbol{h}_i\rVert$

$\lambda = \left(\tfrac{d^+}{d^-}\right)^{\zeta\cdot p_{avg}^{+}}, \quad \zeta \in (0,1), \quad p_{avg}^{+} = \frac{1}{100}\sum_{j\in[-100,-1]}{\exp (-\mathcal{L}^{j})}$
Notes:
- Inspired by the ICLR 2021 paper On Learning Universal Representations Across Languages.
- Intuition: move the negative sample closer to the input, with the interpolation strength governed by the recent training losses.
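A minimal sketch of this rule, using my own illustrative names and assumed `(batch, hidden)` tensor shapes (not the repo's exact code):

```python
# Illustrative sketch of the Dynamic Difficulty rule above.
# h, h_pos, h_neg: (batch, hidden) embeddings of the anchor, positive, and hard negative.
# recent_losses: a plain list of the most recent training loss values.
import torch

def sharpen_negative(h, h_pos, h_neg, recent_losses, zeta=0.5):
    d_pos = (h_pos - h).norm(dim=-1, keepdim=True)                    # d^+
    d_neg = (h_neg - h).norm(dim=-1, keepdim=True)                    # d^-
    # p_avg^+ : mean of exp(-loss) over the last 100 training steps
    p_avg = torch.exp(-torch.as_tensor(recent_losses[-100:])).mean()
    lam = (d_pos / d_neg).pow(zeta * p_avg)                           # lambda
    moved = h + lam * (h_neg - h)                                     # pull the negative toward the anchor
    # only adjust negatives that are farther from the anchor than the positive
    return torch.where(d_neg > d_pos, moved, h_neg)
```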
- Prefix-Tuning style soft prompts (backbone frozen)
- Output of [MASK] as the sentence representation, inspired by PromptBERT
- Two sets of parameters for the soft prompts (Dual Prompt)
  - used for constructing positive instances (a simplified sketch follows this list)
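A much simplified, hypothetical sketch of the Dual Prompt idea: it prepends learned prompt embeddings to the input instead of doing true Prefix-Tuning on keys and values, and every name in it (`DualSoftPromptEncoder`, `prompt_len`, `mask_pos`) is illustrative rather than the repo's actual interface:

```python
# Hypothetical sketch: two sets of soft-prompt embeddings over a frozen backbone,
# with the hidden state at the [MASK] position used as the sentence representation.
# Running the same sentence through both prompts yields a positive pair.
import torch
import torch.nn as nn
from transformers import AutoModel

class DualSoftPromptEncoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", prompt_len=10):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        for p in self.backbone.parameters():          # backbone stays frozen
            p.requires_grad = False
        hidden = self.backbone.config.hidden_size
        # two independent sets of soft-prompt parameters ("Dual Prompt")
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, hidden) * 0.02) for _ in range(2)]
        )

    def forward(self, input_ids, attention_mask, mask_pos, prompt_id=0):
        emb = self.backbone.get_input_embeddings()(input_ids)              # (B, L, H)
        prompt = self.prompts[prompt_id].unsqueeze(0).expand(emb.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, emb], dim=1)                    # prepend the prompt
        prompt_mask = torch.ones(
            emb.size(0), prompt.size(1),
            dtype=attention_mask.dtype, device=attention_mask.device,
        )
        out = self.backbone(
            inputs_embeds=inputs_embeds,
            attention_mask=torch.cat([prompt_mask, attention_mask], dim=1),
        )
        # representation = hidden state at the (shifted) [MASK] position
        rows = torch.arange(emb.size(0), device=emb.device)
        return out.last_hidden_state[rows, mask_pos + prompt.size(1)]
```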
This is a degenerate version of the Gumbel Softmax style optimization that I envisioned. Applying Gumbel Softmax to a token is equivalent to applying a continuous mask to the attention scores, yet the attention mechanism is already more fine-grained and expressive. It turns out that only the last hidden states lack such a re-weighting.
Note: includes an ablation of Dual Prompt.
Note: the ablation discards the learned weights and uses average pooling instead.
The learned weights match human intuition, putting less weight on articles ("a", "the"), punctuation (",", ".") and special tokens ([CLS], [SEP]).
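A minimal sketch of such self-weighted pooling under my assumed formulation (a small trainable scorer over the last hidden states, softmax-normalized over non-padding tokens); the ablation corresponds to replacing these learned weights with uniform averaging:

```python
# Sketch of self-weighted pooling: each token of the last hidden states gets a
# learned score, scores are softmax-normalized over non-padding positions, and
# the sentence embedding is the weighted sum of the hidden states.
import torch
import torch.nn as nn

class SelfWeightedPooler(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)    # learned per-token weight

    def forward(self, last_hidden_state, attention_mask):
        scores = self.scorer(last_hidden_state).squeeze(-1)             # (B, L)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                         # (B, L)
        return torch.einsum("bl,blh->bh", weights, last_hidden_state)   # (B, H)
```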
As it turned out, it is difficult to combine the strengths of these separate methods, and the combination introduces too many hyperparameters.
I refactored the code to decouple the training pipeline and provide support for highly customizable modules. This repo now supports:
- Custom Model Class
  - as well as a custom model init helper function, if needed
- Custom Model Forward Function, including
  - Custom Training Forward Function
  - Custom Inference Forward Function
- Custom Dataset Mapping Function for data preprocessing
- Custom Trainer
  - also a Custom Batcher Function for the STS evaluation process during training
  - also a Custom Collator
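A purely hypothetical sketch of what such extension points can look like is given below; the real hooks are wired through the Hydra configs described next, and every name here (`MyModel`, `my_map_fn`, `MyTrainer`) is illustrative only:

```python
# Hypothetical shapes of the customizable pieces (illustrative, not the repo's API).
import torch.nn as nn
from transformers import AutoModel, Trainer

class MyModel(nn.Module):
    """Custom Model Class with separate training / inference forward functions."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)

    def training_forward(self, **batch):   # Custom Training Forward Function
        ...

    def inference_forward(self, **batch):  # Custom Inference Forward Function
        ...

def my_map_fn(examples, tokenizer):        # Custom Dataset Mapping Function
    """Data preprocessing, e.g. passed to datasets.Dataset.map."""
    return tokenizer(examples["text"], truncation=True)

class MyTrainer(Trainer):                  # Custom Trainer
    """Could plug in a custom batcher for STS evaluation and a custom collator."""
```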
The original SimCSE uses the Huggingface argument parser for passing arguments. To simplify the bash scripts for training and evaluation, and to leverage hierarchical configurations and automatic logging, I switched to Hydra. The evaluation code is modified accordingly so that the training script can be used seamlessly for evaluation. Familiarity with the original SimCSE repo will be really helpful here. Below is adapted from the original SimCSE README.
This repository adapts the code and pre-trained models of the paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.
- Overview
- Getting Started
- Model List
- Use SimCSE with Huggingface
- Train SimCSE
- Bugs or Questions?
- Citation
- SimCSE Elsewhere
SimCSE proposes a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using entailment pairs as positives and contradiction pairs as hard negatives. The following figure is an illustration of our models.
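For reference, a minimal sketch of the contrastive objective (cosine similarities between the two views scaled by a temperature, with the positive on the diagonal and in-batch negatives elsewhere); this is my own condensed version, not the exact training code:

```python
# Condensed sketch of the SimCSE-style contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temp=0.05):
    """z1, z2: (batch, hidden) embeddings of two views of the same sentences."""
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp  # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)   # positives on the diagonal
    return F.cross_entropy(sim, labels)
```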
Our released models are listed as follows. You can import these models by using the `simcse` package or using HuggingFace's Transformers.
Naming rules: `unsup` and `sup` represent "unsupervised" (trained on the Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.
I have tried to reproduce the results in the paper following the exact hyperparameters; a comparison of the results is given here. On average, the performance reported in the paper is higher than my reproduction. I stick to the reproduction results.
Download the training datasets by running `data/download_wiki.sh` and `data/download_nli.sh`.
For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets.
Download the evaluation datasets by running
cd SentEval/data/downstream/
bash download_dataset.sh
First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the correct `1.7.1` version corresponding to your platforms/CUDA versions. PyTorch versions higher than `1.7.1` should also work. For example, if you use Linux and CUDA 11 (how to check CUDA version), install PyTorch by the following command,
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
If you instead use CUDA `<11` or CPU, install PyTorch by the following command,
pip install torch==1.7.1
Then run the following script to install the remaining dependencies,
pip install -r requirements.txt
Training scripts
All training scripts can be found under `scripts/`.
All arguments are passed through `hydra` configs. `configs/default.yaml` provides an example; other variants inherit from it or override it. The configs are maintained in a hierarchical layout.
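For orientation, a Hydra entry point typically looks like the sketch below; the actual config groups and keys are defined by the YAML files under `configs/`, so nothing in this sketch should be read as the repo's exact interface:

```python
# Generic Hydra entry-point sketch: Hydra composes configs/default.yaml
# (plus any command-line overrides) into a single hierarchical config object.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="default")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))   # inspect the resolved, hierarchical config

if __name__ == "__main__":
    main()
```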
All original SimCSE arguments should still work. Their descriptions are as follows:
- `--train_file`: Training file path. We support "txt" files (one line per sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data in the same format.
- `--model_name_or_path`: Pre-trained checkpoints to start with. For now we support BERT-based models (`bert-base-uncased`, `bert-large-uncased`, etc.) and RoBERTa-based models (`RoBERTa-base`, `RoBERTa-large`, etc.).
- `--temp`: Temperature for the contrastive loss.
- `--pooler_type`: Pooling method. It's the same as the `--pooler_type` in the evaluation part.
- `--mlp_only_train`: We have found that for unsupervised SimCSE, it works better to train the model with the MLP layer but test the model without it. You should use this argument when training unsupervised SimCSE models.
- `--hard_negative_weight`: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value). See the sketch after this list for why the logarithm is used.
- `--do_mlm`: Whether to use the MLM auxiliary objective. If True:
  - `--mlm_weight`: Weight for the MLM objective.
  - `--mlm_probability`: Masking rate for the MLM objective.
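On why `--hard_negative_weight` is a logarithm: adding log w to a hard-negative logit multiplies that candidate's weight inside the softmax by w, so a weight of 1 corresponds to adding 0. A small illustrative sketch (assumed logit layout, not the repo's exact code):

```python
# Illustrative: apply a logarithmic hard-negative weight as an additive bias
# on the similarity logits before the softmax/cross-entropy.
import torch

def apply_hard_negative_weight(cos_sim, hard_negative_weight=0.0):
    """cos_sim: (batch, batch + num_hard_neg) logits; columns beyond the first
    `batch` are assumed to correspond to hard negatives."""
    batch = cos_sim.size(0)
    bias = torch.zeros_like(cos_sim)
    bias[:, batch:] = hard_negative_weight   # log-weight on the hard-negative columns
    return cos_sim + bias
```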
Other arguments are standard Huggingface `transformers` training arguments. Some of the often-used arguments are `--output_dir`, `--learning_rate`, and `--per_device_train_batch_size`. In our example scripts, we also evaluate the model on the STS-B development set during training and save the best checkpoint.
For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA/other software may lead to slightly different performance.
Hyperparameters
We use the following hyperparameters for training SimCSE:

|                       | Unsup. BERT | Unsup. RoBERTa | Sup. |
|-----------------------|-------------|----------------|------|
| Batch size            | 64          | 512            | 512  |
| Learning rate (base)  | 3e-5        | 1e-5           | 5e-5 |
| Learning rate (large) | 1e-5        | 3e-5           | 1e-5 |
Evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See the paper (Appendix B) for evaluation details.
`scripts/eval.sh` is a sample evaluation script.
The SimCSE arguments for the evaluation script are as follows:
- `--model_name_or_path`: The name or path of a `transformers`-based pre-trained checkpoint. You can directly use the models in the above table, e.g., `princeton-nlp/sup-simcse-bert-base-uncased`.
- `--pooler`: Pooling method (see the pooling sketch after this argument list). Now we support
  - `cls` (default): Use the representation of the `[CLS]` token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation). If you use supervised SimCSE, you should use this option.
  - `cls_before_pooler`: Use the representation of the `[CLS]` token without the extra linear+activation. If you use unsupervised SimCSE, you should take this option.
  - `avg`: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa (paper), you should use this option.
  - `avg_top2`: Average embeddings of the last two layers.
  - `avg_first_last`: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
- `--mode`: Evaluation mode
  - `test` (default): The default test mode. To faithfully reproduce our results, you should use this option.
  - `dev`: Report the development set results. Note that in STS tasks, only `STS-B` and `SICK-R` have development sets, so we only report their numbers. It also takes a fast mode for transfer tasks, so the running time is much shorter than the `test` mode (though numbers are slightly lower).
  - `fasttest`: It is the same as `test`, but with a fast mode so the running time is much shorter, but the reported numbers may be lower (only for transfer tasks).
- `--task_set`: What set of tasks to evaluate on (if set, it will override `--tasks`)
  - `sts` (default): Evaluate on STS tasks, including `STS 12~16`, `STS-B` and `SICK-R`. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
  - `transfer`: Evaluate on transfer tasks.
  - `full`: Evaluate on both STS and transfer tasks.
  - `na`: Manually set tasks by `--tasks`.
- `--tasks`: Specify which dataset(s) to evaluate on. Will be overridden if `--task_set` is not `na`. See the code for a full list of tasks.
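The pooling options above can be summarized with the following sketch (assumed shapes; `outputs` is a `transformers` model output obtained with `output_hidden_states=True`; this is not the evaluation script's exact code):

```python
# Sketch of the supported pooling strategies.
import torch

def pool(outputs, attention_mask, pooler="cls_before_pooler"):
    last = outputs.last_hidden_state                    # (B, L, H)
    mask = attention_mask.unsqueeze(-1).float()         # (B, L, 1)
    if pooler == "cls":
        return outputs.pooler_output                    # [CLS] + linear + tanh
    if pooler == "cls_before_pooler":
        return last[:, 0]                               # raw [CLS] hidden state
    if pooler == "avg":
        return (last * mask).sum(1) / mask.sum(1)
    if pooler == "avg_top2":
        top2 = (outputs.hidden_states[-1] + outputs.hidden_states[-2]) / 2.0
        return (top2 * mask).sum(1) / mask.sum(1)
    if pooler == "avg_first_last":
        first_last = (outputs.hidden_states[0] + outputs.hidden_states[-1]) / 2.0
        return (first_last * mask).sum(1) / mask.sum(1)
    raise ValueError(f"unknown pooler: {pooler}")
```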
Please cite our paper if you use SimCSE in your work:
@inproceedings{gao2021simcse,
title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
year={2021}
}
If you find this code useful, feel free to give credit~
@Misc{Zhejian2022SimCSE_PRO,
author = {Zhejian, Zhou},
title = {SimCSE_PRO},
howpublished = {Github},
year = {2022},
url = {https://github.com/Ja1Zhou/SimCSE_PRO}
}