add surprise similarity score #2287

Open · wants to merge 10 commits into master
Conversation

vincentmin

This PR implements the surprise similarity score of https://arxiv.org/pdf/2308.09765.pdf.
An existing implementation of the paper is available at https://github.com/MeetElise/surprise-similarity; this PR implements a minimalistic version for easier integration with the sentence-transformers package.

util.py has grown too large (590 lines) and would benefit from a refactoring into separate modules. This is best done in a separate PR.
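
For readers who haven't opened the paper, here is a rough sketch of the idea (function names and the exact normalisation are illustrative and may differ from this PR's util.py and from the reference implementation): a pair's cosine similarity is z-scored against the distribution of similarities between one of the sentences and an ensemble of reference texts, and the resulting deviation is squashed into [-1, 1] with the error function.

```python
# Illustrative sketch only; the exact normalisation (symmetrisation over both sentences,
# scaling inside erf, ensemble handling) may differ from the PR and the paper.
import torch
import torch.nn.functional as F

def surprise_dev(a_emb: torch.Tensor, b_emb: torch.Tensor, ensemble_emb: torch.Tensor) -> torch.Tensor:
    """Unnormalised 'surprise': how many standard deviations the pair's cosine
    similarity lies above the mean similarity of b with the ensemble texts."""
    a = F.normalize(a_emb, dim=-1)
    b = F.normalize(b_emb, dim=-1)
    e = F.normalize(ensemble_emb, dim=-1)
    pair_sim = (a * b).sum(dim=-1)   # cosine similarity per pair
    ens_sim = b @ e.T                # cosine similarity of each b with every ensemble text
    return (pair_sim - ens_sim.mean(dim=-1)) / ens_sim.std(dim=-1)

def surprise_score(a_emb: torch.Tensor, b_emb: torch.Tensor, ensemble_emb: torch.Tensor) -> torch.Tensor:
    """Normalised score in [-1, 1] via the Gaussian error function."""
    return torch.erf(surprise_dev(a_emb, b_emb, ensemble_emb))
```

When no explicit ensemble is passed, the evaluated sentences themselves can serve as the ensemble, which is presumably what the "ensemble is not provided" runs below fall back to.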

@tomaarsen
Collaborator

Hello!

I've resolved some of the merge conflicts to play around with this manually. I've also added a debugging commit to add the surprise similarity to EmbeddingSimilarityEvaluator. I then ran some tests, which resulted in the following:

  • ensemble is the training data:
    • CoSENTLoss:
      2024-02-06 21:08:50 - Cosine-Similarity :       Pearson: 0.8301 Spearman: 0.8428
      2024-02-06 21:08:50 - Manhattan-Distance:       Pearson: 0.8430 Spearman: 0.8386
      2024-02-06 21:08:50 - Euclidean-Distance:       Pearson: 0.8436 Spearman: 0.8394
      2024-02-06 21:08:50 - Dot-Product-Similarity:   Pearson: 0.4730 Spearman: 0.4651
      2024-02-06 21:08:50 - Surprise-Similarity:      Pearson: 0.3608 Spearman: 0.7497
      
    • MNRL:
      2024-02-06 21:11:20 - Cosine-Similarity :       Pearson: 0.6984 Spearman: 0.6986
      2024-02-06 21:11:20 - Manhattan-Distance:       Pearson: 0.7206 Spearman: 0.7144
      2024-02-06 21:11:20 - Euclidean-Distance:       Pearson: 0.7211 Spearman: 0.7149
      2024-02-06 21:11:20 - Dot-Product-Similarity:   Pearson: 0.4269 Spearman: 0.4124
      2024-02-06 21:11:20 - Surprise-Similarity:      Pearson: 0.2740 Spearman: 0.5605
      
  • ensemble is not provided:
    • CoSENTLoss:
      2024-02-06 21:15:09 - Cosine-Similarity :       Pearson: 0.8293 Spearman: 0.8417
      2024-02-06 21:15:09 - Manhattan-Distance:       Pearson: 0.8426 Spearman: 0.8382
      2024-02-06 21:15:09 - Euclidean-Distance:       Pearson: 0.8430 Spearman: 0.8387
      2024-02-06 21:15:09 - Dot-Product-Similarity:   Pearson: 0.4765 Spearman: 0.4687
      2024-02-06 21:15:09 - Surprise-Similarity:      Pearson: 0.3956 Spearman: 0.7212
      
    • MNRL:
      2024-02-06 21:13:28 - Cosine-Similarity :       Pearson: 0.6958 Spearman: 0.6975
      2024-02-06 21:13:28 - Manhattan-Distance:       Pearson: 0.7111 Spearman: 0.7076
      2024-02-06 21:13:28 - Euclidean-Distance:       Pearson: 0.7120 Spearman: 0.7079
      2024-02-06 21:13:28 - Dot-Product-Similarity:   Pearson: 0.4387 Spearman: 0.4257
      2024-02-06 21:13:28 - Surprise-Similarity:      Pearson: 0.3486 Spearman: 0.5562
      

As you can see, the surprise similarity seems to result in lower Spearman correlations. This suggests that embeddings compared via the surprise similarity correspond less well to the true semantic similarities. In other words, for these experiments, it does not make sense to use the surprise similarity over the cosine similarity.
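
For completeness, a minimal sketch of how this kind of comparison can be reproduced outside the evaluator (model and dataset names are simply the ones referenced elsewhere in this thread; the real EmbeddingSimilarityEvaluator additionally handles batching, output files, etc.):

```python
# Score STS test pairs with cosine similarity and report the same Pearson/Spearman
# correlations against the gold labels that the evaluator logs above. A custom metric
# such as the surprise score can be computed on emb1/emb2 and reported alongside.
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
data = load_dataset("mteb/stsbenchmark-sts", split="test")

emb1 = model.encode(data["sentence1"], convert_to_tensor=True, normalize_embeddings=True)
emb2 = model.encode(data["sentence2"], convert_to_tensor=True, normalize_embeddings=True)
cosine_scores = (emb1 * emb2).sum(dim=-1).cpu().numpy()

print("Pearson: ", pearsonr(data["score"], cosine_scores)[0])
print("Spearman:", spearmanr(data["score"], cosine_scores)[0])
```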

Please do let me know if I made a mistake with my implementation in EmbeddingSimilarityEvaluator! I think the paper is quite fascinating, and I would love it if the surprise similarity indeed somehow allows for embeddings to be compared more effectively.

(We can revert the Debugging commit if we ever choose to move forward with this PR)

cc: @mlschillo this might also interest you!

  • Tom Aarsen

@tomaarsen
Collaborator

I can also implement this on top of the BinaryClassificationEvaluator and see whether the surprise similarity score helps with classification. It could be worth a shot.

@VMinB12

VMinB12 commented Feb 7, 2024

Hi @tomaarsen ! It's great to see some interest in this PR, thank you.

I was surprised by your benchmark results and did a small extension to include surprise_dev as well. The difference between surprise_score and surprise_dev is that the former produces a score between -1 and 1 and the latter is unnormalised. This normalisation for surprise_score happens with an error function and I've observed that it asymptotes to 1 very quickly, leading to an accumulation of scores of 1.0, resulting in reduced distinguishability. With surprise_dev I get Pearson: 0.7722 Spearman: 0.7635. This is still lower than the cosine similarity though.
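
To illustrate that saturation (assuming the deviation is passed to the error function more or less directly; the exact scaling in this PR may differ):

```python
import math

# erf saturates quickly: any pair that sits a couple of standard deviations above the
# ensemble mean already maps to ~1.0, matching the observed pile-up of scores at 1.0.
for dev in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"dev={dev:.1f}  erf(dev)={math.erf(dev):.5f}")
# dev=0.5  erf(dev)=0.52050
# dev=1.0  erf(dev)=0.84270
# dev=2.0  erf(dev)=0.99532
# dev=3.0  erf(dev)=0.99998
# dev=4.0  erf(dev)=1.00000
```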

The fact that for surprise_dev the Pearson metric is now of the same order as the Spearman metric would indicate that the reduced Pearson value for surprise_score is due to the non-linearity of the error function. In any case, we should focus on the Spearman metric, which measures monotonic relationships.

Maybe @mlschillo can comment on whether this is expected and if it is worthwhile to do a classification type benchmark. A classification type benchmark would be closer to the experiments in the paper than the current benchmark, so if you have time @tomaarsen that would be interesting imo.

@tomaarsen
Collaborator

> Hi @tomaarsen ! It's great to see some interest in this PR, thank you.
>
> I was surprised by your benchmark results and did a small extension to include surprise_dev as well. The difference between surprise_score and surprise_dev is that the former produces a score between -1 and 1 and the latter is unnormalised. This normalisation for surprise_score happens with an error function and I've observed that it asymptotes to 1 very quickly, leading to an accumulation of scores of 1.0, resulting in reduced distinguishability.

Indeed. Something interesting that I noticed was that the mean of the surprise_score of the diagonals (i.e. the pairwise scores, the thing that we care about in this evaluator) is a very high 0.94, while the mean of all surprise_score values (i.e. all possible pairs) was a much more normal 0.57.

This is perhaps somewhat indicative of the evaluation set. See here some samples from the training set, searched for "obama": https://huggingface.co/datasets/mteb/stsbenchmark-sts/viewer/default/train?q=obama&row=5726
The "score" refers to the similarity of the two sentences on a scale from 0 to 5 (0 is dissimilar, 5 is identical). As you can see, there are pairs with very low scores that are indeed not semantically similar (Obama calls for international front against IS, Obama vows to save Iraqis stranded on mountain) but are clearly on the same topic: it's a hard negative pair. When the ensemble contains all kinds of texts, many of which are completely unrelated (e.g. Obama calls for international front against IS vs A plane is taking off.), I can see why the two semantically dissimilar Obama-related sentences get a higher surprise similarity score.

In other words, this might explain why the mean surprise score of the evaluated (diagonal) pairs is 0.94 and not something near 0.5.

In short, the surprise score might give pairs that share a subject/topic but differ in semantics a higher similarity score, which may cause poor performance on hard negative pairs.
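
A tiny sketch of this diagnostic (random numbers stand in for an actual surprise_score matrix): the evaluator only consumes the diagonal of the pairwise score matrix, so comparing the diagonal mean with the full-matrix mean shows how inflated the evaluated pairs are relative to arbitrary cross-pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for surprise_score computed over all sentence1 x sentence2 combinations.
S = rng.uniform(-1.0, 1.0, size=(100, 100))

print("mean of diagonal (evaluated pairs):", np.diag(S).mean())
print("mean over all possible pairs:      ", S.mean())
```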

> Maybe @mlschillo can comment on whether this is expected and if it is worthwhile to do a classification type benchmark. A classification type benchmark would be closer to the experiments in the paper than the current benchmark, so if you have time @tomaarsen that would be interesting imo.

I can definitely have a look at trying to run a classification benchmark when I have a bit more time.

  • Tom Aarsen

@mlschillo

Hey @VMinB12 and @tomaarsen, thanks for the interest and discussion!
After taking a quick look, I agree that surprise doesn't do well in this experiment. I just checked the zero-shot results and they agree with the overall picture: the cosine Spearman is higher.

I definitely agree with what @tomaarsen is pointing out here:

> Indeed. Something interesting that I noticed was that the mean of the surprise_score of the diagonals (i.e. the pairwise scores, the thing that we care about in this evaluator) is a very high 0.94, while the mean of all surprise_score values (i.e. all possible pairs) was a much more normal 0.57.
>
> This is perhaps somewhat indicative of the evaluation set. See here some samples from the training set, searched for "obama": https://huggingface.co/datasets/mteb/stsbenchmark-sts/viewer/default/train?q=obama&row=5726
> The "score" refers to the similarity of the two sentences on a scale from 0 to 5 (0 is dissimilar, 5 is identical). As you can see, there are pairs with very low scores that are indeed not semantically similar (Obama calls for international front against IS, Obama vows to save Iraqis stranded on mountain) but are clearly on the same topic: it's a hard negative pair. When the ensemble contains all kinds of texts, many of which are completely unrelated (e.g. Obama calls for international front against IS vs A plane is taking off.), I can see why the two semantically dissimilar Obama-related sentences get a higher surprise similarity score.
>
> In other words, this might explain why the mean surprise score of the evaluated (diagonal) pairs is 0.94 and not something near 0.5.
>
> In short, the surprise score might give pairs that share a subject/topic but differ in semantics a higher similarity score, which may cause poor performance on hard negative pairs.

Perhaps the most important feature of the surprise score is its dependence on the ensemble, but this does mean that it will fail in some cases. It could be that the construction of this benchmark uses clusters of topics (e.g. kitchen tasks, transportation related, music related, news/current events) and the topic clustering hurts the score's ability to distinguish within a cluster (as in the Obama example). This might also explain why @VMinB12 is able to do better without normalizing. But I would think this would end up being more of a feature than a bug in classification tasks, so I also think it's worth the experiment.

I also want to loop in @tbachlechner and @MCMartone in case they have more insightful comments.

@VMinB12

VMinB12 commented Feb 8, 2024

It would be great to understand how common this pitfall of the surprise score is. Perhaps this dataset is an outlier, or perhaps there is a general insight to be extracted here. Repeating @tomaarsen's current exercise on other datasets would be informative.

@VMinB12

VMinB12 commented Feb 8, 2024

I did some tests myself using all-MiniLM-L6-v2 and a random collection of datasets I pulled from Hugging Face. I'm not sure about the quality of some of these datasets, or whether this evaluation truly makes sense for them. Here are the results:

2024-02-08 23:08:25 - EmbeddingSimilarityEvaluator: Evaluating the model on Ukhushn/home-depot-test dataset:
2024-02-08 23:09:48 - Cosine-Similarity :       Pearson: 0.4157 Spearman: 0.4074
2024-02-08 23:09:48 - Manhattan-Distance:       Pearson: 0.4186 Spearman: 0.4106
2024-02-08 23:09:48 - Euclidean-Distance:       Pearson: 0.4149 Spearman: 0.4074
2024-02-08 23:09:48 - Dot-Product-Similarity:   Pearson: 0.4157 Spearman: 0.4074
2024-02-08 23:09:48 - Surprise-Similarity:      Pearson: 0.2783 Spearman: 0.3593
2024-02-08 23:09:48 - Surprise-Similarity-Dev:  Pearson: 0.3697 Spearman: 0.3592


2024-02-08 23:09:50 - EmbeddingSimilarityEvaluator: Evaluating the model on pranavkotz/anatomy_dataset-test dataset:
2024-02-08 23:09:58 - Cosine-Similarity :       Pearson: 0.9484 Spearman: 0.9297
2024-02-08 23:09:58 - Manhattan-Distance:       Pearson: 0.9430 Spearman: 0.9304
2024-02-08 23:09:58 - Euclidean-Distance:       Pearson: 0.9430 Spearman: 0.9297
2024-02-08 23:09:58 - Dot-Product-Similarity:   Pearson: 0.9484 Spearman: 0.9297
2024-02-08 23:09:58 - Surprise-Similarity:      Pearson: 0.4786 Spearman: 0.8827
2024-02-08 23:09:58 - Surprise-Similarity-Dev:  Pearson: 0.8946 Spearman: 0.8902

On both datasets the surprise score is outperformed by cosine similarity. Do you know of any other datasets that we could include here?

@tomaarsen
Collaborator

tomaarsen commented Feb 9, 2024

I've also done some more experiments (feel free to reproduce with train_sts_qqp.py), now with binary classification of whether two questions on Quora are the same. The model is trained on positive pairs exclusively, and then a binary classification evaluator is used to assess how well the model's similarity scores can be used to classify. In particular, it computes the similarity scores for all pairs and then determines the similarity score threshold at which to classify.
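
For context, here is a hedged sketch of the kind of threshold sweep such an evaluator performs for a similarity score where higher means more similar (the actual BinaryClassificationEvaluator also handles distance metrics, F1 and average precision; names here are illustrative):

```python
import numpy as np

def best_accuracy_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Try every observed score as a cut-off and keep the one with the best accuracy."""
    best_acc, best_thr = 0.0, 0.0
    for thr in np.unique(scores):
        acc = float(((scores >= thr) == labels.astype(bool)).mean())
        if acc > best_acc:
            best_acc, best_thr = acc, float(thr)
    return best_acc, best_thr

# If most surprise scores pile up at ~1.0, the sweep is effectively forced to pick 1.0.
scores = np.array([0.99, 1.00, 1.00, 1.00, 0.97, 1.00])
labels = np.array([0, 1, 1, 0, 0, 1])
print(best_accuracy_threshold(scores, labels))  # -> (0.833..., 1.0)
```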

Here are the results on the test set:

2024-02-09 08:53:49 - Read QQP test dataset
2024-02-09 08:53:49 - Binary Accuracy Evaluation of the model on qqp-test dataset:
2024-02-09 08:55:04 - Accuracy with Cosine-Similarity:           74.31  (Threshold: 0.7713)
2024-02-09 08:55:04 - F1 with Cosine-Similarity:                 71.78  (Threshold: 0.7187)
2024-02-09 08:55:04 - Precision with Cosine-Similarity:          61.46
2024-02-09 08:55:04 - Recall with Cosine-Similarity:             86.28
2024-02-09 08:55:04 - Average Precision with Cosine-Similarity:  71.29

2024-02-09 08:55:04 - Accuracy with Manhattan-Distance:           74.75 (Threshold: 193.8427)
2024-02-09 08:55:04 - F1 with Manhattan-Distance:                 72.08 (Threshold: 223.2913)
2024-02-09 08:55:04 - Precision with Manhattan-Distance:          60.27
2024-02-09 08:55:04 - Recall with Manhattan-Distance:             89.63
2024-02-09 08:55:04 - Average Precision with Manhattan-Distance:  71.24

2024-02-09 08:55:04 - Accuracy with Euclidean-Distance:           74.82 (Threshold: 8.8388)
2024-02-09 08:55:04 - F1 with Euclidean-Distance:                 72.11 (Threshold: 9.8055)
2024-02-09 08:55:04 - Precision with Euclidean-Distance:          61.53
2024-02-09 08:55:04 - Recall with Euclidean-Distance:             87.09
2024-02-09 08:55:04 - Average Precision with Euclidean-Distance:  71.22

2024-02-09 08:55:04 - Accuracy with Dot-Product:           71.28        (Threshold: 136.5604)
2024-02-09 08:55:04 - F1 with Dot-Product:                 68.41        (Threshold: 109.5986)
2024-02-09 08:55:04 - Precision with Dot-Product:          55.95
2024-02-09 08:55:04 - Recall with Dot-Product:             88.01
2024-02-09 08:55:04 - Average Precision with Dot-Product:  65.80

2024-02-09 08:55:04 - Accuracy with Surprise-Similarity:           61.62        (Threshold: 1.0000)
2024-02-09 08:55:04 - F1 with Surprise-Similarity:                 66.48        (Threshold: 1.0000)
2024-02-09 08:55:04 - Precision with Surprise-Similarity:          50.40
2024-02-09 08:55:04 - Recall with Surprise-Similarity:             97.61
2024-02-09 08:55:04 - Average Precision with Surprise-Similarity:  50.34

2024-02-09 08:55:04 - Accuracy with Surprise-Similarity-Dev:           70.71    (Threshold: 7.6938)
2024-02-09 08:55:04 - F1 with Surprise-Similarity-Dev:                 68.91    (Threshold: 6.5085)
2024-02-09 08:55:04 - Precision with Surprise-Similarity-Dev:          55.44
2024-02-09 08:55:04 - Recall with Surprise-Similarity-Dev:             91.03
2024-02-09 08:55:04 - Average Precision with Surprise-Similarity-Dev:  62.24

There are a few things to note here:

  1. The chosen threshold for Surprise-Similarity is 1 (or at least, it rounds to 1.0000), likely because so many of the scores were nearly exactly 1. This indicates that the similarity score is too eager to assign very high similarity scores, even when lower scores would be appropriate. For comparison, the cosine similarity threshold for accuracy was 0.7713. Because of this, the Surprise-Similarity-Dev accuracy is much higher, likely because its threshold can be set more effectively.
  2. The accuracy/F1 for the surprise(-dev) scores is lower than for any of the other four similarity metrics.

In this experiment, the surprise score is also not a good option, I'm afraid.

Edit:
Not providing an explicit ensemble performs even worse:

2024-02-09 09:16:12 - Accuracy with Surprise-Similarity:           60.98        (Threshold: 1.0000)
2024-02-09 09:16:12 - F1 with Surprise-Similarity:                 64.98        (Threshold: 1.0000)
2024-02-09 09:16:12 - Precision with Surprise-Similarity:          49.53
2024-02-09 09:16:12 - Recall with Surprise-Similarity:             94.43
2024-02-09 09:16:12 - Average Precision with Surprise-Similarity:  49.75

2024-02-09 09:16:12 - Accuracy with Surprise-Similarity-Dev:           68.61    (Threshold: 8.2486)
2024-02-09 09:16:12 - F1 with Surprise-Similarity-Dev:                 64.91    (Threshold: 5.2564)
2024-02-09 09:16:12 - Precision with Surprise-Similarity-Dev:          49.71
2024-02-09 09:16:12 - Recall with Surprise-Similarity-Dev:             93.52
2024-02-09 09:16:12 - Average Precision with Surprise-Similarity-Dev:  60.83

  • Tom Aarsen
