add surprise similarity score #2287

Open · wants to merge 10 commits into master
Conversation

vincentmin

This PR implements the surprise similarity score of https://arxiv.org/pdf/2308.09765.pdf.
An existing implementation of the paper is available at https://github.com/MeetElise/surprise-similarity; this PR implements a minimalistic version for easier integration with the sentence-transformers package.

util.py has grown too large (590 lines) and would benefit from a refactoring into separate modules. This is best done in a separate PR.
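
For readers who haven't opened the paper, here is a rough sketch of the idea (function names and the exact normalisation are illustrative and may differ from this PR's util.py and from the reference implementation): a pair's cosine similarity is z-scored against the distribution of similarities between one of the sentences and an ensemble of reference texts, and the resulting deviation is squashed into [-1, 1] with the error function.

```python
# Illustrative sketch only; the exact normalisation (symmetrisation over both sentences,
# scaling inside erf, ensemble handling) may differ from the PR and the paper.
import torch
import torch.nn.functional as F

def surprise_dev(a_emb: torch.Tensor, b_emb: torch.Tensor, ensemble_emb: torch.Tensor) -> torch.Tensor:
    """Unnormalised 'surprise': how many standard deviations the pair's cosine
    similarity lies above the mean similarity of b with the ensemble texts."""
    a = F.normalize(a_emb, dim=-1)
    b = F.normalize(b_emb, dim=-1)
    e = F.normalize(ensemble_emb, dim=-1)
    pair_sim = (a * b).sum(dim=-1)   # cosine similarity per pair
    ens_sim = b @ e.T                # cosine similarity of each b with every ensemble text
    return (pair_sim - ens_sim.mean(dim=-1)) / ens_sim.std(dim=-1)

def surprise_score(a_emb: torch.Tensor, b_emb: torch.Tensor, ensemble_emb: torch.Tensor) -> torch.Tensor:
    """Normalised score in [-1, 1] via the Gaussian error function."""
    return torch.erf(surprise_dev(a_emb, b_emb, ensemble_emb))
```

When no explicit ensemble is passed, the evaluated sentences themselves can serve as the ensemble, which is presumably what the "ensemble is not provided" runs below fall back to.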

@tomaarsen
Collaborator

Hello!

I've resolved some of the merge conflicts to play around with this manually. I've also added a debugging commit to add the surprise similarity to EmbeddingSimilarityEvaluator. I then ran some tests, which resulted in the following:

  • ensemble is the training data:
    • CoSENTLoss:
      2024-02-06 21:08:50 - Cosine-Similarity :       Pearson: 0.8301 Spearman: 0.8428
      2024-02-06 21:08:50 - Manhattan-Distance:       Pearson: 0.8430 Spearman: 0.8386
      2024-02-06 21:08:50 - Euclidean-Distance:       Pearson: 0.8436 Spearman: 0.8394
      2024-02-06 21:08:50 - Dot-Product-Similarity:   Pearson: 0.4730 Spearman: 0.4651
      2024-02-06 21:08:50 - Surprise-Similarity:      Pearson: 0.3608 Spearman: 0.7497
      
    • MNRL:
      2024-02-06 21:11:20 - Cosine-Similarity :       Pearson: 0.6984 Spearman: 0.6986
      2024-02-06 21:11:20 - Manhattan-Distance:       Pearson: 0.7206 Spearman: 0.7144
      2024-02-06 21:11:20 - Euclidean-Distance:       Pearson: 0.7211 Spearman: 0.7149
      2024-02-06 21:11:20 - Dot-Product-Similarity:   Pearson: 0.4269 Spearman: 0.4124
      2024-02-06 21:11:20 - Surprise-Similarity:      Pearson: 0.2740 Spearman: 0.5605
      
  • ensemble is not provided:
    • CoSENTLoss:
      2024-02-06 21:15:09 - Cosine-Similarity :       Pearson: 0.8293 Spearman: 0.8417
      2024-02-06 21:15:09 - Manhattan-Distance:       Pearson: 0.8426 Spearman: 0.8382
      2024-02-06 21:15:09 - Euclidean-Distance:       Pearson: 0.8430 Spearman: 0.8387
      2024-02-06 21:15:09 - Dot-Product-Similarity:   Pearson: 0.4765 Spearman: 0.4687
      2024-02-06 21:15:09 - Surprise-Similarity:      Pearson: 0.3956 Spearman: 0.7212
      
    • MNRL:
      2024-02-06 21:13:28 - Cosine-Similarity :       Pearson: 0.6958 Spearman: 0.6975
      2024-02-06 21:13:28 - Manhattan-Distance:       Pearson: 0.7111 Spearman: 0.7076
      2024-02-06 21:13:28 - Euclidean-Distance:       Pearson: 0.7120 Spearman: 0.7079
      2024-02-06 21:13:28 - Dot-Product-Similarity:   Pearson: 0.4387 Spearman: 0.4257
      2024-02-06 21:13:28 - Surprise-Similarity:      Pearson: 0.3486 Spearman: 0.5562
      

As you can see, the surprise similarity seems to result in lower Spearman correlations. This suggests that embeddings compared via the surprise similarity correspond less well to the true semantic similarities. In other words, for these experiments, it does not make sense to use the surprise similarity over the cosine similarity.
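
For completeness, a minimal sketch of how this kind of comparison can be reproduced outside the evaluator (model and dataset names are simply the ones referenced elsewhere in this thread; the real EmbeddingSimilarityEvaluator additionally handles batching, output files, etc.):

```python
# Score STS test pairs with cosine similarity and report the same Pearson/Spearman
# correlations against the gold labels that the evaluator logs above. A custom metric
# such as the surprise score can be computed on emb1/emb2 and reported alongside.
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
data = load_dataset("mteb/stsbenchmark-sts", split="test")

emb1 = model.encode(data["sentence1"], convert_to_tensor=True, normalize_embeddings=True)
emb2 = model.encode(data["sentence2"], convert_to_tensor=True, normalize_embeddings=True)
cosine_scores = (emb1 * emb2).sum(dim=-1).cpu().numpy()

print("Pearson: ", pearsonr(data["score"], cosine_scores)[0])
print("Spearman:", spearmanr(data["score"], cosine_scores)[0])
```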

Please do let me know if I made a mistake with my implementation in EmbeddingSimilarityEvaluator! I think the paper is quite fascinating, and I would love it if the surprise similarity indeed somehow allows for embeddings to be compared more effectively.

(We can revert the Debugging commit if we ever choose to move forward with this PR)

cc: @mlschillo this might also interest you!

  • Tom Aarsen

@tomaarsen
Collaborator

I can also implement this on top of the BinaryClassificationEvaluator and see whether the surprise similarity score helps with classification. It could be worth a shot.

@VMinB12

VMinB12 commented Feb 7, 2024

Hi @tomaarsen ! It's great to see some interest in this PR, thank you.

I was surprised by your benchmark results and did a small extension to include surprise_dev as well. The difference between surprise_score and surprise_dev is that the former produces a score between -1 and 1 and the latter is unnormalised. This normalisation for surprise_score happens with an error function and I've observed that it asymptotes to 1 very quickly, leading to an accumulation of scores of 1.0, resulting in reduced distinguishability. With surprise_dev I get Pearson: 0.7722 Spearman: 0.7635. This is still lower than the cosine similarity though.
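
To illustrate that saturation (assuming the deviation is passed to the error function more or less directly; the exact scaling in this PR may differ):

```python
import math

# erf saturates quickly: any pair that sits a couple of standard deviations above the
# ensemble mean already maps to ~1.0, matching the observed pile-up of scores at 1.0.
for dev in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(f"dev={dev:.1f}  erf(dev)={math.erf(dev):.5f}")
# dev=0.5  erf(dev)=0.52050
# dev=1.0  erf(dev)=0.84270
# dev=2.0  erf(dev)=0.99532
# dev=3.0  erf(dev)=0.99998
# dev=4.0  erf(dev)=1.00000
```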

The fact that for surprise_dev the Pearson metric is now of the same order as the Spearman metric would indicate that the reduced Pearson value for surprise_score is due to the non-linearity of the error function. In any case, we should focus on the Spearman metric, which measures monotonic relationships.

Maybe @mlschillo can comment on whether this is expected and if it is worthwhile to do a classification type benchmark. A classification type benchmark would be closer to the experiments in the paper than the current benchmark, so if you have time @tomaarsen that would be interesting imo.

@tomaarsen
Collaborator

> Hi @tomaarsen ! It's great to see some interest in this PR, thank you.
>
> I was surprised by your benchmark results and did a small extension to include surprise_dev as well. The difference between surprise_score and surprise_dev is that the former produces a score between -1 and 1 and the latter is unnormalised. This normalisation for surprise_score happens with an error function and I've observed that it asymptotes to 1 very quickly, leading to an accumulation of scores of 1.0, resulting in reduced distinguishability.

Indeed. Something interesting that I noticed was that the mean of the surprise_score of the diagonals (i.e. the pairwise scores, the thing that we care about in this evaluator) is a very high 0.94, while the mean of all surprise_score values (i.e. all possible pairs) was a much more normal 0.57.

This is perhaps somewhat indicative of the evaluation set. See here some samples from the training set, searched for "obama": https://huggingface.co/datasets/mteb/stsbenchmark-sts/viewer/default/train?q=obama&row=5726
The "score" refers to the similarity of the two sentences on a scale from 0 to 5 (0 is dissimilar, 5 is identical). As you can see, there are pairs with very low scores that are indeed not semantically similar (Obama calls for international front against IS, Obama vows to save Iraqis stranded on mountain) but are clearly on the same topic: it's a hard negative pair. When the ensemble contains all kinds of texts, many of which are completely unrelated (e.g. Obama calls for international front against IS vs A plane is taking off.), I can see why the two semantically dissimilar Obama-related sentences get a higher surprise similarity score.

In other words, this might explain why the mean surprise score of the evaluated (diagonal) pairs is 0.94 and not something near 0.5.

In short, the surprise score might give pairs that share a subject/topic but differ in semantics a higher similarity score, which may cause poor performance on hard negative pairs.
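
A tiny sketch of this diagnostic (random numbers stand in for an actual surprise_score matrix): the evaluator only consumes the diagonal of the pairwise score matrix, so comparing the diagonal mean with the full-matrix mean shows how inflated the evaluated pairs are relative to arbitrary cross-pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for surprise_score computed over all sentence1 x sentence2 combinations.
S = rng.uniform(-1.0, 1.0, size=(100, 100))

print("mean of diagonal (evaluated pairs):", np.diag(S).mean())
print("mean over all possible pairs:      ", S.mean())
```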

> Maybe @mlschillo can comment on whether this is expected and if it is worthwhile to do a classification type benchmark. A classification type benchmark would be closer to the experiments in the paper than the current benchmark, so if you have time @tomaarsen that would be interesting imo.

I can definitely have a look at trying to run a classification benchmark when I have a bit more time.

  • Tom Aarsen

@mlschillo

Hey @VMinB12 and @tomaarsen, thanks for the interest and discussion!
After taking a quick look, I agree that surprise doesn't do well in this experiment. I just checked the zero-shot results and they agree with the overall picture: the cosine Spearman is higher.

I definitely agree with what @tomaarsen is pointing out here:

> Indeed. Something interesting that I noticed was that the mean of the surprise_score of the diagonals (i.e. the pairwise scores, the thing that we care about in this evaluator) is a very high 0.94, while the mean of all surprise_score values (i.e. all possible pairs) was a much more normal 0.57.
>
> This is perhaps somewhat indicative of the evaluation set. See here some samples from the training set, searched for "obama": https://huggingface.co/datasets/mteb/stsbenchmark-sts/viewer/default/train?q=obama&row=5726
> The "score" refers to the similarity of the two sentences on a scale from 0 to 5 (0 is dissimilar, 5 is identical). As you can see, there are pairs with very low scores that are indeed not semantically similar (Obama calls for international front against IS, Obama vows to save Iraqis stranded on mountain) but are clearly on the same topic: it's a hard negative pair. When the ensemble contains all kinds of texts, many of which are completely unrelated (e.g. Obama calls for international front against IS vs A plane is taking off.), I can see why the two semantically dissimilar Obama-related sentences get a higher surprise similarity score.
>
> In other words, this might explain why the mean surprise score of the evaluated (diagonal) pairs is 0.94 and not something near 0.5.
>
> In short, the surprise score might give pairs that share a subject/topic but differ in semantics a higher similarity score, which may cause poor performance on hard negative pairs.

Perhaps the most important feature of the surprise score is its dependence on the ensemble, but this does mean that it will fail in some cases. It could be that the construction of this benchmark uses clusters of topics (e.g. kitchen tasks, transportation related, music related, news/current events) and the topic clustering hurts the score's ability to distinguish within a cluster (as in the Obama example). This might also explain why @VMinB12 is able to do better without normalizing. But I would think this would end up being more of a feature than a bug in classification tasks, so I also think it's worth the experiment.

I also want to loop in @tbachlechner and @MCMartone in case they have more insightful comments.

@VMinB12

VMinB12 commented Feb 8, 2024

It would be great to understand how common this pitfall of the surprise score is. Perhaps this dataset is an outlier, or perhaps there is a general insight to be extracted here. Repeating @tomaarsen's current exercise on other datasets would be informative.

@VMinB12

VMinB12 commented Feb 8, 2024

I did some tests myself using all-MiniLM-L6-v2 and a random collection of datasets I pulled from Hugging Face. I'm not sure about the quality of some of these datasets, or whether this evaluation truly makes sense for them. Here are the results:

2024-02-08 23:08:25 - EmbeddingSimilarityEvaluator: Evaluating the model on Ukhushn/home-depot-test dataset:
2024-02-08 23:09:48 - Cosine-Similarity :       Pearson: 0.4157 Spearman: 0.4074
2024-02-08 23:09:48 - Manhattan-Distance:       Pearson: 0.4186 Spearman: 0.4106
2024-02-08 23:09:48 - Euclidean-Distance:       Pearson: 0.4149 Spearman: 0.4074
2024-02-08 23:09:48 - Dot-Product-Similarity:   Pearson: 0.4157 Spearman: 0.4074
2024-02-08 23:09:48 - Surprise-Similarity:      Pearson: 0.2783 Spearman: 0.3593
2024-02-08 23:09:48 - Surprise-Similarity-Dev:  Pearson: 0.3697 Spearman: 0.3592


2024-02-08 23:09:50 - EmbeddingSimilarityEvaluator: Evaluating the model on pranavkotz/anatomy_dataset-test dataset:
2024-02-08 23:09:58 - Cosine-Similarity :       Pearson: 0.9484 Spearman: 0.9297
2024-02-08 23:09:58 - Manhattan-Distance:       Pearson: 0.9430 Spearman: 0.9304
2024-02-08 23:09:58 - Euclidean-Distance:       Pearson: 0.9430 Spearman: 0.9297
2024-02-08 23:09:58 - Dot-Product-Similarity:   Pearson: 0.9484 Spearman: 0.9297
2024-02-08 23:09:58 - Surprise-Similarity:      Pearson: 0.4786 Spearman: 0.8827
2024-02-08 23:09:58 - Surprise-Similarity-Dev:  Pearson: 0.8946 Spearman: 0.8902

On both datasets the surprise score is outperformed by cosine similarity. Do you know of any other datasets that we could include here?

@tomaarsen
Collaborator

tomaarsen commented Feb 9, 2024

I've also done some more experiments (feel free to reproduce with train_sts_qqp.py), now with binary classification of whether two questions on Quora are the same. The model is trained on positive pairs exclusively, and then a binary classification evaluator is used to assess how well the model's similarity scores can be used to classify. In particular, it computes the similarity scores for all pairs and then determines the similarity score threshold at which to classify.
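
For context, here is a hedged sketch of the kind of threshold sweep such an evaluator performs for a similarity score where higher means more similar (the actual BinaryClassificationEvaluator also handles distance metrics, F1 and average precision; names here are illustrative):

```python
import numpy as np

def best_accuracy_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Try every observed score as a cut-off and keep the one with the best accuracy."""
    best_acc, best_thr = 0.0, 0.0
    for thr in np.unique(scores):
        acc = float(((scores >= thr) == labels.astype(bool)).mean())
        if acc > best_acc:
            best_acc, best_thr = acc, float(thr)
    return best_acc, best_thr

# If most surprise scores pile up at ~1.0, the sweep is effectively forced to pick 1.0.
scores = np.array([0.99, 1.00, 1.00, 1.00, 0.97, 1.00])
labels = np.array([0, 1, 1, 0, 0, 1])
print(best_accuracy_threshold(scores, labels))  # -> (0.833..., 1.0)
```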

Here are the results on the test set:

2024-02-09 08:53:49 - Read QQP test dataset
2024-02-09 08:53:49 - Binary Accuracy Evaluation of the model on qqp-test dataset:
2024-02-09 08:55:04 - Accuracy with Cosine-Similarity:           74.31  (Threshold: 0.7713)
2024-02-09 08:55:04 - F1 with Cosine-Similarity:                 71.78  (Threshold: 0.7187)
2024-02-09 08:55:04 - Precision with Cosine-Similarity:          61.46
2024-02-09 08:55:04 - Recall with Cosine-Similarity:             86.28
2024-02-09 08:55:04 - Average Precision with Cosine-Similarity:  71.29

2024-02-09 08:55:04 - Accuracy with Manhattan-Distance:           74.75 (Threshold: 193.8427)
2024-02-09 08:55:04 - F1 with Manhattan-Distance:                 72.08 (Threshold: 223.2913)
2024-02-09 08:55:04 - Precision with Manhattan-Distance:          60.27
2024-02-09 08:55:04 - Recall with Manhattan-Distance:             89.63
2024-02-09 08:55:04 - Average Precision with Manhattan-Distance:  71.24

2024-02-09 08:55:04 - Accuracy with Euclidean-Distance:           74.82 (Threshold: 8.8388)
2024-02-09 08:55:04 - F1 with Euclidean-Distance:                 72.11 (Threshold: 9.8055)
2024-02-09 08:55:04 - Precision with Euclidean-Distance:          61.53
2024-02-09 08:55:04 - Recall with Euclidean-Distance:             87.09
2024-02-09 08:55:04 - Average Precision with Euclidean-Distance:  71.22

2024-02-09 08:55:04 - Accuracy with Dot-Product:           71.28        (Threshold: 136.5604)
2024-02-09 08:55:04 - F1 with Dot-Product:                 68.41        (Threshold: 109.5986)
2024-02-09 08:55:04 - Precision with Dot-Product:          55.95
2024-02-09 08:55:04 - Recall with Dot-Product:             88.01
2024-02-09 08:55:04 - Average Precision with Dot-Product:  65.80

2024-02-09 08:55:04 - Accuracy with Surprise-Similarity:           61.62        (Threshold: 1.0000)
2024-02-09 08:55:04 - F1 with Surprise-Similarity:                 66.48        (Threshold: 1.0000)
2024-02-09 08:55:04 - Precision with Surprise-Similarity:          50.40
2024-02-09 08:55:04 - Recall with Surprise-Similarity:             97.61
2024-02-09 08:55:04 - Average Precision with Surprise-Similarity:  50.34

2024-02-09 08:55:04 - Accuracy with Surprise-Similarity-Dev:           70.71    (Threshold: 7.6938)
2024-02-09 08:55:04 - F1 with Surprise-Similarity-Dev:                 68.91    (Threshold: 6.5085)
2024-02-09 08:55:04 - Precision with Surprise-Similarity-Dev:          55.44
2024-02-09 08:55:04 - Recall with Surprise-Similarity-Dev:             91.03
2024-02-09 08:55:04 - Average Precision with Surprise-Similarity-Dev:  62.24

There are a few things to note here:

  1. The chosen threshold for Surprise-Similarity is 1 (or at least, it rounds to 1.0000), likely because so many of the scores were nearly exactly 1. This indicates that the similarity score is too eager to assign very high similarity scores, even when lower scores would be appropriate. For comparison, the cosine similarity threshold for accuracy was 0.7713. Because of this, the Surprise-Similarity-Dev accuracy is much higher, likely because its threshold can be set more effectively.
  2. The accuracy/F1 for the surprise(-dev) scores is lower than for any of the other four similarity metrics.

In this experiment, the surprise score is also not a good option, I'm afraid.

Edit:
Not providing an explicit ensemble performs even worse:

2024-02-09 09:16:12 - Accuracy with Surprise-Similarity:           60.98        (Threshold: 1.0000)
2024-02-09 09:16:12 - F1 with Surprise-Similarity:                 64.98        (Threshold: 1.0000)
2024-02-09 09:16:12 - Precision with Surprise-Similarity:          49.53
2024-02-09 09:16:12 - Recall with Surprise-Similarity:             94.43
2024-02-09 09:16:12 - Average Precision with Surprise-Similarity:  49.75

2024-02-09 09:16:12 - Accuracy with Surprise-Similarity-Dev:           68.61    (Threshold: 8.2486)
2024-02-09 09:16:12 - F1 with Surprise-Similarity-Dev:                 64.91    (Threshold: 5.2564)
2024-02-09 09:16:12 - Precision with Surprise-Similarity-Dev:          49.71
2024-02-09 09:16:12 - Recall with Surprise-Similarity-Dev:             93.52
2024-02-09 09:16:12 - Average Precision with Surprise-Similarity-Dev:  60.83

  • Tom Aarsen
