
Commit e1ed220
Update README.md
davebulaval authored Mar 23, 2024
1 parent 16ccf47
Showing 1 changed file with 7 additions and 7 deletions: README.md
…checks. For more details, refer to our publicly available article.

> This public version of our model uses the best model trained (in our article, we present performance results
> averaged over 10 models) for a more extended period (500 epochs instead of 250). We later observed that the
> model can further reduce dev loss and increase performance. We have also replaced the data augmentation technique
> used in the article with a more robust one that also enforces the commutative property of the meaning function,
> namely Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a).
- [HuggingFace Model Card](https://huggingface.co/davebulaval/MeaningBERT)
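
As a rough illustration of how such a commutativity-aware augmentation can be set up, here is a minimal sketch in
Python. It is not the authors' implementation; the `Pair` type, the function name, and the skipping of identical
pairs are our assumptions:

```python
# Minimal sketch (not the authors' code): augment a scored sentence-pair dataset so
# that every (sent_a, sent_b, score) example also appears as (sent_b, sent_a, score),
# nudging the learned metric toward Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a).
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (sentence_a, sentence_b, meaning-preservation score)

def augment_with_swapped_pairs(pairs: List[Pair]) -> List[Pair]:
    augmented = list(pairs)
    for sent_a, sent_b, score in pairs:
        if sent_a != sent_b:  # swapping an identical pair adds no new information
            augmented.append((sent_b, sent_a, score))
    return augmented
```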

## Sanity Check

Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric.
However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it
requires a large dataset annotated by several humans. As an alternative, we designed two automated tests: evaluating
meaning preservation between identical sentences (which should be 100% preserving) and between unrelated sentences
(which should be 0% preserving). In these tests, the meaning preservation target value is not subjective and does
not require human annotation to be measured. They represent a trivial and minimal threshold a good automatic meaning
preservation metric should be able to achieve. Namely, a metric should minimally be able to return a perfect score
(i.e., 100%) if two identical sentences are compared and return a null score (i.e., 0%) if two sentences are
completely unrelated.
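
Both checks can be run directly against the released checkpoint. The following is a minimal sketch, assuming the
HuggingFace checkpoint loads as a sequence-classification (regression) model whose single logit is a
meaning-preservation score on a 0-100 scale; see the model card for the exact interface. The example sentences
are ours:

```python
# Minimal sanity-check sketch: score an identical pair and an unrelated pair.
# Assumption: the checkpoint is a single-logit regression model on a 0-100 scale.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
model = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
model.eval()

source = "The cat sat quietly on the warm windowsill."
identical = "The cat sat quietly on the warm windowsill."
unrelated = "Stock markets closed higher on Friday after the jobs report."

with torch.no_grad():
    inputs = tokenizer([source, source], [identical, unrelated],
                       return_tensors="pt", padding=True, truncation=True)
    scores = model(**inputs).logits.squeeze(-1)

print(f"identical pair: {scores[0].item():.1f}")  # expected near 100
print(f"unrelated pair: {scores[1].item():.1f}")  # expected near 0
```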

### Identical Sentences

The first test evaluates meaning preservation between identical sentences. To analyze the metrics' capabilities to
pass this test, we count the number of times a metric rating was greater than or equal to a threshold value
X∈[95, 99] and divide it by the number of sentences to create a ratio of the number of times the metric gives the
expected rating. To account for computer floating-point inaccuracy, we round the ratings to the nearest integer and
do not use a threshold value of 100%.
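
In code, the ratio described above amounts to the following (a minimal sketch; the example scores are hypothetical):

```python
# Sketch of the identical-sentence check: round each rating to the nearest integer,
# then report the fraction of sentences whose rating reaches the threshold X.
# A threshold of 100% is deliberately avoided because of floating-point inaccuracy.

def pass_ratio(ratings: list[float], threshold: int) -> float:
    rounded = [round(r) for r in ratings]
    return sum(r >= threshold for r in rounded) / len(rounded)

identical_ratings = [99.6, 100.0, 97.2, 94.4]  # hypothetical metric outputs
for x in range(95, 100):                       # X ∈ [95, 99]
    print(f"X={x}: {pass_ratio(identical_ratings, x):.2f}")
```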

### Unrelated Sentences

Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a
large language model. The idea is to verify that the metric finds a meaning preservation rating of 0 when given a
completely unrelated sentence.
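
This check can mirror the same ratio logic at the low end of the scale. In the sketch below, the near-zero
threshold and the scores are our assumptions for illustration, not values from the article:

```python
# Sketch of the unrelated-sentence check: a good metric should rate unrelated pairs
# near 0, so we count ratings at or below a small threshold after rounding.

def near_zero_ratio(ratings: list[float], threshold: int) -> float:
    rounded = [round(r) for r in ratings]
    return sum(r <= threshold for r in rounded) / len(rounded)

unrelated_ratings = [0.3, 1.8, 0.0, 4.9]      # hypothetical metric outputs
print(near_zero_ratio(unrelated_ratings, 5))  # fraction rated as expected (<= 5)
```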