diff --git a/README.md b/README.md
index e0298fb..141a29e 100644
--- a/README.md
+++ b/README.md
@@ -17,32 +17,32 @@ checks. For more details, refer to our publicly available article.
 > This public version of our model uses the best model trained (where in our article, we present the performance results
 > of an average of 10 models) for a more extended period (500 epochs instead of 250). We have observed later that the
-> model can further reduce dev loss and increase performance. Also, we have changed the data augmentation technique use
-> in the article for a more robust one.
+> model can further reduce dev loss and increase performance. Also, we have changed the data augmentation technique used
+> in the article for a more robust one, which also includes the commutative property of the meaning function. Namely, Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a).

 - [HuggingFace Model Card](https://huggingface.co/davebulaval/MeaningBERT)

 ## Sanity Check

 Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric.
-However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires
+However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive since it requires
 a large dataset annotated by several humans.
 As an alternative, we designed two automated tests: evaluating meaning preservation between identical sentences (which
 should be 100% preserving) and between unrelated sentences (which should be 0% preserving).

 In these tests, the meaning preservation target value is not subjective and does not require human annotation to
-measure. They represent a trivial and minimal threshold a good automatic meaning preservation metric should be able to
+be measured. They represent a trivial and minimal threshold a good automatic meaning preservation metric should be able to
 achieve. Namely, a metric should be minimally able to return a perfect score (i.e., 100%) if two identical sentences are
 compared and return a null score (i.e., 0%) if two sentences are completely unrelated.

-### Identical sentences
+### Identical Sentences

 The first test evaluates meaning preservation between identical sentences. To analyze the metrics' capabilities to pass
 this test, we count the number of times a metric rating was greater or equal to a threshold value X∈[95, 99] and divide
 it by the number of sentences to create a ratio of the number of times the metric gives the expected rating. To account
 for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of
 100%.

-### Unrelated sentences
+### Unrelated Sentences

 Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large
 language model.3 The idea is to verify that the metric finds a meaning preservation rating of 0 when given a completely
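
The commutative-property augmentation mentioned in the updated note can be pictured with a small sketch. This is only an illustration, not the repository's actual code: the `augment_with_commutativity` helper and the `(sent_a, sent_b, rating)` tuple layout are assumptions.

```python
# Illustrative sketch: because Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a),
# every annotated pair can also be presented to the model with its sentences
# swapped while keeping the same rating, roughly doubling the training data.
def augment_with_commutativity(pairs):
    """`pairs` is assumed to be a list of (sent_a, sent_b, rating) tuples."""
    augmented = list(pairs)
    for sent_a, sent_b, rating in pairs:
        augmented.append((sent_b, sent_a, rating))
    return augmented
```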
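Similarly, the identical-sentences check described in the hunk boils down to a simple ratio. The sketch below assumes `ratings` holds one 0-100 score per sentence, obtained by comparing each sentence with itself; the function name is illustrative.

```python
def identical_sentences_pass_ratio(ratings, threshold=95):
    """Fraction of self-comparisons whose rounded rating reaches the threshold.

    Ratings are rounded to the nearest integer to absorb floating-point
    inaccuracy, which is also why the threshold X stays in [95, 99] rather
    than being set to 100.
    """
    passed = sum(1 for rating in ratings if round(rating) >= threshold)
    return passed / len(ratings)
```

The unrelated-sentences test would presumably mirror this with a rating-close-to-zero condition, but the exact threshold used there lies outside this hunk.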