🐛 Bug
I would expect BERTScore to have the following properties:
Given a single list of sentences, with all pairs compared as preds and targets, BERTScore should be maximal when the same sentence is given as pred[i] and target[i].
The F1 score should be the same when the pred and the target are swapped (see the short sketch below this list).
With idf=False, extending the list of preds and the list of targets should not change the scores for the inputs that were already there.
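On the second property: BERTScore's F1 is the harmonic mean of its precision and recall, and swapping pred and target simply swaps which text plays the precision role and which the recall role, so (at least with idf=False and no baseline rescaling) the F1 value should not change. A minimal sketch with made-up precision/recall values:

from math import isclose

def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall, as used for the BERTScore F1.
    return 2 * p * r / (p + r)

# Hypothetical precision/recall for some (pred, target) pair; swapping the
# pair swaps p and r, and the F1 value is unchanged.
p, r = 0.91, 0.97
assert isclose(f1(p, r), f1(r, p))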
There are counterexamples for all of the properties above.
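As a quick standalone illustration of one such counterexample (a minimal sketch using the functional bert_score API; the exact numbers depend on the default model that lang="en" selects):

from torchmetrics.functional.text import bert_score

# The same sentence pair in both orders; with idf=False and no baseline
# rescaling the two F1 values should coincide, but they do not
# (roughly 0.966 vs 0.996 in the test runs reported below).
preds = ["hello there", "general kenobi"]
targets = ["general kenobi", "hello there"]
score = bert_score(preds, targets, lang="en", idf=False, rescale_with_baseline=False)
print(score["f1"])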
To Reproduce
Steps to reproduce the behavior: run the test suite with the following tests added to test_bertscore.py.
Proposed test suite
# bert_score, pytest, skip_on_connection_issues and _TRANSFORMERS_GREATER_EQUAL_4_4
# are assumed to be available from the existing imports in test_bertscore.py;
# product may need to be imported from itertools.
from itertools import product


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1), (False, 9), (True, 1), (True, 9)],
)
def test_bertscore_most_similar(idf: bool, batch_size: int):
    """Tests that BERTScore actually gives the highest score to self-similarity."""
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    sentences = [short, long, longer]
    # All 9 ordered pairs: preds[i] = sentences[i // 3], targets[i] = sentences[i % 3].
    preds, targets = list(zip(*list(product(sentences, sentences))))

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)

    for i in range(len(preds)):
        # Indices of the diagonal (self-similarity) pairs for targets[i] and preds[i].
        max_pred = i % (len(sentences)) * (1 + len(sentences))
        max_target = int(i / (len(sentences))) * (1 + len(sentences))
        assert score["f1"][i] <= score["f1"][max_pred], (
            f"pair: {preds[i], targets[i]} does not have a lower score than "
            f"{preds[max_pred], targets[max_pred]}\n{i=} {max_pred=}"
        )
        assert score["f1"][i] <= score["f1"][max_target], (
            f"pair: {preds[i], targets[i]} does not have a lower score than "
            f"{preds[max_target], targets[max_target]}\n{i=} {max_target=}"
        )


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1), (False, 9), (True, 1), (True, 9)],
)
def test_bertscore_symmetry(idf: bool, batch_size: int):
    """Tests that the BERTScore F1 score is symmetric between prediction and target, as F1 itself is symmetric."""
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences, sentences))))

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)

    for i in range(len(preds)):
        for j in range(len(targets)):
            # (preds[j], targets[j]) is the mirror image of (preds[i], targets[i]).
            if preds[i] == targets[j] and preds[j] == targets[i]:
                assert score["f1"][i] == pytest.approx(score["f1"][j]), (
                    f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
                )


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(["idf", "batch_size"], [(False, 1), (False, 3)])
def test_bertscore_additional_sentence(idf: bool, batch_size: int):
    """Tests that adding elements to the input lists does not change the scores of earlier pairs.

    This should be the case for idf=False.
    """
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    preds = [long, long]
    targets = [long, short]

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)
    longlong = score["f1"][0]
    longshort = score["f1"][1]
    # First index should be the self-comparison - sorting by length should not shuffle this.
    assert longlong > longshort

    preds = preds + [short, longer]
    targets = targets + [longer, long]
    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)
    # First two indices should be exactly as in the previous call to the metric.
    assert score["f1"][0] == pytest.approx(longlong)
    assert score["f1"][1] == pytest.approx(longshort)
    # Indices 1 and 2 should also be smaller than the self-comparison.
    assert score["f1"][0] > score["f1"][1]
    assert score["f1"][0] > score["f1"][2]
Test results
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] FAILED        [ 10%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] FAILED        [ 20%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] FAILED         [ 30%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] FAILED         [ 40%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] FAILED            [ 50%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] FAILED            [ 60%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] FAILED             [ 70%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] FAILED             [ 80%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] FAILED [ 90%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] FAILED [100%]

=================================================== FAILURES ===================================================
(full tracebacks trimmed; each repeats the corresponding test body from the proposed suite above, so only the assertion errors are shown)

test_bertscore_most_similar[False-1] and [False-9]:
E   AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E   i=5 max_target=4
E   assert tensor(0.9961) <= tensor(0.9664)
unittests/text/test_bertscore.py:220: AssertionError

test_bertscore_most_similar[True-1] and [True-9]:
E   AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E   i=5 max_target=4
E   assert tensor(0.9942) <= tensor(0.9674)
unittests/text/test_bertscore.py:220: AssertionError

test_bertscore_symmetry[False-1] and [False-9]:
E   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E   assert tensor(0.9663) == approx(0.9960544109344482 ± 1.0e-06)
E   Max absolute difference: 0.02979564666748047
unittests/text/test_bertscore.py:250: AssertionError

test_bertscore_symmetry[True-1] and [True-9]:
E   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E   assert tensor(0.9672) == approx(0.994210958480835 ± 9.9e-07)
E   Max absolute difference: 0.027047932147979736
unittests/text/test_bertscore.py:250: AssertionError

test_bertscore_additional_sentence[False-1]:
E   assert tensor(1.0000) == approx(0.9663828611373901 ± 9.7e-07)
E   Max absolute difference: 0.03361696004867554
unittests/text/test_bertscore.py:289: AssertionError

test_bertscore_additional_sentence[False-3]:
E   assert tensor(1.0000) == approx(0.9663827419281006 ± 9.7e-07)
E   Max absolute difference: 0.03361707925796509
unittests/text/test_bertscore.py:289: AssertionError

============================================ short test summary info ===========================================
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] - assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] - assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
================================= 10 failed, 33571 deselected, 71 warnings in 14.56s =================================
Expected behavior
All tests above should pass.
Environment
TorchMetrics version (if built from source, add commit SHA): 2cd6f6a
Python & PyTorch Version (e.g., 1.0): 3.12.4 & 2.4.1+cu121
Any other relevant information such as OS (e.g., Linux): Linux
Additional context
Maybe this is somehow related to tokenisation or to the encoding, but I have not confirmed that. Arguing against this hypothesis is the fact that the failures still happen with batch_size=1.
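One way to probe that hypothesis would be to compare the tokenisations directly; a sketch, assuming roberta-large is the model used for lang="en":

from transformers import AutoTokenizer

# Inspect how the (assumed) default English model tokenises the three
# sentences used in the tests; if the token sequences look as expected,
# tokenisation alone does not explain the asymmetric scores.
tok = AutoTokenizer.from_pretrained("roberta-large")
for s in ["hello there", "master kenobi", "general kenobi"]:
    print(s, "->", tok.tokenize(s))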
This seems related to PR #2347. Perhaps the sorting is still being done incorrectly?
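To make the suspicion concrete, here is a hypothetical failure mode (an illustration only, not verified against the torchmetrics internals): if preds and targets were each sorted by length for batching and not restored with the same permutation, the pred/target pairing would silently change:

# Hypothetical illustration only -- not the actual torchmetrics code.
preds = ["hello there", "master kenobi"]
targets = ["general kenobi", "hi"]

# Sorting each list independently by length changes which target a given
# pred is compared against.
sorted_preds = sorted(preds, key=len)      # ['hello there', 'master kenobi']
sorted_targets = sorted(targets, key=len)  # ['hi', 'general kenobi']

original_pairs = list(zip(preds, targets))
shuffled_pairs = list(zip(sorted_preds, sorted_targets))
# ('hello there', 'general kenobi') has become ('hello there', 'hi').
assert original_pairs[0] != shuffled_pairs[0]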
I have also checked that some of these fail on the referenced original implementation of BERTScore as well. I have considered whether these properties are simply not expected to hold, but I have found nothing in either the paper or the documentation suggesting so, at least when idf=False and there is no baseline correction.
I am happy to submit a PR with the above tests, which currently all fail.