BERT score: maximum at self-comparison, symmetry, invariance to additional items #2728

GPPassos opened this issue Sep 9, 2024 · 2 comments
GPPassos commented Sep 9, 2024

🐛 Bug

I would expect BERTScore to have the following properties:

  1. Given a single list of sentences and comparing all pairs as preds and targets, BERTScore should be maximal when the same sentence appears as both pred[i] and target[i] (self-comparison).
  2. The F1 score should be unchanged when the preds and the targets are swapped (see the sketch after this list).
  3. With idf=False, appending items to the preds and targets lists should not change the scores of the pairs that were already there.

There are counterexamples to all three properties.
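
For reference on property 2: in the paper's definition (with idf=False and no baseline rescaling), precision averages, for each predicted token, its best cosine similarity against the target tokens, recall does the same in the other direction, and F1 = 2PR/(P+R). Swapping preds and targets swaps P and R, so F1 should be unchanged by construction. A minimal sketch of that computation on a toy similarity matrix (my own illustration, not the library's code):

import torch

def f1_from_similarity(sim: torch.Tensor) -> torch.Tensor:
    """sim[i, j] = cosine similarity between pred token i and target token j."""
    precision = sim.max(dim=1).values.mean()  # best match for every pred token
    recall = sim.max(dim=0).values.mean()     # best match for every target token
    return 2 * precision * recall / (precision + recall)

sim = torch.rand(4, 7)  # toy pred-vs-target token similarities
# Swapping preds and targets transposes the matrix, which swaps P and R,
# so F1 is identical by construction:
assert torch.allclose(f1_from_similarity(sim), f1_from_similarity(sim.T))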

To Reproduce

Run the test suite with the following tests added to test_bertscore.py.
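The new tests can also be run in isolation with pytest's -k filter, e.g.:

pytest unittests/text/test_bertscore.py -k "most_similar or symmetry or additional_sentence"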

Proposed test suite
from itertools import product  # in addition to the imports already present in test_bertscore.py


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 9),
     (True, 1),
     (True, 9)],
)
def test_bertscore_most_similar(idf: bool, batch_size: int):
    """Tests that BERTScore actually gives the highest score to self-similarity."""
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    
    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences,
                                            sentences))))
    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)
    for i in range(len(preds)):
        # indices of the two self-comparison pairs that share a sentence with pair i
        max_pred = i%(len(sentences))*(1 + len(sentences))  # index of (targets[i], targets[i])
        max_target = int(i/(len(sentences)))*(1 + len(sentences))  # index of (preds[i], preds[i])
        assert score["f1"][i] <= score["f1"][max_pred], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
        assert score["f1"][i] <= score["f1"][max_target], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"



@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 9),
     (True, 1),
     (True, 9)],
)
def test_bertscore_symmetry(idf: bool, batch_size: int):
    """Tests that BERTscore F1 score is symmetric between reference and prediction.
    As F1 is symmetric, it should also be symmetric."""

    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"

    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences,
                                            sentences))))
    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)
    for i in range(len(preds)):
        for j in range(len(targets)):
            if preds[i] == targets[j] and preds[j] == targets[i]:
                assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                    f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."

        
@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 3)]
)
def test_bertscore_additional_sentence(idf: bool, batch_size: int):
    """Tests that BERTscore keeps the same scores for previous inputs
    by adding additional elements to the input lists. This should be the case for idf=False."""

    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"

    preds = [long,long]
    targets = [long,short]

    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)

    longlong = score["f1"][0]
    longshort = score["f1"][1]
    # First index should be the self-comparison - sorting by length should not shuffle this
    assert longlong > longshort
    
    preds = preds + [short, longer]
    targets = targets + [longer, long]

    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)

    # First two indices should be exactly as in the previous call to the metric
    assert score["f1"][0] == pytest.approx(longlong)
    assert score["f1"][1] == pytest.approx(longshort)
    # Indices 1 and 2 should also score lower than the self-comparison at index 0.
    assert score["f1"][0] > score["f1"][1]
    assert score["f1"][0] > score["f1"][2]
Test results
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] FAILED                                                        [ 10%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] FAILED                                                        [ 20%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] FAILED                                                         [ 30%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] FAILED                                                         [ 40%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] FAILED                                                            [ 50%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] FAILED                                                            [ 60%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] FAILED                                                             [ 70%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] FAILED                                                             [ 80%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] FAILED                                                 [ 90%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] FAILED                                                 [100%]

================================================================= FAILURES =================================================================
___________________________________________________ test_bertscore_most_similar[False-1] ___________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9961) <= tensor(0.9664)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[False-9] ___________________________________________________

idf = False, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9961) <= tensor(0.9664)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[True-1] ____________________________________________________

idf = True, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9942) <= tensor(0.9674)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[True-9] ____________________________________________________

idf = True, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9942) <= tensor(0.9674)

unittests/text/test_bertscore.py:220: AssertionError
_____________________________________________________ test_bertscore_symmetry[False-1] _____________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.02979564666748047
E                     Max relative difference: 0.03083609460462107
E                     Index | Obtained   | Expected                    
E                     ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[False-9] _____________________________________________________

idf = False, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.02979564666748047
E                     Max relative difference: 0.03083609460462107
E                     Index | Obtained   | Expected                    
E                     ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[True-1] ______________________________________________________

idf = True, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.027047932147979736
E                     Max relative difference: 0.02796625947389248
E                     Index | Obtained | Expected                   
E                     ()    | 0.967163 | 0.994210958480835 ± 9.9e-07

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[True-9] ______________________________________________________

idf = True, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.027047932147979736
E                     Max relative difference: 0.02796625947389248
E                     Index | Obtained | Expected                   
E                     ()    | 0.967163 | 0.994210958480835 ± 9.9e-07

unittests/text/test_bertscore.py:250: AssertionError
_______________________________________________ test_bertscore_additional_sentence[False-1] ________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 3)]
    )
    def test_bertscore_additional_sentence(idf: bool, batch_size: int):
        """Tests that BERTscore keeps the same scores for previous inputs
        by adding additional elements to the input lists. This should be the case for idf=False."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        preds = [long,long]
        targets = [long,short]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        longlong = score["f1"][0]
        longshort = score["f1"][1]
        # First index should be the self-comparison - sorting by length should not shuffle this
        assert longlong > longshort
    
        preds = preds + [short, longer]
        targets = targets + [longer, long]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        # First two indices should be exactly as in the previous call to metric
        assert score["f1"][0] == pytest.approx(longlong)
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
E         
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361696004867554
E         Max relative difference: 0.0336169660598575
E         Index | Obtained  | Expected                    
E         ()    | 0.9999998 | 0.9663828611373901 ± 9.7e-07

unittests/text/test_bertscore.py:289: AssertionError
_______________________________________________ test_bertscore_additional_sentence[False-3] ________________________________________________

idf = False, batch_size = 3

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 3)]
    )
    def test_bertscore_additional_sentence(idf: bool, batch_size: int):
        """Tests that BERTscore keeps the same scores for previous inputs
        by adding additional elements to the input lists. This should be the case for idf=False."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        preds = [long,long]
        targets = [long,short]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        longlong = score["f1"][0]
        longshort = score["f1"][1]
        # First index should be the self-comparison - sorting by length should not shuffle this
        assert longlong > longshort
    
        preds = preds + [short, longer]
        targets = targets + [longer, long]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        # First two indices should be exactly as in the previous call to metric
        assert score["f1"][0] == pytest.approx(longlong)
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
E         
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361707925796509
E         Max relative difference: 0.033617085269168366
E         Index | Obtained  | Expected                    
E         ()    | 0.9999998 | 0.9663827419281006 ± 9.7e-07

unittests/text/test_bertscore.py:289: AssertionError
========================================================= short test summary info ==========================================================
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] - assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] - assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
============================================ 10 failed, 33571 deselected, 71 warnings in 14.56s ============================================

Expected behavior

All tests above should pass.

Environment

  • TorchMetrics version (if built from source, add commit SHA): 2cd6f6a
  • Python & PyTorch Version (e.g., 1.0): 3.12.4 & 2.4.1+cu121
  • Any other relevant information such as OS (e.g., Linux): Linux

Additional context

This may be related to tokenisation or encoding, but I have not confirmed that; the fact that it still happens with batch_size=1 argues against that hypothesis.

This seems related to PR #2347. Perhaps the sorting is still done incorrectly?
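
If the length-sorting is the culprit, one way to probe it (a rough sketch using the functional bert_score, separate from the tests above) is to permute the input pairs and check whether the per-pair scores follow the permutation:

from torchmetrics.functional.text import bert_score  # assumed import path

preds = ["hello there", "master kenobi", "general kenobi"]
targets = ["general kenobi", "hello there", "master kenobi"]
perm = [2, 0, 1]

score_a = bert_score(preds, targets, lang="en", idf=False, rescale_with_baseline=False)
score_b = bert_score([preds[i] for i in perm], [targets[i] for i in perm],
                     lang="en", idf=False, rescale_with_baseline=False)
for new_pos, old_pos in enumerate(perm):
    # pair old_pos in the first call is the same pair as new_pos in the permuted call
    print(float(score_a["f1"][old_pos]), float(score_b["f1"][new_pos]))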

I have also checked that some of these tests fail on the original BERTScore implementation as well. I have considered whether these properties are simply not expected to hold, but I found nothing in the paper or the documentation suggesting that, at least with idf=False and no baseline correction.
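
For anyone who wants to reproduce that cross-check, a sketch against the upstream package's documented score() API (https://github.com/Tiiiger/bert_score) would look roughly like:

from bert_score import score

preds = ["hello there", "general kenobi"]
targets = ["general kenobi", "hello there"]
P, R, F = score(preds, targets, lang="en", idf=False, rescale_with_baseline=False)
print(F)  # with a symmetric F1, the two entries would match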

I am happy to submit a PR with the above tests, which currently all fail.

@GPPassos added the bug / fix and help wanted labels on Sep 9, 2024

github-actions bot commented Sep 9, 2024

Hi! thanks for your contribution!, great first issue!

@SkafteNicki
Member

cc: @stancld opinions on this?

Labels: bug / fix, help wanted, v1.4.x