
Inconsistent Token Alignment Between Fast and Slow Tokenizers in spacy-transformers #13730

Rassibassi opened this issue Jan 15, 2025 · 0 comments

Description

When using spacy-transformers with various HuggingFace models, I've discovered inconsistencies in the token alignment data (doc._.trf_data.align) between fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations

  1. DeBERTa Models: The alignment IDs are duplicated when using the fast tokenizer (see the condensed sketch after this list):

    • Fast tokenizer produces pairs of duplicate IDs: (1, 2, 2, 3, 3, 4, 4, ...)
    • Slow tokenizer produces sequential IDs: (1, 2, 3, 4, ...)
  2. RoBERTa Models: Show minor differences in alignment:

    • Both align_data and align_lengths differ between the fast/slow implementations
    • Fast tokenizer align lengths: (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
    • Slow tokenizer align lengths: (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)
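
For quick inspection, here is a condensed variant of the full reproduction script further below, limited to a single DeBERTa model and the align attribute. This is only a minimal sketch built from the same config (shorter text, model_max_length omitted); doc._.trf_data.align is the ragged token-to-wordpiece mapping whose data and lengths are compared throughout this report.

import spacy

# Minimal sketch: print doc._.trf_data.align for one DeBERTa model,
# toggling use_fast (condensed from the full script below).
text = "Copenhagen is the capital and most populous city of Denmark."

for use_fast in (True, False):
    nlp = spacy.blank("en")
    nlp.add_pipe(
        "transformer",
        config={
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": "microsoft/deberta-v3-xsmall",
                "tokenizer_config": {"use_fast": use_fast},
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        },
    )
    nlp.initialize()
    doc = nlp(text)
    align = doc._.trf_data.align  # token -> wordpiece index mapping
    print(use_fast, align.data.flatten().tolist())
    print(use_fast, align.lengths.tolist())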

Verification

  • The issue appears to be specific to spacy-transformers: direct usage of the HuggingFace tokenizers shows no such discrepancies (see the condensed check after this list)
  • The differences affect both the alignment data and the alignment lengths
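
The condensed check below isolates the second half of the full script: the tokenizers are loaded directly with transformers' AutoTokenizer and the fast/slow encodings of the same text are compared, with spaCy out of the loop. This is a sketch of the sanity check only, using one of the affected models; it is not a fix.

from transformers import AutoTokenizer

# Condensed check: outside spacy-transformers, the fast and slow tokenizers
# produce the same input_ids and attention_mask for this text.
text = "Copenhagen is the capital and most populous city of Denmark."

encodings = []
for use_fast in (True, False):
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/deberta-v3-xsmall", use_fast=use_fast
    )
    enc = tokenizer(text)
    encodings.append((tuple(enc["input_ids"]), tuple(enc["attention_mask"])))

assert encodings[0] == encodings[1], "fast and slow tokenizers disagree"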

Reproduction Steps

  1. Run the script below, which tests multiple models with both the fast and slow tokenizer implementations
  2. Compare the alignment data and lengths between the fast/slow tokenizer variants
  3. Note the systematic duplication for the DeBERTa models and the alignment shifts for the RoBERTa models

How to reproduce the behaviour

Run the following script:

import warnings

warnings.simplefilter("ignore")

import spacy
from rich import print
from transformers import AutoTokenizer

MODELS = [
    "distilroberta-base",
    # "roberta-base",
    # "intfloat/e5-small-v2",
    "BAAI/bge-small-en-v1.5",
    "microsoft/deberta-v3-xsmall",
    # "microsoft/deberta-v3-small",
    "microsoft/Multilingual-MiniLM-L12-H384",
    # "microsoft/deberta-v3-large",
]

model_max_length = 1024

text = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

for model_name in MODELS:
    wordpieces_strings = []
    wordpieces_input_ids = []
    wordpieces_attention_mask = []

    model_output_last_hidden_state = []
    align_data = []
    align_lengths = []

    print(f"[bold blue]Model: {model_name}[/bold blue]")
    # Run the same text through spacy-transformers with the fast and slow tokenizers
    for use_fast in [True, False]:
        nlp = spacy.blank("en")
        config = {
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": model_name,
                "tokenizer_config": {
                    "use_fast": use_fast,
                    "model_max_length": model_max_length,
                },
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        }
        nlp.add_pipe("transformer", config=config)
        nlp.initialize()

        # tokenizer = nlp.get_pipe("transformer").model.tokenizer
        # print(f"[bold blue]Tokenizer: {type(tokenizer)}[/bold blue]")

        doc = nlp(text)

        # Record wordpieces, model output shape, and alignment for later comparison
        wordpieces_strings.append(doc._.trf_data.wordpieces.strings[0])
        wordpieces_input_ids.append(
            tuple(doc._.trf_data.wordpieces.input_ids[0].tolist())
        )
        wordpieces_attention_mask.append(
            tuple(doc._.trf_data.wordpieces.attention_mask[0].tolist())
        )

        model_output_last_hidden_state.append(
            doc._.trf_data.model_output["last_hidden_state"].squeeze(0).shape
        )
        align_data.append(tuple(doc._.trf_data.align.data.flatten().tolist()))
        align_lengths.append(tuple(doc._.trf_data.align.lengths.tolist()))

    if wordpieces_strings[0] != wordpieces_strings[1]:
        print("[red]Different wordpieces_strings[/red]")

    if wordpieces_input_ids[0] != wordpieces_input_ids[1]:
        print("[red]Different wordpieces_input_ids[/red]")

    if wordpieces_attention_mask[0] != wordpieces_attention_mask[1]:
        print("[red]Different wordpieces_attention_mask[/red]")

    if model_output_last_hidden_state[0] != model_output_last_hidden_state[1]:
        print("[red]Different model_output_last_hidden_state[/red]")

    if align_data[0] != align_data[1]:
        print(align_data[0])
        print(align_data[1])
        print("[red]Different align_data[/red]")

    if align_lengths[0] != align_lengths[1]:
        print(align_lengths[0])
        print(align_lengths[1])
        print("[red]Different align_lengths[/red]")

    print()

print("[bold purple]Pure huggingface transformers:[/bold purple]")
print()
for model_name in MODELS:
    print(f"[bold blue]Model: {model_name}[/bold blue]")
    inp = []
    att = []
    # Same fast/slow comparison with the tokenizers used directly via transformers
    for use_fast in [True, False]:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            model_max_length=model_max_length,
            use_fast=use_fast,
        )

        inputs = tokenizer(text)

        input_ids = tuple(inputs["input_ids"])
        attention_mask = tuple(inputs["attention_mask"])

        inp.append(input_ids)
        att.append(attention_mask)

    if inp[0] != inp[1]:
        print("[red]Different input_ids[/red]")

    if att[0] != att[1]:
        print("[red]Different attention masks[/red]")

Output:

Model: distilroberta-base
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
Different align_data
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: BAAI/bge-small-en-v1.5

Model: microsoft/deberta-v3-xsmall
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
Different align_data
(2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: microsoft/Multilingual-MiniLM-L12-H384
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
Different align_data
(2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Different align_lengths

Pure huggingface transformers:

Model: distilroberta-base
Model: BAAI/bge-small-en-v1.5
Model: microsoft/deberta-v3-xsmall
Model: microsoft/Multilingual-MiniLM-L12-H384

Your Environment

  • Operating System: Ubuntu 24
  • Python Version Used: 3.12.3
  • spaCy Version Used: 3.8.3
  • Environment Information:
uv pip list | grep "spacy"                  
spacy                                 3.8.3
spacy-alignments                      0.9.1
spacy-curated-transformers            0.3.0
spacy-legacy                          3.0.12
spacy-loggers                         1.0.5
spacy-lookups-data                    1.0.5
spacy-transformers                    1.3.5
spacy-utils                           0.1.0

uv pip list | grep "transformers"           
curated-transformers                  0.1.1
spacy-curated-transformers            0.3.0
spacy-transformers                    1.3.5
transformers                          4.36.2

Info about spaCy

  • spaCy version: 3.8.3
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Pipelines: en_core_web_trf (3.8.0), en_core_web_sm (3.8.0), en_core_web_lg (3.8.0), en_core_web_md (3.8.0)