
Inconsistent Token Alignment Between Fast and Slow Tokenizers in spacy-transformers #13730

Rassibassi opened this issue Jan 15, 2025 · 0 comments

Description

When using spacy-transformers with various HuggingFace models, I've discovered inconsistencies in the token alignment data (doc._.trf_data.align) between fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations

  1. DeBERTa Models: The alignment IDs are duplicated when using the fast tokenizer (see the condensed sketch after this list):

    • Fast tokenizer produces pairs of duplicate IDs: (1, 2, 2, 3, 3, 4, 4, ...)
    • Slow tokenizer produces sequential IDs: (1, 2, 3, 4, ...)
  2. RoBERTa Models: Show minor differences in alignment:

    • Both align_data and align_lengths differ between the fast/slow implementations
    • Fast tokenizer align lengths: (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
    • Slow tokenizer align lengths: (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)
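
For quick inspection, here is a condensed variant of the full reproduction script further below, limited to a single DeBERTa model and the align attribute. This is only a minimal sketch built from the same config (shorter text, model_max_length omitted); doc._.trf_data.align is the ragged token-to-wordpiece mapping whose data and lengths are compared throughout this report.

import spacy

# Minimal sketch: print doc._.trf_data.align for one DeBERTa model,
# toggling use_fast (condensed from the full script below).
text = "Copenhagen is the capital and most populous city of Denmark."

for use_fast in (True, False):
    nlp = spacy.blank("en")
    nlp.add_pipe(
        "transformer",
        config={
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": "microsoft/deberta-v3-xsmall",
                "tokenizer_config": {"use_fast": use_fast},
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        },
    )
    nlp.initialize()
    doc = nlp(text)
    align = doc._.trf_data.align  # token -> wordpiece index mapping
    print(use_fast, align.data.flatten().tolist())
    print(use_fast, align.lengths.tolist())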

Verification

  • The issue appears to be specific to spacy-transformers: direct usage of the HuggingFace tokenizers shows no such discrepancies (see the condensed check after this list)
  • The differences affect both the alignment data and the alignment lengths
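
The condensed check below isolates the second half of the full script: the tokenizers are loaded directly with transformers' AutoTokenizer and the fast/slow encodings of the same text are compared, with spaCy out of the loop. This is a sketch of the sanity check only, using one of the affected models; it is not a fix.

from transformers import AutoTokenizer

# Condensed check: outside spacy-transformers, the fast and slow tokenizers
# produce the same input_ids and attention_mask for this text.
text = "Copenhagen is the capital and most populous city of Denmark."

encodings = []
for use_fast in (True, False):
    tokenizer = AutoTokenizer.from_pretrained(
        "microsoft/deberta-v3-xsmall", use_fast=use_fast
    )
    enc = tokenizer(text)
    encodings.append((tuple(enc["input_ids"]), tuple(enc["attention_mask"])))

assert encodings[0] == encodings[1], "fast and slow tokenizers disagree"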

Reproduction Steps

  1. Run the script below, which tests multiple models with both the fast and slow tokenizer implementations
  2. Compare the alignment data and lengths between the fast/slow tokenizer variants
  3. Note the systematic duplication for the DeBERTa models and the alignment shifts for the RoBERTa models

How to reproduce the behaviour

Run the following script:

import warnings

warnings.simplefilter("ignore")

import spacy
from rich import print
from transformers import AutoTokenizer

MODELS = [
    "distilroberta-base",
    # "roberta-base",
    # "intfloat/e5-small-v2",
    "BAAI/bge-small-en-v1.5",
    "microsoft/deberta-v3-xsmall",
    # "microsoft/deberta-v3-small",
    "microsoft/Multilingual-MiniLM-L12-H384",
    # "microsoft/deberta-v3-large",
]

model_max_length = 1024

text = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

for model_name in MODELS:
    wordpieces_strings = []
    wordpieces_input_ids = []
    wordpieces_attention_mask = []

    model_output_last_hidden_state = []
    align_data = []
    align_lengths = []

    print(f"[bold blue]Model: {model_name}[/bold blue]")
    # Run the same text through spacy-transformers with the fast and slow tokenizers
    for use_fast in [True, False]:
        nlp = spacy.blank("en")
        config = {
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": model_name,
                "tokenizer_config": {
                    "use_fast": use_fast,
                    "model_max_length": model_max_length,
                },
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        }
        nlp.add_pipe("transformer", config=config)
        nlp.initialize()

        # tokenizer = nlp.get_pipe("transformer").model.tokenizer
        # print(f"[bold blue]Tokenizer: {type(tokenizer)}[/bold blue]")

        doc = nlp(text)

        # Record wordpieces, model output shape, and alignment for later comparison
        wordpieces_strings.append(doc._.trf_data.wordpieces.strings[0])
        wordpieces_input_ids.append(
            tuple(doc._.trf_data.wordpieces.input_ids[0].tolist())
        )
        wordpieces_attention_mask.append(
            tuple(doc._.trf_data.wordpieces.attention_mask[0].tolist())
        )

        model_output_last_hidden_state.append(
            doc._.trf_data.model_output["last_hidden_state"].squeeze(0).shape
        )
        align_data.append(tuple(doc._.trf_data.align.data.flatten().tolist()))
        align_lengths.append(tuple(doc._.trf_data.align.lengths.tolist()))

    if wordpieces_strings[0] != wordpieces_strings[1]:
        print("[red]Different wordpieces_strings[/red]")

    if wordpieces_input_ids[0] != wordpieces_input_ids[1]:
        print("[red]Different wordpieces_input_ids[/red]")

    if wordpieces_attention_mask[0] != wordpieces_attention_mask[1]:
        print("[red]Different wordpieces_attention_mask[/red]")

    if model_output_last_hidden_state[0] != model_output_last_hidden_state[1]:
        print("[red]Different model_output_last_hidden_state[/red]")

    if align_data[0] != align_data[1]:
        print(align_data[0])
        print(align_data[1])
        print("[red]Different align_data[/red]")

    if align_lengths[0] != align_lengths[1]:
        print(align_lengths[0])
        print(align_lengths[1])
        print("[red]Different align_lengths[/red]")

    print()

print("[bold purple]Pure huggingface transformers:[/bold purple]")
print()
for model_name in MODELS:
    print(f"[bold blue]Model: {model_name}[/bold blue]")
    inp = []
    att = []
    # Same fast/slow comparison with the tokenizers used directly via transformers
    for use_fast in [True, False]:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            model_max_length=model_max_length,
            use_fast=use_fast,
        )

        inputs = tokenizer(text)

        input_ids = tuple(inputs["input_ids"])
        attention_mask = tuple(inputs["attention_mask"])

        inp.append(input_ids)
        att.append(attention_mask)

    if inp[0] != inp[1]:
        print("[red]Different input_ids[/red]")

    if att[0] != att[1]:
        print("[red]Different attention masks[/red]")

Output:

Model: distilroberta-base
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
Different align_data
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: BAAI/bge-small-en-v1.5

Model: microsoft/deberta-v3-xsmall
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
Different align_data
(2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: microsoft/Multilingual-MiniLM-L12-H384
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
Different align_data
(2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Different align_lengths

Pure huggingface transformers:

Model: distilroberta-base
Model: BAAI/bge-small-en-v1.5
Model: microsoft/deberta-v3-xsmall
Model: microsoft/Multilingual-MiniLM-L12-H384

Your Environment

  • Operating System: Ubuntu 24
  • Python Version Used: 3.12.3
  • spaCy Version Used: 3.8.3
  • Environment Information:
uv pip list | grep "spacy"                  
spacy                                 3.8.3
spacy-alignments                      0.9.1
spacy-curated-transformers            0.3.0
spacy-legacy                          3.0.12
spacy-loggers                         1.0.5
spacy-lookups-data                    1.0.5
spacy-transformers                    1.3.5
spacy-utils                           0.1.0

uv pip list | grep "transformers"           
curated-transformers                  0.1.1
spacy-curated-transformers            0.3.0
spacy-transformers                    1.3.5
transformers                          4.36.2

Info about spaCy

  • spaCy version: 3.8.3
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Pipelines: en_core_web_trf (3.8.0), en_core_web_sm (3.8.0), en_core_web_lg (3.8.0), en_core_web_md (3.8.0)