Description
When using spacy-transformers with various HuggingFace models, I've discovered inconsistencies in the token alignment data (doc._.trf_data.align) between the fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations
DeBERTa Models: The alignment IDs are duplicated when using the fast tokenizer:

- Fast tokenizer: (1, 2, 2, 3, 3, 4, 4, ...)
- Slow tokenizer: (1, 2, 3, 4, ...)
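One way such duplicated IDs can arise is from the character-offset overlap matching that fast tokenizers enable via their offset mappings. The sketch below is a hypothetical, self-contained illustration of that mechanism (it is not spacy-transformers' actual implementation, and the spans are made up): if a tokenizer reports each wordpiece span twice, every spaCy token matches two wordpiece indices and the flattened alignment doubles in length.

```python
# Hypothetical sketch of offset-overlap alignment between spaCy tokens
# and wordpieces. All spans below are invented for illustration.

def align(token_offsets, wp_offsets):
    """For each token's (start, end) span, collect the indices of all
    wordpieces whose character span overlaps it, then flatten."""
    alignment = []
    for t_start, t_end in token_offsets:
        alignment.extend(
            i for i, (w_start, w_end) in enumerate(wp_offsets)
            if w_start < t_end and w_end > t_start  # spans overlap
        )
    return alignment

# "hello world" split into two spaCy tokens
tokens = [(0, 5), (6, 11)]

# Well-behaved tokenizer: one wordpiece per token, plus special tokens
# with empty (0, 0) spans that never overlap anything.
clean = [(0, 0), (0, 5), (6, 11), (0, 0)]
print(align(tokens, clean))  # [1, 2]

# Tokenizer that reports each span twice: every token now aligns to a
# pair of wordpiece indices, so the alignment doubles in length.
dup = [(0, 0), (0, 5), (0, 5), (6, 11), (6, 11), (0, 0)]
print(align(tokens, dup))  # [1, 2, 3, 4]
```

This does not reproduce the exact tuple reported above, but it shows how inconsistent offset mappings between tokenizer implementations directly change the alignment data.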
RoBERTa Models: Show only minor differences in align_data between the fast and slow implementations; the two differ at a single position:

- (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
- (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)
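For long alignment sequences like the RoBERTa case, spotting a single differing position by eye is error-prone. A small helper (not part of spacy-transformers; the example tuples are illustrative, not actual model output) can pinpoint the first divergence:

```python
from itertools import zip_longest

def first_divergence(fast, slow):
    """Return (index, fast_value, slow_value) for the first position
    where the two alignment sequences disagree, or None if identical.
    zip_longest pads the shorter sequence with None, so a pure length
    mismatch is also reported."""
    for i, (f, s) in enumerate(zip_longest(fast, slow)):
        if f != s:
            return (i, f, s)
    return None

# Illustrative stand-ins for the fast/slow align_data tuples:
fast = (4, 1, 1, 1, 1, 1, 1, 1, 1, 3)
slow = (4, 1, 1, 1, 1, 0, 1, 1, 1, 3)
print(first_divergence(fast, slow))  # (5, 1, 0)
```

Running this on the real doc._.trf_data.align output from both tokenizer configurations would localize exactly which spaCy token is affected.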
Verification
How to reproduce the behaviour
Run the following script:
Output:
Your Environment
Info about spaCy