German model lemmatizes punctuation inconsistently #13729

Michael-E-Rose · 2025-01-15T08:33:27Z

I'm on Python 3.11 with spaCy 3.7.4 and noticed inconsistent behavior when lemmatizing German text: the German model returns '--' as the lemma of punctuation tokens.

How to reproduce the behaviour

import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("(Das ist ein Test!)")
for token in doc:
    print(f"Text: '{token.text}', Lemma: '{token.lemma_}'")

Output:

Text: '(', Lemma: '--'
Text: 'Das', Lemma: 'der'
Text: 'ist', Lemma: 'sein'
Text: 'ein', Lemma: 'ein'
Text: 'Test', Lemma: 'Test'
Text: '!', Lemma: '--'
Text: ')', Lemma: '--'

By contrast, the standard English model returns each punctuation mark as its own lemma:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("(This is a test!)")
for token in doc:
    print(f"Text: '{token.text}', Lemma: '{token.lemma_}'")

Output:

Text: '(', Lemma: '('
Text: 'This', Lemma: 'this'
Text: 'is', Lemma: 'be'
Text: 'a', Lemma: 'a'
Text: 'test', Lemma: 'test'
Text: '!', Lemma: '!'
Text: ')', Lemma: ')'

On Stack Overflow, someone who answered reported that the Dutch model treats punctuation the same way as the English model.
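Until the models agree, one possible workaround is to post-process the lemmas. The sketch below is only an illustration in plain Python: `PLACEHOLDER`, `is_punctuation`, and `normalize_lemma` are hypothetical helper names, not part of spaCy, and it assumes that '--' only ever appears as a lemma for punctuation tokens, as in the German output above.

```python
import unicodedata

# Assumption: "--" is the lemma the German model assigns to punctuation,
# as seen in the output above. This is not a spaCy constant.
PLACEHOLDER = "--"


def is_punctuation(text: str) -> bool:
    """Return True if every character is in a Unicode punctuation category (P*)."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("P") for ch in text
    )


def normalize_lemma(text: str, lemma: str) -> str:
    """Fall back to the surface form when a punctuation token has the placeholder lemma."""
    if lemma == PLACEHOLDER and is_punctuation(text):
        return text
    return lemma


# (text, lemma) pairs copied from the German output above
german = [
    ("(", "--"), ("Das", "der"), ("ist", "sein"), ("ein", "ein"),
    ("Test", "Test"), ("!", "--"), (")", "--"),
]
normalized = [(text, normalize_lemma(text, lemma)) for text, lemma in german]
for text, lemma in normalized:
    print(f"Text: '{text}', Lemma: '{lemma}'")
```

With a loaded pipeline, the same helper could be applied per token as `normalize_lemma(token.text, token.lemma_)`, which would make the German output match the English convention for punctuation.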

Your Environment

Operating System: Windows
Python Version Used: 3.11
spaCy Version Used: 3.7.4
Environment Information: locally (no venv or container)
