import io
import spacy

nlp = spacy.load('de_core_news_sm')
# Read the whole file as unicode text (io.open decodes it for us)
with io.open("test.txt", mode="r", encoding="utf-8") as f:
    file_contents = f.read()
doc = nlp(file_contents)
sents = list(doc.sents)
# Print each recognised entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)
file contents:
Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren. Sie haben zwei Kinder.
Thanks for the report – and yeah, I've noticed similar issues as well 😞 We'd love to have better models and support a more diverse annotation scheme for other languages, to make it consistent with the English models.
The problem at the moment is that we need to make do with the existing datasets that are available – or produce our own annotations (which we're planning for the future, using Prodigy).
The German entity recognizer is trained on Wikipedia data, which works okay for some cases – but it also has its limitations, especially for texts that are very different from Wikipedia texts. Because German capitalises all nouns, capitalisation also can't really be used as an indicator for an entity (as it can in English), so the model currently seems to produce a lot of false positives for nouns.
That said, it's also important to keep in mind that the pre-trained models distributed with the library are baseline models that were tuned for the best possible compromise of speed, size, and accuracy and make it easy to get started building your own systems. You almost always want to adjust the model to your specific domain if extracting named entities is important to you. You can find more details on this in the documentation on training and updating models.
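As a rough illustration of what adjusting the model to your own data could start from, here is a minimal sketch that turns the sentence from this report into one annotated example. It assumes the `(text, {"entities": [(start, end, label)]})` shape with character offsets that spaCy v2-style NER training examples use. `make_example` is a hypothetical helper, not part of spaCy. The dates are deliberately left unannotated so that an updated model could learn not to tag them.

```python
# Hypothetical helper (not part of spaCy): build one NER training example
# with character offsets, assuming the spaCy v2-style annotation shape.
def make_example(text, entities):
    """entities: list of (surface_string, label), in order of appearance."""
    spans = []
    cursor = 0
    for surface, label in entities:
        # Search from the end of the previous entity so repeated
        # strings are matched in reading order.
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError("entity not found in text: %r" % surface)
        end = start + len(surface)
        spans.append((start, end, label))
        cursor = end
    return (text, {"entities": spans})


text = (u"Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren. "
        u"Sie haben zwei Kinder.")
# Only the two people are entities; "2.Oktober" and "4.Oktober" get no label.
example = make_example(text, [(u"Anna", "PER"), (u"Uwe", "PER")])
print(example[1])
```

A list of such examples is what you would then feed into whatever update loop the training documentation describes for your spaCy version.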
How to reproduce the behaviour
result:
(u'Anna', u'PER')
(u'2.Oktober', u'ORG')
(u'Uwe', u'LOC')
(u'4.Oktober', u'LOC')