Dependency sentence segmenter handles newlines inconsistently between languages #13059
Thanks for reporting this!
Can you elaborate on the inconsistency between languages?
While that is a reasonable take, bear in mind that spaCy's pretrained models can be sensitive to this kind of whitespace. I recommend removing such characters from your text or using the sentencizer component (and adjusting it to your use case, if necessary).
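The first suggestion (removing stray newline characters before parsing) can be sketched with nothing but the standard library. The normalization pattern below is an assumption on my part, not something prescribed by spaCy; adjust it if your documents use newlines meaningfully (e.g. as paragraph breaks):

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse newlines and other whitespace runs into single spaces,
    so the parser never sees whitespace tokens mid-sentence."""
    return re.sub(r"\s+", " ", text).strip()

# A newline in the middle of the text becomes a plain space:
cleaned = normalize_whitespace("Prima frase.\nSeconda frase.")
# cleaned == "Prima frase. Seconda frase."
```

The cleaned string can then be passed to `nlp()` as usual.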
I believe this behaviour occurs much more frequently in Italian than other languages. As well as the examples in the notebook where English seems able to identify 2 sentences where Italian gets 3, I'm working on a partially-parallel corpus and Italian has a mean sents/doc that's noticeably higher than any other language (21 vs 14-16), which makes me think it's an Italian-specific issue.
I was hoping to use the parser approach because the docs don't have ideal punctuation, but I tried the sentencizer with
It's possible something is going wrong with the whitespace augmentation, which is only supposed to attach whitespace to the preceding token and not create new sentences. We might look into this at a later point. We're using this augmentation with the corpus - feel free to have a closer look and/or train your own model with modified settings:

```ini
[corpora.train.augmenter]
@augmenters = "spacy.combined_augmenter.v1"
lower_level = 0.1
whitespace_level = 0.1
whitespace_per_token = 0.05
whitespace_variants = "[\" \",\"\\t\",\"\\n\",\"\\u000b\",\"\\f\",\"\\r\",\"\\u001c\",\"\\u001d\",\"\\u001e\",\"\\u001f\",\" \",\"\\u0085\",\"\\u00a0\",\"\\u1680\",\"\\u2000\",\"\\u2001\",\"\\u2002\",\"\\u2003\",\"\\u2004\",\"\\u2005\",\"\\u2006\",\"\\u2007\",\"\\u2008\",\"\\u2009\",\"\\u200a\",\"\\u2028\",\"\\u2029\",\"\\u202f\",\"\\u205f\",\"\\u3000\"]"
orth_level = 0.0
orth_variants = null
```
How to reproduce the behaviour
Colab notebook demonstrating problem
When parsing a sentence that contains newlines, the Italian parser sometimes assigns the newline to a sentence by itself, for example:
Produces 3 sentences:
There are various experiments with different combinations of punctuation in the notebook.
Looking at the tokens and their `is_sent_start` property, it seems under some circumstances the `\n` and `I` tokens are both assigned as the start of a new sentence. I have not been able to cause this problem with `en_core_web_sm`, which always correctly identifies 2 sentences.

Although I understand that sentence segmentation based on the dependency parser is probabilistic and not always correct, it seems there's some inconsistency between languages here, and I don't think it would ever be correct for a whitespace token to be assigned as the start of a sentence.
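As a stopgap, the rule "a whitespace token never starts a sentence" could be enforced as a post-processing pass. The sketch below is illustrative only: it operates on a plain list of `(text, is_sent_start)` pairs rather than real spaCy tokens, and the helper name is mine, not spaCy's. In a real pipeline the same logic would run over `doc` and set `token.is_sent_start` before the sentences are consumed:

```python
def fix_whitespace_sent_starts(tokens):
    """Given [(text, is_sent_start), ...], move any sentence start that
    falls on a whitespace-only token onto the next non-space token."""
    fixed = [[text, start] for text, start in tokens]
    for i, (text, start) in enumerate(fixed):
        if start and text.isspace():
            fixed[i][1] = False                  # whitespace cannot open a sentence
            for j in range(i + 1, len(fixed)):   # find the next real token
                if not fixed[j][0].isspace():
                    fixed[j][1] = True
                    break
    return [tuple(t) for t in fixed]

# The '\n' token wrongly marked as a sentence start is corrected,
# leaving the following word as the only new sentence boundary:
tokens = [("Ciao", True), (".", False), ("\n", True), ("Io", True), ("parlo", False)]
fixed = fix_whitespace_sent_starts(tokens)
# fixed[2] == ("\n", False), fixed[3] == ("Io", True)
```

With real spaCy objects this would fit naturally into a small custom pipeline component registered after the parser.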
Your Environment