SentenceTokenizer incorrectly processes punctuation marks within words #402

nataliashitova · 2019-11-19T14:00:04Z

Explanation

The currently used SentenceTokenizer generates wrong results when a punctuation mark such as !, ? or . is used within a word (e.g., in a company name).

Examples

Example 1

The following text (see Yoast/wordpress-seo#13726)

The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."

Example 2

The same text as in Example 1 but with a . instead of the !

The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.

gets correctly parsed into one sentence

0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."

Example 3

The same text as in Example 2 but with the entire word FRITZ.APP capitalized

The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."

Why does it happen?

The problem in Example 1 occurs because the SentenceTokenizer splits text on !, ?, ; and ... without checking if the cut-off part begins as a proper sentence should (e.g., with a space and a capital letter). Here is the rule where this check should take place.

Note that such a check is implemented for the situation when the text is split on a full stop (.). Specifically, the rule checks whether the second character of the cut-off remainder text is a capital letter, a number, etc.
However, the SentenceTokenizer does not check that the first character of the remainder text is a space, which is a reason why the problem in Example 3 occurs.
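
To make the missing check concrete, here is a minimal sketch of a boundary test that is applied to every delimiter, not only to the full stop. It is written in Python for brevity and is not the yoastseo implementation: the names `SENTENCE_DELIMITERS`, `is_valid_sentence_start` and `split_sentences` are hypothetical, HTML handling is left out, and the set of characters accepted as a sentence start is an assumption. The ellipsis is covered here only because each of its dots goes through the same check.

```python
# Hypothetical names; a simplified model, not the yoastseo SentenceTokenizer.
SENTENCE_DELIMITERS = ".!?;"
# Assumed set of non-letter characters that may open a sentence.
OPENING_CHARACTERS = "\"'(["


def is_valid_sentence_start(remainder: str) -> bool:
    """Check whether the text right after a delimiter looks like a new sentence.

    The remainder must begin with whitespace, followed by an uppercase
    letter, a digit, or an opening quote or bracket.
    """
    stripped = remainder.lstrip()
    if stripped == remainder or not stripped:
        # No whitespace after the delimiter, or nothing but whitespace.
        return False
    first = stripped[0]
    return first.isupper() or first.isdigit() or first in OPENING_CHARACTERS


def split_sentences(text: str) -> list[str]:
    """Split text on delimiters, but only where a valid sentence start follows."""
    sentences = []
    start = 0
    for index, character in enumerate(text):
        if character not in SENTENCE_DELIMITERS:
            continue
        remainder = text[index + 1:]
        # Split at the end of the text or before a valid sentence start.
        if remainder == "" or is_valid_sentence_start(remainder):
            sentences.append(text[start:index + 1].strip())
            start = index + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example_1 = "The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater."
example_3 = "The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater."
print(split_sentences(example_1))
print(split_sentences(example_3))
print(split_sentences("It works! Does it? Yes."))
```

Under these assumptions, the texts from Example 1 and Example 3 each stay a single sentence, because the character directly after the ! or . is not whitespace, while ordinary text such as "It works! Does it? Yes." is still split into three sentences.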

Things to consider

A fix for both problems seems to be pretty straightforward to implement.
A few users have complained about these issues.
The currently used SentenceTokenizer will not be used in its current form once the tree-based text parser is implemented, because the tokenizer relies on HTML tags.
We will still need a variant of a sentence tokenizer to be able to work with sentences in researches, so the work on fixing the current sentence tokenizer will not necessarily be lost.


nataliashitova added the Package: yoastseo label Nov 19, 2019

atimmer mentioned this issue Nov 28, 2019: Create a linguistic parser #406 (Closed)

manuelaugustin mentioned this issue Jan 22, 2020: LIN-80 Create sentence parser #459 (Merged)
