This repository has been archived by the owner on Oct 4, 2022. It is now read-only.

SentenceTokenizer incorrectly processes punctuation marks within words #402

Open
nataliashitova opened this issue Nov 19, 2019 · 0 comments

Explanation

The currently used SentenceTokenizer produces wrong results when a punctuation mark such as !, ? or . is used within a word (e.g., in a company name).

Examples

Example 1

The following text (see Yoast/wordpress-seo#13726)

The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."

Example 2

The same text as in Example 1 but with a . instead of the !

The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.

gets correctly parsed into one sentence

0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."

Example 3

The same text as in Example 2 but with the entire word FRITZ.APP in capitals

The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.

gets incorrectly parsed into the following sentences

0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."

Why does it happen?

The problem in Example 1 occurs because the SentenceTokenizer splits text on !, ?, ; and ... without checking whether the cut-off remainder begins as a proper sentence should (e.g., with a space followed by a capital letter). Here is the rule where this check should take place.

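Note that such a check is implemented for the case where the text is split on a period (.). Specifically, the rule checks whether the second character of the cut-off remainder text is a capital letter, a number, etc. This is why Example 2 is parsed correctly: the remainder "App WLAN ..." has a lowercase p as its second character, so no split is made.
However, the SentenceTokenizer does not check that the first character of the remainder text is a space. In Example 3 the remainder "APP WLAN ..." has a capital P as its second character, so the check passes and the text is split, even though no space follows the period.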
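
To illustrate the missing check, here is a minimal sketch in Python. It is not the YoastSEO.js implementation: the names `TERMINATORS`, `is_valid_sentence_beginning` and `split_sentences` are hypothetical, and for simplicity it applies the same check to every terminator, which may be stricter than desired for `;`. It only splits when the remainder starts with whitespace followed by a capital letter or a digit (or when the text ends).

```python
from typing import List

# Assumption: the set of characters treated as sentence terminators.
TERMINATORS = set(".!?;…")


def is_valid_sentence_beginning(remainder: str) -> bool:
    """Return True if `remainder` looks like the start of a new sentence."""
    if remainder == "":
        return True  # end of the text: the current sentence simply ends here
    if not remainder[0].isspace():
        return False  # no space after the terminator, e.g. "FRITZ!App"
    first = remainder.lstrip()[:1]
    # Only trailing whitespace left, or the next visible character is a
    # capital letter or a digit.
    return first == "" or first.isupper() or first.isdigit()


def split_sentences(text: str) -> List[str]:
    """Split `text` on terminators that are followed by a valid beginning."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in TERMINATORS and is_valid_sentence_beginning(text[i + 1:]):
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this check, the texts from Examples 1 and 3 each stay a single sentence, while ordinary text such as "Hello there! How are you?" is still split after the exclamation mark. The sketch does not handle abbreviations, quotation marks, or HTML, which a real tokenizer would need to consider.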

Things to consider

  1. A fix for both problems seems to be pretty straightforward to implement.
  2. A few users have complained about these issues.
  3. The currently used SentenceTokenizer will not be used in its current form once the tree-based text parser is implemented, because the tokenizer relies on HTML tags.
  4. We will still need a variant of a sentence tokenizer to be able to operate on sentences in researches. The work on fixing the current sentence tokenizer will therefore not necessarily be lost.