Improved the singularize method in inflect.py #220
The singularize method previously reached 95% accuracy, measured against the CELEX English morphology word forms. The following changes raise that accuracy to 99%:
Added more words to the set singular_uninflected
In the singularize method, changed the if-condition for the set singular_uninflected from
if x.endswith(w): return word
to
if x == w or w == x + "s": return x
because the former is a suffix test rather than an equality test, so it also affected words that are not themselves entries in the set.
The new condition checks whether the word passed as the argument appears in the set either as is or followed by an "s". It then returns the singular form stored in the set rather than the word itself, which may have been passed in plural form.
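The difference between the two conditions can be sketched as follows. This is a minimal illustration rather than the code in inflect.py: the set contents and the helper names old_check and new_check are assumptions made for the example.

```python
# Hypothetical subset of singular_uninflected, for illustration only.
singular_uninflected = {"series", "species", "bison"}


def old_check(word):
    """Previous condition: a suffix test against each entry in the set."""
    w = word.lower()
    for x in singular_uninflected:
        if x.endswith(w):
            return word
    return None


def new_check(word):
    """New condition: exact match, or the entry followed by "s"."""
    w = word.lower()
    for x in singular_uninflected:
        if x == w or w == x + "s":
            return x
    return None


print(old_check("ies"))     # matched, because "series" ends with "ies"
print(new_check("ies"))     # None, because "ies" is not an entry
print(new_check("bisons"))  # the stored singular form, "bison"
```

Returning x instead of word is what allows a plural input such as "bisons" to come back in the singular form held in the set.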
Added more words to the list singular_uncountable, grouped by comments into categories such as abstract ideas and expressions, natural phenomena, and general, for ease of reading and understanding
Added more words to the list singular_ie and the dictionary singular_irregular
Words that could be grouped by a pattern were written as regular expressions in singular_rules instead of being added individually to the lists and dictionaries mentioned above.
In the singularize method, changed the if-condition for the dictionary singular_irregular from
if w.endswith(x):
to
if x == w:
because the former treated each key x in the dictionary as an ending of the word passed to the singularize method. The latter checks whether the word w passed as the argument is itself a key in the dictionary by comparing it to x. If it is, the method returns the singular form of w, that is, singular_irregular[x]
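A minimal sketch of why the suffix test misfired, under stated assumptions: the two dictionary entries are inferred from the outputs reported in the linked issues rather than copied from inflect.py, the regex substitution in old_irregular is an assumed stand-in for the previous replacement step, and both function names are hypothetical.

```python
import re

# Hypothetical subset of singular_irregular, inferred from the reported
# outputs ("flour" -> "flmy", "olives" -> "olife"); not the full dictionary.
singular_irregular = {"our": "my", "lives": "life"}


def old_irregular(word):
    """Previous behaviour: any word ending in a key is rewritten."""
    w = word.lower()
    for x in singular_irregular:
        if w.endswith(x):
            # Assumed replacement step: swap the matched ending.
            return re.sub("(?i)" + x + "$", singular_irregular[x], word)
    return None


def new_irregular(word):
    """New behaviour: only a word equal to a key is rewritten."""
    w = word.lower()
    for x in singular_irregular:
        if x == w:
            return singular_irregular[x]
    return None


print(old_irregular("flour"))   # flmy
print(old_irregular("olives"))  # olife
print(new_irregular("flour"))   # None, so later rules handle the word
print(new_irregular("lives"))   # life
```

With the equality test, words such as "flour" and "olives" no longer match the irregular entries and fall through to the remaining rules.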
Added more regular expressions to the list singular_rules to cover further singularization rules and improve the accuracy of the singularize method.
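As an illustration of how a rule list of this kind can be applied, here is a small sketch. The patterns, the name example_rules, and the helper apply_rules are all hypothetical and are not the rules added to singular_rules in this pull request.

```python
import re

# Hypothetical (pattern, replacement) pairs, for illustration only.
example_rules = [
    (re.compile(r"(?i)([^aeiou])ies$"), r"\1y"),  # berries -> berry
    (re.compile(r"(?i)(x|ch|sh|ss)es$"), r"\1"),  # boxes -> box
    (re.compile(r"(?i)([^s])s$"), r"\1"),         # cats -> cat
]


def apply_rules(word, rules=example_rules):
    """Apply the first rule whose pattern matches; otherwise return the word."""
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


for plural in ("berries", "boxes", "cats", "glass"):
    print(plural, "->", apply_rules(plural))
```

Rule order matters in a design like this: more specific endings are listed before the general trailing "s" rule so that they are tried first.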
With these changes, this commit resolves the following currently open issues:
| Issue | Word singularized | Previous output | Current output |
|---|---|---|---|
| Singular form of words ending in 'our' and in 'lives' are incorrect #141, issue singularizing "flour" #175 | flour | flmy | flour |
| Singular form of words ending in 'our' and in 'lives' are incorrect #141 | colour | colmy | colour |
| Singular form of words ending in 'our' and in 'lives' are incorrect #141 | your | ymy | your |
| Singular form of words ending in 'our' and in 'lives' are incorrect #141 | olives | olife | olive |
| issue singularizing "hummus" #176 | hummus | hummu | hummus |
The words added to sets singular_uninflected and singular_uncountable were also added to the lists in dictionary plural_categories["uninflected"] and plural_categories["uncountable"] for consistency.
Note that the 99% accuracy was measured by running corpora/test_en.py and applies only to the CELEX English morphology word forms dataset.
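For readers unfamiliar with how such a figure is computed, the following sketch shows the general idea of scoring a singularizer against (plural, singular) pairs. It is not the code in corpora/test_en.py: the function accuracy, the toy pairs, and the naive singularizer are all assumptions made for the example.

```python
def accuracy(pairs, singularize):
    """Fraction of (plural, singular) pairs that singularize gets right."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for plural, singular in pairs if singularize(plural) == singular)
    return correct / len(pairs)


def naive_singularize(word):
    """Hypothetical stand-in that only strips a trailing "s"."""
    return word[:-1] if word.endswith("s") else word


# Toy data for illustration; not taken from the CELEX dataset.
toy_pairs = [
    ("cats", "cat"),
    ("dogs", "dog"),
    ("hummus", "hummus"),
    ("olives", "olive"),
]

print(accuracy(toy_pairs, naive_singularize))  # 0.75
```

A score obtained this way depends entirely on the word pairs used, which is why the reported figure holds only for the dataset it was measured on.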