German stemmer doesn't match schlummert/schlummern or grüßend/gegrüßt/grüßen #139

Open
bkazez opened this issue Nov 2, 2020 · 1 comment


bkazez commented Nov 2, 2020

Hello,

I'm using Snowball via Elasticsearch, which is based on Lucene. The Snowball German stemming is not matching some common forms:

  • "schlummert" should match "schlummern" (infinitive) but instead is unchanged
  • "grüßend" should match "grüßen" (infinitive) but instead yields "grussend"
  • "gegrüßt" should match "grüßen" (infinitive) but instead yields "gegrusst"

Original Lucene bug was here: https://issues.apache.org/jira/browse/LUCENE-9410?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17217670#comment-17217670

Member

ojwb commented Oct 15, 2024

Looks like I didn't explicitly repeat the advice from #91 here, but to achieve what you ask for we would need a way to remove these suffixes (or the prefix, in the case of ge-) that doesn't negatively affect words that happen to end in t or n, or to start with ge, where it shouldn't be removed. If we're unable to come up with such a rule then it's better not to try to remove them (because understemming is generally less problematic than overstemming), but it would be useful to note the limitation in the algorithm description on the website.

The website does actually already note ge- as "almost intractable", though in the "germanic" overview page rather than the page about the German stemmer:

"the almost intractable problems of [...] prefixed and infixed ge"

For example, you want ge removed from gegrüßt, but we shouldn't remove it from some other words. Here are some cases I trivially found from a grep in our German word list for words starting with ge which are the same length and also end in a consonant and t:

  • gedeiht
  • gelangt
  • gelingt
  • genießt
  • gesellt
  • gesteht
  • gewöhnt
  • Gedicht (particularly unhelpful as it would get conflated with dicht which has a totally different meaning)
  • Gewicht

The last two are nouns so should be capitalised in text, but the current expectation is that input is lower-cased before being fed to the stemmer, so we can't use the capitalisation as a clue. Potentially that could be changed, but doing so would be somewhat disruptive for users of the stemmers so it's not a simple change to make. It would also need to deal with words which aren't nouns being capitalised at the start of a sentence, in titles, etc.

A solution doesn't have to be perfect, it just needs to not be harmful to other cases, so if there's a rule we can use to identify a significant number of cases where ge- should be removed without triggering in cases where it shouldn't be removed we could use that.
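
One way to check a candidate rule against that criterion is to run it over a list of words it should help and a list of words it must leave alone. The sketch below is only an illustration: `strip_ge_candidate` is a hypothetical rule made up for this example (it is not part of Snowball), and the word lists are just the lower-cased examples from this thread.

```python
# Hypothetical candidate rule, NOT part of Snowball: drop a leading "ge"
# from words that end in a consonant followed by "t".
VOWELS = set("aeiouyäöü")

def strip_ge_candidate(word: str) -> str:
    if (
        word.startswith("ge")
        and len(word) >= 5
        and word.endswith("t")
        and word[-2] not in VOWELS
    ):
        return word[2:]
    return word

# Form from this issue where removing ge- is wanted.
WANTED = ["gegrüßt"]
# Lower-cased words from the list above where ge- must stay.
MUST_KEEP = ["gedeiht", "gelangt", "gelingt", "genießt", "gesellt",
             "gesteht", "gewöhnt", "gedicht", "gewicht"]

def evaluate(rule, wanted, must_keep):
    """Return the words the rule changed in each list."""
    helped = [w for w in wanted if rule(w) != w]
    harmed = [w for w in must_keep if rule(w) != w]
    return helped, harmed

helped, harmed = evaluate(strip_ge_candidate, WANTED, MUST_KEEP)
print("helped:", helped)
print("harmed:", harmed)
```

This particular rule fires on every word in `MUST_KEEP` (for instance it turns gedicht into dicht), so it fails the "not harmful" test; a usable rule would need to leave `harmed` empty, or nearly so, on a full word list.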

Removing -t and -d also seems hard to do without removing them from words which just happen to end with these letters. Even trying a more targeted rule for your particular example of just removing -ert is problematic as it would conflate e.g. hundert and Hund. Similarly, just removing -end would affect Tugend.
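
The suffix problem can be seen with a few lines of plain string handling. This is a minimal sketch under stated assumptions: `strip_suffix` and its `min_stem` length guard are hypothetical helpers invented for this illustration, not Snowball's actual suffix-removal logic (which works on regions of the word).

```python
def strip_suffix(word: str, suffix: str, min_stem: int = 3) -> str:
    """Naively remove `suffix` if at least `min_stem` letters remain."""
    if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
        return word[: -len(suffix)]
    return word

# (word, suffix) pairs taken from the examples in this thread, lower-cased.
cases = [
    ("schlummert", "ert"),  # removal wanted by the issue
    ("hundert", "ert"),     # removal not wanted
    ("grüßend", "end"),     # removal wanted by the issue
    ("tugend", "end"),      # removal not wanted
]

results = {word: strip_suffix(word, suffix) for word, suffix in cases}
for word, stem in results.items():
    print(f"{word} -> {stem}")
```

The same rule that shortens schlummert also reduces hundert to hund, which is exactly the conflation with Hund described above, and it truncates tugend as well.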
