JLemmaGen is a Java implementation of the LemmaGen project. It is an open source lemmatizer with 15 prebuilt European lexicons, and you can also build your own lexicon.
The LemmaGen project aims to provide a standardized, open source, multilingual platform for lemmatisation.
The project contains 3 libraries:
- lemmagen.jar - implementation of the lemmatizer and an API for building your own lemmatizers
- lemmagen-lucene.jar - Lucene filter that lemmatizes tokens
- lemmagen-lang.jar - prebuilt lemmatizers from MULTEXT-East dictionaries (IMPORTANT: see the License chapter)
Lemmatizer lm = LemmatizerFactory.getPrebuilt("mlteast-en");
assert("be".equals(lm.lemmatize("are")));
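The single-word call above extends naturally to whole sentences. The following is a minimal sketch, not part of JLemmaGen: the class `LemmatizeTokens`, the method `lemmatizeAll`, and the toy lookup table are all hypothetical, and the lookup table only stands in for a real lemmatizer so that the example runs without the library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class LemmatizeTokens {

    // Splits text on whitespace, lowercases each token and applies the
    // supplied lemmatizing function to it.
    static List<String> lemmatizeAll(String text, Function<String, String> lemmatize) {
        List<String> lemmas = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                lemmas.add(lemmatize.apply(token.toLowerCase(Locale.ROOT)));
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real lemmatizer: a tiny lookup table
        // that returns the word unchanged when it is not listed.
        Map<String, String> toy = Map.of("are", "be", "cats", "cat");
        Function<String, String> lemmatize = w -> toy.getOrDefault(w, w);

        System.out.println(lemmatizeAll("Cats are animals", lemmatize));
        // prints [cat, be, animals]
    }
}
```

With the library available, the stand-in could presumably be replaced by a function that delegates to the prebuilt lemmatizer, for example `w -> String.valueOf(lm.lemmatize(w))`, where `lm` is the instance obtained from `LemmatizerFactory.getPrebuilt` above.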
Dependency:
<dependency>
<groupId>eu.hlavki.text</groupId>
<artifactId>jlemmagen</artifactId>
<version>1.0</version>
</dependency>
Additionally, you can add the prebuilt language dictionaries:
<dependency>
<groupId>eu.hlavki.text</groupId>
<artifactId>jlemmagen-lang</artifactId>
<version>1.0</version>
</dependency>
You need these jars to integrate with Lucene/Solr:
- jlemmagen-lucene.jar
- jlemmagen.jar
- jlemmagen-lang.jar
- SLF4J API and implementation (e.g. slf4j-jdk14.jar)
Example of a Solr filter definition in the schema (e.g. for Slovak):
<filter class="org.apache.lucene.analysis.lemmagen.LemmagenFilterFactory" lexicon="mlteast-sk"/>
mvn clean release:prepare release:perform -Darguments='-Dmaven.javadoc.failOnError=false'
git push --follow-tags
All source code is licensed under the Apache License 2.0. Note, however, that the binary rule tree files (*.lem) are NOT licensed under the Apache License 2.0 and may be used only in non-commercial projects.