HON Lucene Levenstein Distance

Custom patch of the default Lucene Levenstein Distance that uses a more "normalized" measure.

The problem with the regular Lucene Levenstein distance is that it prefers insertions to substitutions in certain cases. For instance, Lucene gives a similarity of 0.83 between "candi" and "candid," but 0.8 between "candi" and "candy," because the edit distance is normalized over the length of the longer string: both pairs are one edit apart, but 1 - 1/6 is about 0.83 while 1 - 1/5 is 0.8. A traditional Levenshtein distance calculation would give the same distance for both.

This implementation attempts to give a more "classic" Levenshtein distance by normalizing over the length of the target word (the word being corrected, "candi" in the example above), rather than the max length of the two words. This means that, in the example above, the similarity score would be 0.8 for both "candid" and "candy."
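The difference between the two normalizations can be sketched in plain Java. This is an illustration only, not the code shipped in this project: the class and method names (NormalizationSketch, maxLengthSimilarity, targetLengthSimilarity) are hypothetical, and clamping negative scores to 0 is an assumption made here to keep the result in the range 0 to 1.

```java
import java.util.Locale;

public class NormalizationSketch {

    /** Classic edit distance: insertions, deletions and substitutions each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Normalizes by the longer of the two strings, as described for stock Lucene. */
    static float maxLengthSimilarity(String target, String other) {
        int longer = Math.max(target.length(), other.length());
        if (longer == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) editDistance(target, other) / longer;
    }

    /** Normalizes by the length of the target word only. */
    static float targetLengthSimilarity(String target, String other) {
        if (target.isEmpty()) {
            return other.isEmpty() ? 1.0f : 0.0f;
        }
        float score = 1.0f - (float) editDistance(target, other) / target.length();
        return Math.max(0.0f, score); // assumption: clamp when distance exceeds target length
    }

    public static void main(String[] args) {
        String target = "candi";
        for (String candidate : new String[] {"candid", "candy"}) {
            System.out.printf(Locale.ROOT, "%s -> %s: max-length=%.2f, target-length=%.2f%n",
                    target, candidate,
                    maxLengthSimilarity(target, candidate),
                    targetLengthSimilarity(target, candidate));
        }
    }
}
```

Running main compares "candi" against "candid" and "candy" under both schemes. The max-length scores differ even though both candidates are one edit away, while the target-length scores are equal.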

This improves spellchecking results in Lucene/Solr because it ensures that results are sorted according to their true edit distance. Without it, "candid" would always be preferred to "candy" as a correction for "candi," even though they should be equally likely in terms of edit distance.

Installation

Download the code and run:

mvn install

Then copy target/hon-lucene-levenstein-1.0.jar into the lib directory of your Solr WAR.

Then, for each spellchecker that should use this distance measure, add the following to its definition in solrconfig.xml:

<lst name="spellchecker">
   ...
   <str name="distanceMeasure">org.healthonnet.spellcheck.levenstein.NormalizedLevensteinDistance</str>
   ...
</lst>

Developer

Nolan Lawson

Health on the Net Foundation

License

Apache 2.0.
