Skip to content

Ruby & C implementation of Jaro-Winkler distance algorithm which supports UTF-8 string.

License

Notifications You must be signed in to change notification settings

tepperly/jaro_winkler

 
 

Repository files navigation

Build Status

About

It's an implementation of Jaro-Winkler distance algorithm, it uses C extension and will fallback to pure Ruby version in JRuby. Both of them supports UTF-8 string.

Installation

gem install jaro_winkler

Usage

require 'jaro_winkler'
JaroWinkler.distance "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.distance "MARTHA", "marhta", ignore_case: true
# => 0.9611
JaroWinkler.distance "MARTHA", "MARHTA", weight: 0.2
# => 0.9778

# Force the strategy
JaroWinkler.c_distance "MARTHA", "MARHTA" # C extension
JaroWinkler.r_distance "MARTHA", "MARHTA" # Pure Ruby

Options

Name Type Default Note
ignore_case boolean false All lower case characters are converted to upper case prior to the comparison.
weight number 0.1 A constant scaling factor for how much the score is adjusted upwards for having common prefixes.
threshold number 0.7 The prefix bonus is only added when the compared strings have a Jaro distance above the threshold.
adj_table boolean false The option is used to give partial credit for characters that may be errors due to known phonetic or character recognition errors. A typical example is to match the letter "O" with the number "0".

About Adjusting Table

Default Table

['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],
['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],
['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],
['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']

How it works?

Original Formula:

origin

where

  • m is the number of matching characters.
  • t is half the number of transpositions.

With Adjusting Table:

adj

where

  • s is the number of nonmatching but similar characters.

Difference Between v1.3.1 And v1.3.2.beta

Version Algorithm
v1.3.1 One linked list to store sparse matrix and iterate to find similar character.
v1.3.2.beta One hash table with multiple linked lists for collision handling.

In theory, the latter should work more efficient than the former (more test data needed).

Why This?

There is also another similar gem named fuzzy-string-match which both provides C and Ruby version as well.

I reinvent this wheel because of the naming in fuzzy-string-match such as getDistance breaks convention, and some weird code like a1 = s1.split( // ) (s1.chars could be better), furthermore, it's bugged (see tables below).

Compare with other gems

            | jaro_winkler | fuzzystringmatch | hotwater    | amatch

--------------- | ------------ | ---------------- | -------- | ------ UTF-8 Suport | Yes | Pure Ruby only | No | No Windows Support | Yes | | No | Yes Adjusting Table | Yes | No | No | No Native | Yes | Yes | Yes | Yes Pure Ruby | Yes | Yes | No | No Speed | Medium | Fast | Medium | Slow Bug Found | Not Yet | Yes | Not Yet | Yes

For Bug Found, I made a rake task to build the table below, the source code is in Rakefile:

str_1 str_2 origin jaro_winkler fuzzystringmatch hotwater amatch
"henka" "henkan" 0.9667 0.9667 0.9722 0.9667 0.9444
"al" "al" 1.0 1.0 1.0 1.0 1.0
"martha" "marhta" 0.9611 0.9611 0.9611 0.9611 0.9444
"jones" "johnson" 0.8324 0.8324 0.8324 0.8324 0.7905
"abcvwxyz" "cabvwxyz" 0.9583 0.9583 0.9583 0.9583 0.9583
"dwayne" "duane" 0.84 0.84 0.84 0.84 0.8222
"dixon" "dicksonx" 0.8133 0.8133 0.8133 0.8133 0.7667
"fvie" "ten" 0.0 0.0 0.0 0.0 0.0

Benchmark

Pure Ruby

             | user     | system   | total    | real

---------------- | -------- | -------- | -------- | ------------ jaro_winkler | 1.300000 | 0.000000 | 1.300000 | ( 1.299802) fuzzystringmatch | 1.510000 | 0.000000 | 1.510000 | ( 1.510136)

  • jaro_winkler (1.3.1)
  • fuzzy-string-match (0.9.6)

Native

             | user     | system   | total    | real

---------------- | -------- | -------- | -------- | ------------ jaro_winkler | 0.350000 | 0.010000 | 0.360000 | ( 0.345293) fuzzystringmatch | 0.140000 | 0.000000 | 0.140000 | ( 0.138711) hotwater | 0.310000 | 0.000000 | 0.310000 | ( 0.306498) amatch | 0.960000 | 0.000000 | 0.960000 | ( 0.961509)

  • jaro_winkler (1.3.1)
  • fuzzy-string-match (0.9.6)
  • hotwater (0.1.2)
  • amatch (0.3.0)

Todo

  • Custom adjusting word table.

About

Ruby & C implementation of Jaro-Winkler distance algorithm which supports UTF-8 string.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Ruby 59.6%
  • C 40.4%