Akin is a collection of string comparison algorithms for Elixir. This solution was born of a Record Linking project. It combines and modifies The Fuzz and Fuzzy Compare. Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitiate the disambiguation of names but can be used to compare any two binaries.
Table of Contents
Add a dependency in your mix.exs:
deps: [{:akin, "~> 0.2.0"}]
To see all of the avialable algorithms. Hamming Distance is excluded as it only compares strings of equal length. Hamming may be called directly. See: Single Algorithms
iex> Akin.Util.list_algorithms()
["bag_distance", "substring_set", "sorensen_dice", "jaccard", "jaro_winkler",
"levenshtein", "metaphone", "double_metaphone", "substring_double_metaphone", "ngram",
"overlap", "substring_sort", "tversky"]
Compare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.
iex> Akin.compare("weird", "wierd")
%{
bag_distance: 1.0,
sorensen_dice: 0.25,
double_metaphone: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94,
levenshtein: 0.6,
metaphone: 1.0,
ngram: 0.25,
overlap: 0.25,
tversky: 0.14
}
iex> Akin.compare("beginning", "begining")
%{
bag_distance: 0.89,
sorensen_dice: 0.93,
double_metaphone: 1.0,
jaccard: 0.88,
jaro_winkler: 0.95,
levenshtein: 0.89,
metaphone: 1.0,
ngram: 0.88,
overlap: 1.0,
tversky: 0.88
}
Comparison accepts options in a Keyword list.
algorithms
: algorithms to use in comparision. Accepts the name or a keyword list. Default is algorithms/0.metric
- algorithm metric. Default is both
- "string": uses string algorithms
- "phonetic": uses phonetic algorithms
unit
- algorithm unit. Default is both.
- "whole": uses algorithms best suited for whole string comparison (distance)
- "partial": uses algorithms best suited for partial string comparison (substring)
level
- level for double phonetic matching. Default is "normal".- "strict": both encodings for each string must match
- "strong": the primary encoding for each string must match
- "normal": the primary encoding of one string must match either encoding of other string (default)
- "weak": either primary or secondary encoding of one string must match one encoding of other string
match_at
: an algorith score equal to or above this value is condsidered a match. Default is 0.9ngram_size
: number of contiguous letters to split strings into. Default is 2.short_length
: qualifies as "short" to recieve a shortness boost. Used by Name Metric. Default is 8.stem
: boolean representing whether to compare the stemmed version the strings; uses Stemmer. Defaultfalse
Restrict the list of algorithms by name or metric and/or unit.
iex> opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
iex> Akin.compare("weird", "wierd", opts)
%{
bag_distance: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94
}
iex> opts = [algorithms: [metric: "phonetic", unit: "whole"]]
iex > Akin.compare("weird", "wierd", opts)
%{
double_metaphone: 1.0,
metaphone: 1.0
}
The default ngram size for the algorithms is 2. You can change by setting a value in opts.
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"]])
%{sorensen_dice: 0.25}
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"], ngram_size: 1])
%{sorensen_dice: 0.8}
The default match strictness is "normal" You change it by setting
a value in opts. Currently it only affects the outcomes of the substring_set
and
double_metaphone
algorithms
iex> left = "Alice in Wonderland"
iex> right = "Alice's Adventures in Wonderland"
iex> Akin.compare(left, right, [algorithms: ["substring_set"]])
%{substring_set: 0.85}
iex> Akin.compare(left, right, [algorithms: ["substring_set"], level: "weak"])
%{substring_set: 0.85}
iex> left = "which way"
iex> right = "whitch way"
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "weak"])
%{double_metaphone: 1.0}
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "strict"])
%{double_metaphone: 0.0}
Compare the stemmed version of two strings.
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"]])
%{bag_distance: 0.57, double_metaphone: 0.0}
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"], stem: true])
%{bag_distance: 1.0, double_metaphone: 1.0}
iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance", "jaro_winkler", "jaccard"])
%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}
iex> Akin.compare("weird", "wierd", algorithms: [metric: "string", unit: "whole"], ngram_size: 1)
%{
bag_distance: 1.0,
jaccard: 0.67,
jaro_winkler: 0.94,
levenshtein: 0.6,
sorensen_dice: 0.8,
tversky: 1.0
}
Before being compared, strings are converted to downcase and unicode standard, whitespace is standardized, nontext (like punctuation & emojis) is replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms.
"Alice Liddell" becomes
%Akin.Corpus{
list: ["alice", "liddell"],
original: "alice liddell",
set: #MapSet<["alice", "liddell"]>,
stems: ["alic", "liddel"],
string: "aliceliddell"
}
iex> Akin.compare("Hubert Łępicki", "Hubert Lepicki")
%{
bag_distance: 0.92,
dice_sorensen: 0.83,
double_metaphone: 0.0,
jaccard: 0.71,
jaro_winkler: 0.97,
levenshtein: 0.92,
metaphone: 0.0,
ngram: 0.83,
overlap: 0.83,
tversky: 0.71
}
iex> Akin.phonemes("virginia")
["frjn", "frkn"]
iex> Akin.phonemes("beginning")
["bjnnk", "pjnnk", "pknnk"]
iex> Akin.phonemes("wonderland")
["wntrlnt", "antrlnt", "fntrlnt"]
UNDER DEVELOPMENT
Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).
Given the name of an author which is divided into the given, middle, and family name parts (i.e. "Virginia", nil, "Woolf") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initals and the sets of the right string.
If the comparison metrics produce a score greater than or equal to 0.9, they considered a match and returned in the list.
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["v woolfe", "v woolf"]
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["virginia woolfe", "v woolf"]
This may not be what you want. There are likely to be unwanted matches.
iex> Akin.match_names("V. Woolf", ["Victor Woolf", "Virginia Woolf", "V White", "V Woolf", "Virginia Woolfe"])
["v woolf", "virginia woolf", "victor woolf"]
Bag Distance
The bag distance is a cheap distance measure which always returns a distance smaller or equal to the edit distance. It's meant to be an efficient approximation of the distance between two strings to quickly rule out strings that are largely different.
Double Metaphone
Calculates the Double Metaphone Phonetic Algorithm metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.
- "strict": both encodings for each string must match
- "strong": the primary encoding for each string must match
- "normal": the primary encoding of one string must match either encoding of other string (default)
- "weak": either primary or secondary encoding of one string must match one encoding of other string
Hamming Distance
Note: Hamming algorithm is not used in an of the comparison functions because it requires the strings being compared are of the same length. It can be called directly, however, so it is still a part of this library.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding letters are different. Returns the percentage of change needed to the left string of the comparison of one string (left) with another string (right). Returns 0.0 if the strings are not the same length. Returns 1.0 if the string are equal.
Jaccard Similarity
Calculates the similarity of two strings as the size of the intersection divided by the size of the union. Default ngram: 2.
Jaro-Winkler Similarity
Jaro-Winkler calculates the edit distance between two strings. A score of one denotes equality. Unlike the Jaro Similarity, it modifies the prefix scale to gives a more favorable rating to strings that match from the beginning.
Levenshtein Distance
Compare two strings for their Levenshtein score. The score is determined by finding the edit distance: the minimum number of single-character edits needed to change one word into the other. The distance is substracted from 1.0 and then divided by the longest length between the two strings.
Metaphone
Compares two strings by converting each to an approximate phonetic representation in ASCII and then comparing those phoenetic representations. Returns a 1 if the phoentic representations are an exact match.
N-Gram Similarity
Calculates the ngram distance between two strings. Default ngram: 2.
Overlap Metric
Uses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.
Sørensen–Dice
Sørensen–Dice coefficient is calculated using bigrams. The equation is 2nt / nx + ny
where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams in both strings. For example, the bigrams of night
and nacht
are {ni,ig,gh,ht}
and {na,ac,ch,ht}
. They each have four and the intersection is ht
.
(2 · 1) / (4 + 4) = 0.25
Substring Double Metaphone
Iterate over the cartesian product of the two lists sending each element through
the Double Metaphone using all strictness levels until a true value is found
in the list of returned booleans from the Double Metaphone algorithm. Return the
percentage of true values found. If true is never returned, return 0. Increases
accuracy for search terms containing more than one word.
Substring Set
Splits the strings on spaces, sorts, re-joins, and then determines Jaro-Winkler distance. Best when the strings contain irrelevent substrings.
Substring Sort
Sorts substrings by words, compares the sorted strings in pairs, and returns the maximum ratio. If one strings is signficantly longer than the other, this method will compare matching substrings only.
Tversky
A generalization of Sørensen–Dice and Jaccard.
- Disambiguation Datasets
- Double Metaphone in python
- Fuzzy Compare
- Python Fuzzy Wuzzy (2011)
- ML Authur Block Dismabiguation
- ML Author Name Disambiguation
- Record Linking
- The Fuzz
- Homophones used in testing metaphone algorithms
- Further enhancements to name matching
- Add Damerau-Levenshtein algorithm
- Add Caverphone algorithm