
This Python code takes a term and a list of terms and returns the possible duplicates of that term in the list. The basic idea is to use edit distance and longest common subsequences, considering not just immediate matches but also the matches of those matches.


gopalkoduri/string-matching


Purpose:
--------

This is a simple program for finding misspellings and alternative
spellings of a given word. It works with two string distance measures:
longest common subsequence and Damerau-Levenshtein distance.
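
As a rough illustration of the two measures named above, here is a minimal sketch in plain Python. The function names `lcs_length` and `damerau_levenshtein` are hypothetical and are not taken from stringDuplicates; the edit distance shown is the "optimal string alignment" variant of Damerau-Levenshtein, which may differ from the variant the program actually uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters agree: extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better of skipping a character in a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def damerau_levenshtein(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    swaps of adjacent characters (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "la" <-> "al".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(lcs_length("gopala", "kopala"))           # 5 ("opala")
print(damerau_levenshtein("gopala", "kopala"))  # 1 (one substitution)
print(damerau_levenshtein("gopala", "gopaal"))  # 1 (one adjacent swap)
```

A longer common subsequence and a smaller edit distance both indicate more similar strings; how the program turns these into a score compared against a threshold is not shown here.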

Its core strength is that it searches not only for direct matches, but
also for the matches of those matches, and so on. This is particularly
useful in cases such as these:

Sowrashtram matches Sourashtram but not Saurashtram.
But Sourashtram matches Saurashtram.

Of course, loosening the threshold would also catch such cases, but at
the cost of precision. The solution is therefore to search with "tight"
parameters, but with an extensive search mechanism.
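
The "matches of matches" idea can be sketched as a search that feeds each round of matches back in as new search terms. This is only an illustration under stated assumptions: `similarity`, `expand_matches`, and the `depth` parameter are hypothetical names, and the score comes from the standard library's `difflib.SequenceMatcher` as a stand-in rather than from the measures stringDuplicates itself uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity score in [0, 1]; an assumption, not the
    # scoring used by stringDuplicates.
    return SequenceMatcher(None, a, b).ratio()


def expand_matches(term, candidates, threshold=0.9, depth=1):
    """Collect direct matches of `term`, then matches of those matches.

    depth=0 returns direct matches only; each extra level searches again
    using the matches found in the previous round as search terms.
    """
    found = set()
    frontier = {term}
    for _ in range(depth + 1):
        next_frontier = set()
        for seed in frontier:
            for cand in candidates:
                if cand == term or cand in found:
                    continue
                if similarity(seed, cand) >= threshold:
                    found.add(cand)
                    next_frontier.add(cand)
        if not next_frontier:
            break  # nothing new was found, so further rounds cannot help
        frontier = next_frontier
    return sorted(found)


terms = ["Sourashtram", "Saurashtram", "Kalyani"]
print(expand_matches("Sowrashtram", terms, depth=0))  # ['Sourashtram']
print(expand_matches("Sowrashtram", terms, depth=1))  # ['Saurashtram', 'Sourashtram']
```

With the threshold kept tight at 0.9, Saurashtram is never compared loosely against Sowrashtram; it is reached only through the intermediate match Sourashtram.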

Usage:
------

>>> import stringDuplicates
>>> terms = ["kopala", "gopal", "george", "mohammed", "arjuna"]
>>> stringDuplicates.stringDuplicates("gopala", terms, simThresh=0.8, recursion=1)
['gopal', 'kopala']

The two crucial parameters, as follows from the description above, are
simThresh (the similarity threshold) and recursion (which governs the
search through matches of matches).


Contact Info:
-------------

Gopala Krishna Koduri
gopala.koduri -AT- gmail.com
