Given a term and a list of terms, this Python code returns the possible duplicates of that term in the list. The basic idea is to use edit distance and longest common subsequences, not just on immediate matches but also on the matches of the matches!
gopalkoduri/string-matching
Purpose:
--------
This is a simple program for finding misspellings and alternate spellings of a given word. It works with two string distance measures: longest common subsequence and Damerau-Levenshtein distance. Its core strength is that it searches not only for direct matches, but also for the matches of matches, and so on. This is particularly useful in cases such as this one: Sowrashtram matches Sourashtram but not Saurashtram, yet Sourashtram matches Saurashtram. Loosening the threshold would catch such cases, but it also decreases precision. The solution is therefore to search with "tight" parameters but with an extensive search mechanism.

Usage:
------
    >>> import stringDuplicates
    >>> terms = ["kopala", "gopal", "george", "mohammed", "arjuna"]
    >>> stringDuplicates.stringDuplicates("gopala", terms, simThresh=0.8, recursion=1)
    ['gopal', 'kopala']

The two crucial parameters, as can be understood from the description, are simThresh (the similarity threshold) and recursion (how far the search follows matches of matches).

Contact Info:
-------------
Gopala Krishna Koduri
gopala.koduri -AT- gmail.com
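Illustrative sketch:
--------------------
The matches-of-matches idea described above can be sketched roughly as follows. This is not the code in stringDuplicates; it is a minimal illustration under stated assumptions. The names find_duplicates, similarity, lcs_length and osa_distance are hypothetical, both measures are normalised by the length of the longer string, a candidate must pass the threshold on both measures, and recursion=0 is taken to mean "direct matches only". The edit distance used is the optimal string alignment variant of Damerau-Levenshtein.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a, b):
    """Assumed combination: the weaker of the two normalised scores."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    lcs_sim = lcs_length(a, b) / longest
    edit_sim = 1 - osa_distance(a, b) / longest
    return min(lcs_sim, edit_sim)


def find_duplicates(term, terms, sim_thresh=0.8, recursion=1):
    """Direct matches of term, plus matches of matches up to `recursion` extra levels."""
    found = set()
    frontier = {term}
    for _ in range(recursion + 1):
        new = {c for c in terms
               if c != term and c not in found
               and any(similarity(q, c) >= sim_thresh for q in frontier)}
        if not new:
            break
        found |= new
        frontier = new  # the next level searches from the newly found matches
    return sorted(found)


spellings = ["Sourashtram", "Saurashtram"]
print(find_duplicates("Sowrashtram", spellings, sim_thresh=0.85, recursion=0))
# ['Sourashtram']
print(find_duplicates("Sowrashtram", spellings, sim_thresh=0.85, recursion=1))
# ['Saurashtram', 'Sourashtram']
```

With the tight threshold of 0.85, Saurashtram is too far from Sowrashtram to match directly (both scores are 9/11), but it is reached through Sourashtram once one level of recursion is allowed. The threshold stays strict and only the search becomes wider.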