
Code repository with customisable Fuzzy Matching scripts in STATA and Python, especially useful when working with datasets containing Hindi text transliterated to English.


IDinsight/hindi-fuzzy-merge

hindi-fuzzy-merge

This repository contains customizable Fuzzy Matching scripts written in STATA and Python, especially useful for datasets containing Hindi text transliterated to English.

Overview

This algorithm is motivated by the fact that Hindi names written in Devanagari script are not transliterated to Latin script in a consistent way, so the same name can appear under several different English spellings. Although fuzzy matching programs exist, most are optimized for text originally written in Latin script, and so they perform poorly when applied to transliterated Hindi names.

We also found that match rates could be improved substantially by taking a stepwise approach, starting with the most certain matches and progressively loosening restrictions. False matches in fuzzy matching algorithms propagate: an early false match that incorrectly removes an individual from the match pool leads the algorithm to make false matches with other individuals in later steps.

By completing more certain matches before moving on to less certain ones, we found that our stepwise algorithm produced lower false match rates than running a fuzzy matching program a single time.
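The stepwise idea can be illustrated with a minimal Python sketch. This is not the code in this repository: the function names `similarity` and `stepwise_match`, the threshold values, and the use of the standard library's `difflib.SequenceMatcher` as the scorer are all assumptions chosen for illustration. It only shows the general pattern of accepting the most certain matches first and removing them from the pool before loosening the threshold.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for a tuned scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def stepwise_match(left, right, thresholds=(1.0, 0.85, 0.7)):
    """Match names one-to-one, applying the strictest threshold first.

    Returns a list of (left_name, right_name, threshold_used) tuples.
    The default thresholds are illustrative, not recommended values.
    """
    unmatched_left = list(left)
    unmatched_right = list(right)
    matches = []
    for threshold in thresholds:
        # Score every remaining pair and keep those that clear this step's bar.
        candidates = [
            (similarity(a, b), a, b)
            for a in unmatched_left
            for b in unmatched_right
        ]
        candidates = [c for c in candidates if c[0] >= threshold]
        # Accept the best-scoring pairs first; each name is used at most once.
        for score, a, b in sorted(candidates, reverse=True):
            if a in unmatched_left and b in unmatched_right:
                matches.append((a, b, threshold))
                # Matched names leave the pool before the next, looser step.
                unmatched_left.remove(a)
                unmatched_right.remove(b)
    return matches
```

Because exact and near-exact pairs are removed in the early steps, they cannot be claimed by a weaker candidate once the threshold is relaxed, which is the propagation problem described above.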

Directory Structure

.
|-- hindi-fuzzy-merge
     |-- fuzzymerge-python # Directory with an example of the algorithm implemented in Python for matching household survey results with data collected from school registers
     |-- fuzzymerge-stata # Directory with an example of the algorithm implemented in STATA for matching household census data with voter rolls
     |-- transliteration # Directory with example code for transliteration of Devanagari script to English using the Polyglot Python package
