A collection of awesome resources regarding Entity Resolution.
Entity Resolution (ER) aims to identify different descriptions that refer to the same real-world object. Detecting duplicate entities within a single database is referred to as deduplication, while record linkage refers to detecting matching entities across two different databases.
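
The difference between deduplication and record linkage can be sketched in a few lines of Python. This is a minimal illustration, not how any tool listed here works: the helper names (`similar`, `deduplicate`, `link`), the toy records, and the 0.85 threshold are assumptions made for the example, and string similarity comes from the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when two normalized strings are close enough to count as a match."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def deduplicate(records: list[str]) -> list[tuple[str, str]]:
    """Deduplication: compare every pair of records within a single database."""
    return [(a, b) for a, b in combinations(records, 2) if similar(a, b)]


def link(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Record linkage: compare records across two different databases."""
    return [(a, b) for a, b in product(left, right) if similar(a, b)]


# Hypothetical toy databases (assumed for illustration only).
customers = ["Jon Smith", "John Smith", "Maria Garcia"]
suppliers = ["Maria Garcia", "Wei Chen"]

duplicates = deduplicate(customers)  # pairs found inside one database
links = link(customers, suppliers)   # pairs found across two databases
```

Both functions compare all candidate pairs, which is quadratic in the number of records; the blocking techniques listed in Table 1 exist to avoid exactly that.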
- Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen (2012)
- Data Quality and Record Linkage Techniques by Thomas N. Herzog, Fritz J. Scheuren & William E. Winkler (2007)
- 2020 | An Overview of End-to-End Entity Resolution for Big Data | Vassilis Christophides, et al. | pdf
- 2012 | A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication | Peter Christen | pdf
- 2007 | Duplicate Record Detection: A Survey | Ahmed K. Elmagarmid, et al. | pdf
- 2016 | Magellan: Toward Building Entity Matching Management Systems | Pradap Konda, et al. | pdf | git
...
- 1969 | A Theory for Record Linkage | Fellegi, I.P., Sunter, A.B. | pdf
...
- 2019 | Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records | T. Enamorado, et al. | pdf | GiT
- 2020 | Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications | Yan Yan, et al. | pdf
Table 1 is a compilation of the tools presented in (2020, V. Christophides), (2015, P. Konda) and J535D165/data-matching-software.
Table 1: Blocking: Attribute equivalence (AE), Blocking index (BI), Canopy clustering (CC), Canopy index (CI), Clustering (C), Expectation maximization (EM), Full index (FI), Hash-based (HB), Hybrid (H), Induction (I), Predicate-based (PB), Probabilistic (P), Relational clustering (RC), Rule-based (RB), Sorted neighborhood (SN), Sorting index (SoI), Stringmap index (StI), Suffixarray index (SuI). Matching: Agglomerative hierarchical clustering-based (AHC), Decision trees (DT), Farthest First (FF), Fellegi-Sunter (FS), k-Nearest-neighbour (KNN), Logistic regression (LR), Optimal threshold (OT), Support vector machine (SVM), TwoStep (TS).
Tools | Blocking | Matching | Clustering | UI | Scaling | Language | OSS | GiT/Inst | Paper |
---|---|---|---|---|---|---|---|---|---|
Active Atlas | HB | DT | --- | GUI, CMD | ❌ | Java | ❌ | --- | --- |
Atyimo | --- | --- | --- | --- | --- | Python | --- | git | --- |
BigMatch | AE, RB | ❌ | --- | CMD | ✔️ | C | ❌ | --- | (2002, W. E. Yancey) |
D-Dupe | AE | RC | --- | GUI, CMD | ❌ | C# | ❌ | --- | (2006, M. Bilgic) |
Dedoop | AE, SN | DT, LR, SVM, etc | --- | GUI | Hadoop | Java | ❌ | install | (2012, L Kolb) |
Dedupe | CC, PB | AHC | ❌ | API, CMD | ✔️ | Python | ✔️ | git | (2003, M. Bilenko), (2006, M. Bilenko) |
DuDe | SN | RB | ❌ | CMD | ❌ | Java | ✔️ | install | (2010, U. Draisbach) |
Duke | ✔️ | ✔️ | --- | CMD | ❌ | Java | --- | git | Blog: (2011, L. Marius) |
FAMER | ❌ | ❌ | ✔️ | --- | Apache Flink | --- | --- | gitlab | (2018, A Saeedi) |
fastLink | ❌ | --- | --- | API | ✔️ | R | ✔️ | git | (2017, T. Enamorado) |
Febrl | BI, CI, FI, SoI, StI, SuI, Q-gram | FS, OT, K-means, FF, SVM, TS | ❌ | GUI | ❌ | Python | ✔️ | install | (2013, P. Christen) |
FRIL | AE, SN | EM | ❌ | GUI | ❌ | Java | ✔️ | install | (2008, P Jurczyk) |
JedAI | ✔️ | ✔️ | ✔️ | GUI | Apache Spark | Java | ✔️ | git | (2020, G. Papadakis) |
KnoFuss | ✔️ | ✔️ | --- | --- | ❌ | Java | --- | --- | (2008, A. Nikolov) |
LIMES | --- | --- | --- | GUI | ❌ | Java | ✔️ | git | (2011, A. C. N. Ngomo) |
Magellan | ✔️ | ✔️ | ❌ | API, GUI | Apache Spark | Python | ✔️ | git | (2016, P. Konda) |
MARLIN | CC | DT, SVM | --- | ❌ | --- | --- | --- | --- | (2004, M. Bilenko) |
Merge Toolbox | AE, CC | P, EM | --- | GUI | ❌ | Java | ❌ | install | (2004, R. Schnell) |
MinoanER | ✔️ | ✔️ | ❌ | GUI | Apache Spark | Java | ✔️ | --- | (2019, V. Efthymiou) |
NADEEF | --- | RB | --- | GUI | ❌ | Java | ❌ | --- | (2013, M. Dallachiesa) |
OYSTER | AE | RB | --- | CMD | ❌ | Java | ✔️ | install | (2011, E. D. Nelson) |
PRIL | --- | --- | --- | GUI | --- | C# | --- | git | (2018, C. T. Rentsch) |
pydedupe | AE | KNN, K-means, RB | --- | CMD | ❌ | Python | ✔️ | git | --- |
Reclin2 | --- | --- | --- | API | --- | R | --- | git | --- |
RELAIS | --- | --- | --- | GUI | --- | R/Java | --- | install | (2006, M. Fortini) |
RLTK | --- | --- | --- | API | --- | Python | ✔️ | git | --- |
Record Linkage (R) | AE | ML-based | --- | CMD | ❌ | R | ✔️ | cran | (2011, M Sariyar) |
Record Linkage (Python) | FI, BI, SN | DC, LR, SVM, K-means, EM | --- | API | ❌ | Python | ✔️ | git | 2015, inspired by FEBRL |
SERIMI | ✔️ | ✔️ | --- | --- | --- | Ruby | --- | git | (2015, S Araujo) |
SERF | --- | R-swoosh | --- | CMD | ❌ | Java | ❌ | git | (2009, O. Benjelloun) |
Splink | ✔️ | EM, etc? | ✔️ | API, GUI | Apache Spark | Python | ✔️ | git | 2019, same as fastLink |
Silk | --- | RB | --- | GUI | Hadoop | Scala | ✔️ | git, install | (2009, J. Volz) |
TAILOR | AE, SN | P, C, H, I | --- | GUI | ❌ | Java | ❌ | --- | (2002, M. G. Elfeky) |
WHIRL | --- | --- | --- | CMD | ❌ | C++ | ❌ | install | (2000, W.W Cohen) |
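
To make two of the blocking abbreviations in Table 1 concrete, here is a rough sketch of attribute equivalence (AE) and sorted neighborhood (SN) blocking. It is a simplified illustration and does not reproduce the implementation of any listed tool: the function names, the toy records, and the window size are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, surname, zip code).
records = [
    (1, "smith", "10001"),
    (2, "smyth", "10001"),
    (3, "jones", "94105"),
    (4, "smith", "94105"),
]


def attribute_equivalence_blocks(rows, key_index):
    """AE blocking: only records sharing the same key value become candidate pairs."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key_index]].append(row[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_neighborhood_pairs(rows, key_index, window=2):
    """SN blocking: sort by key, then pair records that fall inside a sliding window."""
    ordered = sorted(rows, key=lambda row: row[key_index])
    pairs = set()
    for i, row in enumerate(ordered):
        # A window of size w holds w consecutive records, so pair with the next w - 1.
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((row[0], other[0]))))
    return pairs


ae_pairs = attribute_equivalence_blocks(records, key_index=2)  # block on zip code
sn_pairs = sorted_neighborhood_pairs(records, key_index=1, window=2)  # sort on surname
```

Either function returns a set of candidate id pairs that a matching step would then score. Here each method yields fewer than the six pairs an exhaustive comparison of four records would produce.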
- University of Leipzig: Benchmark datasets for entity resolution
- Restaurant
- Rexa-DBLP
- BBCmusic-DBpedia
- YAGO-IMDb
- List of blog posts on "Probabilistic Record Linkage" by Robin Linacre (Lead developer of Splink).
- TWD series
- GiT: Data Matching software
- Documentation: FAst Multi-source Entity Resolution system (FAMER)
- SERF - Stanford Entity Resolution Framework: Homepage
- Silk - related publications.
- Magellan - related material.