Please wait a minute...
Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (5): 27-40    DOI: 10.11925/infotech.2096-3467.2018.1388
Current Issue | Archive | Adv Search |
Reviewing Basic Methods of Entity Resolution
Guangshang Gao()
Business School, Guilin University of Technology, Guilin 541004, China
Download: PDF (851 KB)   HTML ( 6
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] This paper discusses the classical entity resolution methods and logical thinking in entity resolution theory. [Coverage] Google Scholar and CNKI were respectively used to search literatures with the keywords “Entity Resolution”, “Collective Analysis”, “Crowdsourced”, “Active Learning”, “Privacy-Preserving” and “Entity Resolution” in Chinese. I then obtained a total of 86 representative literatures in conjunction with topic screening, intensive reading and retrospective method. [Methods] For each entity resolution method, the paper first summarizes and analyzes the basic idea of the method, and presents the resolution process through illustration, and then focuses on analyzing the key strategies, algorithms or techniques adopted by the existing research in the process of implementation of the method. [Results] Entity resolution is the basic operation of data quality management, and the key step to find the value of data. [Limitations] There is no in-depth analysis of the evaluation indicators and application of each entity resolution method. [Conclusions] Although existing entity resolution methods can meet the requirements of most applications to some extent, they still face challenges in data heterogeneity, privacy protection and distributed environment in the big data environment.

Key wordsEntity Resolution      Collective Analysis      Crowdsourced      Active Learning      Privacy-Preserving     
Received: 07 December 2018      Published: 03 July 2019

Cite this article:

Guangshang Gao. Reviewing Basic Methods of Entity Resolution. Data Analysis and Knowledge Discovery, 2019, 3(5): 27-40.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1388     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I5/27

[1] 孟小峰, 杜治娟. 大数据融合研究: 问题与挑战[J]. 计算机研究与发展, 2016, 53(2): 231-246.
[1] (Meng Xiaofeng, Du Zhijuan.Research on the Big Data Fusion: Issues and Challenges[J]. Journal of Computer Research and Development, 2016, 53(2): 231-246.)
[2] 李建中, 王宏志, 高宏. 大数据可用性的研究进展[J]. 软件学报, 2016, 27(7): 1605-1625.
[2] (Li Jianzhong, Wang Hongzhi, Gao Hong.State-of-the-Art of Research on Big Data Usability[J]. Journal of Software, 2016, 27(7): 1605-1625.)
[3] Dong X L, Srivastava D.Big Data Integration[C]// Proceedings of the 29th International Conference on Data Engineering. 2013: 1245-1248.
[4] Getoor L, Machanavajjhala A.Entity Resolution: Theory, Practice & Open Challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2018-2019.
[5] Dong X L, Rekatsinas T.Data Integration and Machine Learning: A Natural Synergy[C]// Proceedings of the 2018 International Conference on Management of Data. 2018: 1645-1650.
[6] Newcombe H B, Kennedy J M, Axford S J, et al.Automatic Linkage of Vital Records[J]. Science, 1959, 130(3381): 954-959.
[7] Talburt J R.Entity Resolution and Information Quality[M]. Elsevier, 2011.
[8] Fellegi I P, Sunter A B.A Theory for Record Linkage[J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[9] Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Metrics for Matching Names and Records[A]// KDD Workshop on Data Cleaning & Object Consolidation[M]. 2003, 4(2): 73-78.
[10] Naumann F, Herschel M.An Introduction to Duplicate Detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
[11] Doan A, Halevy A, Ives Z.Principles of Data Integration[M]. Elsevier, 2012.
[12] Levenshtein V I.Binary Codes Capable of Correcting Deletions, Insertions, and Reversals[J]. Doklady Akademii Nauk SSSR, 1965, 163(4): 845-848.
[13] Jaro M A.Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida[J]. Journal of the American Statistical Association, 1989, 84(406): 414-420.
[14] Winkler W E, Thibaudeau Y.An Application of the Felligi Sunter Model of Record Linkage to the 1990 US Decennial Census[R]. Washington: US Bureau of the Census, 1991.
[15] On B W, Koudas N, Lee D, et al.Group Linkage[C]// Proceedings of the 23rd International Conference on Data Engineering. 2007: 496-505.
[16] Bilenko M, Mooney R J, Cohen W W, et al.Adaptive Name Matching in Information Integration[J]. IEEE Intelligent Systems, 2003, 18(5): 16-23.
[17] Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Distance Metrics for Name-Matching Tasks[C]// Proceedings of the 2003 International Joint Conference on Artificial Intelligence. 2003: 73-78.
[18] Monge A E, Elkan C.The Field Matching Problem: Algorithms and Applications[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 1996: 267-270.
[19] Whang S E, Garcia-Molina H.Developments in Generic Entity Resolution[A]// Bulletin of the Technology Committee on Data Engineering[M]. IEEE Computer Society, 2011.
[20] Benjelloun O, Garcia-Molina H, Kawai H, et al.Generic Entity Resolution in the Serf Project[R]. Stanford InfoLab, 2006.
[21] Benjelloun O, Garcia-Molina H, Menestrina D, et al.Swoosh: A Generic Approach to Entity Resolution[J]. The VLDB Journal, 2009, 18(1): 255-276.
[22] Kawai H, Garcia-Molina H, Benjelloun O, et al.P-Swoosh: Parallel Algorithm for Generic Entity Resolution[R]. 2006.
[23] Kim H S, Lee D.Parallel Linkage[C]// Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007: 283-292.
[24] Kawai H, Garcia-Molina H, Benjelloun O, et al.Bufoosh: Buffering Algorithms for Generic Entity Resolution[R]. 2006.
[25] Benjelloun O, Garcia-Molina H, Gong H, et al.D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution[C]// Proceedings of the 27th International Conference on Distributed Computing Systems. IEEE, 2007.
[26] Bhattacharya I, Getoor L.Collective Entity Resolution in Relational Data[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): 5-18.
[27] Malin B, Airoldi E, Carley K M.A Network Analysis Model for Disambiguation of Names in Lists[J]. Computational & Mathematical Organization Theory, 2005, 11(2): 119-139.
[28] 高学东, 黄月. 异质对象协同实体解析的联合聚类算法[J]. 系统工程理论与实践, 2015, 35(4): 997-1004.
[28] (Gao Xuedong, Huang Yue.Co-Clustering Algorithm for Collective Entity Resolution of Multi-Typed Objects[J]. Systems Engineering - Theory & Practice, 2015, 35(4): 997-1004.)
[29] Dong X, Halevy A, Madhavan J.Reference Reconciliation in Complex Information Spaces[C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 2005: 85-96.
[30] 孙琛琛, 申德荣, 寇月, 等. 面向关联数据的联合式实体识别方法[J]. 计算机学报, 2015, 38(9): 1739-1754.
[30] (Sun Chenchen, Shen Derong, Kou Yue, et al.A Related Data Oriented Joint Entity Resolution Approach[J]. Chinese Journal of Computers, 2015, 38(9): 1739-1754.)
[31] Kalashnikov D V.Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph[J]. ACM Transactions on Database Systems, 2006, 31(2): 716-767.
[32] 冯剑红, 李国良, 冯建华. 众包技术研究综述[J]. 计算机学报, 2015, 38(9): 1713-1726.
[32] (Feng Jianhong, Li Guoliang, Feng Jianhua.A Survey on Crowdsourcing[J]. Chinese Journal of Computers, 2015, 38(9): 1713-1726.)
[33] Lee J, Cho H, Park J W, et al.Hybrid Entity Clustering Using Crowds and Data[J]. The VLDB Journal, 2013, 22(5): 711-726.
[34] Wang J, Kraska T, Franklin M J, et al.CrowdER: Crowdsourcing Entity Resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11): 1483-1494.
[35] Wang S, Xiao X, Lee C H.Crowd-Based Deduplication: An Adaptive Approach[C]// Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 2015: 1263-1277.
[36] Demartini G, Difallah D E, Cudré-Mauroux P, et al.ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking[C]// Proceedings of the 21st International Conference on World Wide Web. 2012: 469-478.
[37] Liu X, Lu M, Ooi B C, et al.CDAS: A Crowdsourcing Data Analytics System[J]. Proceedings of the VLDB Endowment, 2012, 5(10): 1040-1051.
[38] Settles B.Active Learning Literature Survey[R]. University of Wisconsinmadison, 2009.
[39] 杨文柱, 田潇潇, 王思乐, 等. 主动学习算法研究进展[J]. 河北大学学报:自然科学版, 2017, 37(2): 216-224.
[39] (Yang Wenzhu, Tian Xiaoxiao, Wang Sile, et al.Recent Advances in Active Learning Algorithms[J]. Journal of Hebei University: Natural Science Edition, 2017, 37(2): 216-224.)
[40] Sarawagi S, Bhamidipaty A.Interactive Deduplication Using Active Learning[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 269-278.
[41] Tejada S, Knoblock C A, Minton S.Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 350-359.
[42] Bellare K, Iyengar S, Parameswaran A G, et al.Active Sampling for Entity Matching[C]// Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012: 1131-1139.
[43] Arasu A, Kaushik R.On Active Learning of Record Matching Packages[C]// Proceeding of the 10th ACM SIGMOD International Conference on Management of Data. 2010: 783-794.
[44] Qian K, Popa L, Sen P.Active Learning for Large-Scale Entity Resolution[C]// Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 2017: 1379-1388.
[45] Fisher J, Christen P, Wang Q.Active Learning Based Entity Resolution Using Markov Logic[C]// Proceedings of the 2016 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2016: 338-349.
[46] Fu Y, Zhu X, Li B.A Survey on Instance Selection for Active Learning[J]. Knowledge and Information Systems, 2013, 35(2): 249-283.
[47] Ramadan B.Indexing Techniques for Real-Time Entity Resolution[D]. Canberra: The Australian National University, 2016.
[48] Ramadan B, Christen P, Liang H, et al. Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution[J]. Journal of Data and Information Quality (JDIQ), 2015, 6(4): Article No.15.
[49] Bayardo R J, Ma Y, Srikant R.Scaling up All Pairs Similarity Search[C]// Proceedings of the 16th International Conference on World Wide Web. 2007: 131-140.
[50] Broder A Z, Carmel D, Herscovici M, et al.Efficient Query Evaluation Using a Two-Level Retrieval Process[C]// Proceedings of the 12th International Conference on Information and Knowledge Management. 2003: 426-434.
[51] Zobel J, Moffat A.Inverted Files for Text Search Engines[J]. ACM Computing Surveys(CSUR), 2006, 38(2): 6.
[52] Christen P.A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9): 1537-1555.
[53] Christen P, Gayler R, Hawking D.Similarity-Aware Indexing for Real-Time Entity Resolution[C]// Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009: 1565-1568.
[54] Ramadan B, Christen P, Liang H, et al.Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution[C]// Proceedings of the 2013 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2013: 47-58.
[55] Ramadan B, Christen P.Unsupervised Blocking Key Selection for Real-Time Entity Resolution[C]// Proceedings of the 2015 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2015: 574-585.
[56] Papadakis G, Koutrika G, Palpanas T, et al.Meta-Blocking: Taking Entity Resolutionto the Next Level[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(8): 1946-1960.
[57] Efthymiou V, Papadakis G, Papastefanatos G, et al.Parallel Meta-Blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data[C]// Proceedings of the 2015 IEEE International Conference on Big Data. 2015: 411-420.
[58] Dragut E C, Ouzzani M, Elmagarmid A K.Query-Time Record Linkage and Fusion over Web Databases[C]// Proceedings of the 31st International Conference on Data Engineering. IEEE, 2015: 42-53.
[59] Gruenheid A, Dong X L, Srivastava D.Incremental Record Linkage[J]. Proceedings of the VLDB Endowment, 2014, 7(9): 697-708.
[60] Whang S E, Marmaros D, Garcia-Molina H.Pay-as-you-go Entity Resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5): 1111-1124.
[61] Hall R, Fienberg S E.Privacy-Preserving Record Linkage[C]// Proceedings of the 2013 International Conference on Privacy in Statistical Databases. 2010: 269-283.
[62] 韩姝敏, 申德荣, 聂铁铮, 等. 一种基于隐私保护下的多方记录链接方法[J]. 软件学报, 2017, 28(9): 2281-2292.
[62] (Han Shumin, Shen Derong, Nie Tiezheng, et al.Multi-Party Privacy-Preserving Record Linkage Apprpoach[J]. Journal of Software, 2017, 28(9): 2281-2292.)
[63] Kuzu M, Kantarcioglu M, Inan A, et al.Efficient Privacy-Aware Record Integration[C]// Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013: 167-178.
[64] Sweeney L.K-Anonymity: A Model for Protecting Privacy[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570.
[65] Sweeney L.Achieving K-Anonymity Privacy Protection Using Generalization and Suppression[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 571-588.
[66] Kantarcioglu M, Jiang W, Malin B.A Privacy-Preserving Framework for Integrating Person-Specific Databases[C]// Proceedings of the 2008 International Conference on Privacy in Statistical Databases. Springer, 2008: 298-314.
[67] Dwork C.Differential Privacy: A Survey of Results[C]// Proceedings of the 5th International Conference on Theory and Applications of Models of Computation. 2008: 1-19.
[68] Inan A, Kantarcioglu M, Ghinita G, et al.Private Record Matching Using Differential Privacy[C]// Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010: 123-134.
[69] 张啸剑, 孟小峰. 面向数据发布和分析的差分隐私保护[J]. 计算机学报, 2014, 37(4): 927-949.
[69] (Zhang Xiaojian, Meng Xiaofeng.Differential Privacy in Data Publication and Analysis[J]. Chinese Journal of Computers, 2014, 37(4): 927-949.)
[70] 孟小峰, 张啸剑. 大数据隐私管理[J]. 计算机研究与发展, 2015, 52(2): 265-281.
[70] (Meng Xiaofeng, Zhang Xiaojian.Big Data Privacy Management[J]. Journal of Computer Research and Development, 2015, 52(2): 265-281.)
[71] Schnell R, Bachteler T, Reiher J.Privacy-Preserving Record Linkage Using Bloom Filters[J]. BMC Medical Informatics and Decision Making, 2009, 9(1): 41.
[72] Bloom B H.Space/Time Trade-offs in Hash Coding with Allowable Errors[J]. Communications of the ACM, 1970, 13(7): 422-426.
[73] Jain N, Dahlin M, Tewari R.Using Bloom Filters to Refine Web Search Results[C]// Proceedings of the 8th International Workshop on the Web and Databases. 2005: 25-30.
[74] Durham E A, Kantarcioglu M, Xue Y, et al.Composite Bloom Filters for Secure Record Linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2956-2968.
[75] Vatsalan D, Christen P.Scalable Privacy-Preserving Record Linkage for Multiple Databases[C]// Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014: 1795-1798.
[76] Randall S M, Ferrante A M, Boyd J H, et al.Privacy- Preserving Record Linkage on Large Real World Datasets[J]. Journal of Biomedical Informatics, 2014, 50: 205-212.
[77] Jurczyk P, Xiong L.Towards Privacy-Preserving Integration of Distributed Heterogeneous Data[C]// Proceedings of the 2nd PhD Workshop on Information and Knowledge Management. ACM, 2008: 65-72.
[78] Sheikh R, Mishra D K, Kumar B.Secure Multiparty Computation: From Millionaires Problem to Anonymizer[J]. Information Security Journal: A Global Perspective, 2011, 20(1): 25-33.
[79] Goldwasser S, Micali S, Rackoff C.The Knowledge Complexity of Interactive Proof Systems[J]. SIAM Journal on Computing, 1989, 18(1): 186-208.
[80] Quisquater J, Quisquater M, Quisquater M, et al.How to Explain Zero-Knowledge Protocols to Your Children[C]// Proceedings of the 1989 Conference on the Theory and Application of Cryptology. 1989: 628-631.
[81] Lindell Y, Pinkas B.Secure Multiparty Computation for Privacy-Preserving Data Mining[J]. Journal of Privacy & Confidentiality, 2009, 25(2): 761-766.
[82] 维克托·迈尔-舍恩伯格, 肯尼思·库克耶. 大数据时代:生活、工作与思维的大变革[M]. 杭州: 浙江人民出版社, 2013.
[82] (Viktor Mayer-Schonberger, Kenneth Cukier.Big Data: A Revolution That will Transform How We Live, Work, and Think[M]. Hangzhou: Zhejiang People’s Publishing House, 2013.)
[83] Kooli N, Allesiardo R, Pigneul E.Deep Learning Based Approach for Entity Resolution in Databases[C]// Proceedings of the 2018 Asian Conference on Intelligent Information and Database Systems. 2018: 3-12.
[84] Schmidhuber J.Deep Learning in Neural Networks: An Overview[J]. Neural Networks, 2015, 61: 85-117.
[85] Vatsalan D, Sehili Z, Christen P, et al.Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges[A]// Zomaya A, Sakr S. Handbook of Big Data Technologies[M]. 2017: 851-895.
[86] Altowim Y, Mehrotra S.Parallel Progressive Approach to Entity Resolution Using MapReduce[C]// Proceedings of the 33rd International Conference on Data Engineering. 2017: 909-920.
[1] Tao Yue,Yu Li,Zhang Runjie. Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature[J]. 数据分析与知识发现, 2020, 4(10): 134-143.
[2] Guo Chen,Tianxiang Xu. Sentence Function Recognition Based on Active Learning[J]. 数据分析与知识发现, 2019, 3(8): 53-61.
[3] Han Huang,Hongyu Wang,Xiaoguang Wang. Automatic Recognizing Legal Terminologies with Active Learning and Conditional Random Field Model[J]. 数据分析与知识发现, 2019, 3(6): 66-74.
[4] He Huixin,Liu Lijuan. A Scientific Research Object Labeling System Based on Active earning[J]. 现代图书情报技术, 2016, 32(3): 67-73.
[5] Gao Guangshang, Zhang Zhixiong. Survey on Entity Resolution over Relational Databases[J]. 现代图书情报技术, 2015, 31(7-8): 37-47.
[6] Bi Qiumin, Li Ming, Zeng Zhiyong. Semi-supervised Micro-blog Sentiment Classification Method Combining Active Learning and Co-training[J]. 现代图书情报技术, 2015, 31(1): 38-44.
[7] Wang Pingshui. Research on Anonymous Privacy-Preserving Techniques Based on Clustering[J]. 现代图书情报技术, 2010, 26(11): 53-58.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn