New Technology of Library and Information Service  2015, Vol. 31 Issue (7-8): 37-47    DOI: 10.11925/infotech.1003-3513.2015.07.06
Survey on Entity Resolution over Relational Databases
Gao Guangshang1,2, Zhang Zhixiong1
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

[Objective] This paper analyzes the current state of research on Entity Resolution (ER) over relational databases and its future directions. [Methods] The literature is reviewed systematically along two dimensions, accuracy and efficiency. Work on accuracy draws on incremental methods, statistical methods, and related information; work on efficiency draws on blocking, string similarity, and similar techniques. [Results] Maximizing precision and efficiency is the main goal of ER, but research on dynamic evolution, heterogeneous data sources, and inexact string matching still faces significant challenges. [Limitations] Only the precision and efficiency of the ER process are discussed; the characteristics and limitations of ER models do not receive the same level of attention. [Conclusions] This paper gives a comprehensive overview of the ER process over relational databases, the current state of research, and future research directions.
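
As a rough illustration of the two efficiency ideas named in the abstract, blocking and string similarity, the sketch below groups records by a blocking key and compares names only within each block. It is a minimal sketch and not the method of any surveyed paper: the record fields, the `block_key` rule, the choice of Jaccard similarity over character bigrams, and the 0.6 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def bigrams(text):
    """Return the set of character bigrams of a lowercased string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def block_key(record):
    """Hypothetical blocking key: first letter of the name plus the city."""
    return (record["name"][:1].lower(), record["city"].lower())


def resolve(records, threshold=0.6):
    """Return id pairs whose names are similar, comparing only within blocks."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    matches = []
    for group in blocks.values():
        # Pairwise comparison is limited to records sharing a blocking key,
        # which avoids comparing every record against every other record.
        for r1, r2 in combinations(group, 2):
            if jaccard(r1["name"], r2["name"]) >= threshold:
                matches.append((r1["id"], r2["id"]))
    return matches


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jonathon Smith", "city": "Boston"},
    {"id": 3, "name": "Jane Doe", "city": "Boston"},
    {"id": 4, "name": "Maria Garcia", "city": "Lyon"},
]
matches = resolve(records)
print(matches)  # [(1, 2)]
```

In this toy data, blocking reduces the six possible pairs to three, and only records 1 and 2 pass the similarity threshold. A coarser key would catch more true duplicates at the cost of more comparisons, which is the accuracy and efficiency trade-off the abstract refers to.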

Received: 09 December 2014      Published: 25 August 2015
:  TP393  

Cite this article:

Gao Guangshang, Zhang Zhixiong. Survey on Entity Resolution over Relational Databases. New Technology of Library and Information Service, 2015, 31(7-8): 37-47.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.07.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I7-8/37

