New Technology of Library and Information Service  2015, Vol. 31 Issue (7-8): 37-47    DOI: 10.11925/infotech.1003-3513.2015.07.06
Survey on Entity Resolution over Relational Databases
Gao Guangshang1,2, Zhang Zhixiong1
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

[Objective] To analyze the research status and future research directions of Entity Resolution (ER) over relational databases. [Methods] The accuracy and efficiency of ER are studied systematically. Work on accuracy is organized around incremental methods, statistical methods, and the use of related information; work on efficiency is organized around blocking, string similarity, and other ideas. [Results] Maximizing precision and efficiency is the main goal of ER, but research on dynamic evolution, heterogeneous data sources, and inexact string matching still faces significant challenges. [Limitations] Only the precision and efficiency of the ER process are discussed; the characteristics and limitations of ER models do not receive the same level of attention. [Conclusions] This paper gives a comprehensive overview of the ER process over relational databases, its research status, and its future research directions.

Received: 09 December 2014      Published: 25 August 2015
CLC Number: TP393

Cite this article:

Gao Guangshang, Zhang Zhixiong. Survey on Entity Resolution over Relational Databases. New Technology of Library and Information Service, 2015, 31(7-8): 37-47.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.07.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I7-8/37

