New Technology of Library and Information Service  2015, Vol. 31 Issue (7-8): 37-47    DOI: 10.11925/infotech.1003-3513.2015.07.06
Section: Reviews
Survey on Entity Resolution over Relational Databases
Gao Guangshang1,2, Zhang Zhixiong1
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] To analyze the current state of research on Entity Resolution (ER) over relational databases and its future directions. [Methods] The literature is reviewed systematically from two perspectives, accuracy and efficiency. Work on accuracy is grouped into incremental approaches, statistical approaches, and approaches that exploit related information; work on efficiency is grouped into blocking, string similarity, and other approaches. [Results] Maximizing the accuracy and efficiency of resolution is the main goal of ER research, but work on the dynamic evolution of data sources, their heterogeneity, and inexact string matching still faces significant challenges. [Limitations] The survey considers only accuracy and efficiency in the ER process and gives less attention to the characteristics and limitations of the ER models themselves. [Conclusions] The survey provides a comprehensive overview of the ER process over relational databases, the current state of research, and future research directions.
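
The efficiency techniques named in the abstract, blocking and string similarity, can be illustrated with a minimal Python sketch. It is not taken from the survey: the blocking key (first letter of the last name token), the `SequenceMatcher` similarity measure, the 0.85 threshold, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of the last name token."""
    return record["name"].split()[-1][0].lower()


def similarity(a, b):
    """String similarity in [0, 1] using the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records, threshold=0.85):
    """Return pairs of record ids judged to refer to the same entity."""
    # Blocking: only records sharing a key are ever compared,
    # which avoids the full quadratic comparison over all records.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            # Inexact string matching inside each block.
            if similarity(r1["name"], r2["name"]) >= threshold:
                matches.append((r1["id"], r2["id"]))
    return matches


# Illustrative records only (assumed data, not from the survey).
records = [
    {"id": 1, "name": "Guangshang Gao"},
    {"id": 2, "name": "Guangshan Gao"},
    {"id": 3, "name": "Zhixiong Zhang"},
    {"id": 4, "name": "Zhixiong Zhan"},
    {"id": 5, "name": "Peter Christen"},
]
matches = resolve(records)
```

A coarser key yields larger blocks and more comparisons but misses fewer true matches, while a finer key does the opposite; that trade-off is the accuracy versus efficiency tension the abstract describes.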

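Received: 2014-12-09
CLC number: TP393
Funding:

This work is one of the research outcomes of the National Science and Technology Support Program project of the 12th Five-Year Plan, "Construction of a Sharing Platform for Scientific and Technological Knowledge Organization Systems" (Grant No. 2011BAH10B03).

Corresponding author: Gao Guangshang, ORCID: 0000-0003-4140-1735, E-mail: gaoguangshang@mail.las.ac.cn
Author contributions: Gao Guangshang: carried out the research, conducted the literature survey and analysis, wrote the paper, and revised the final version; Zhang Zhixiong: proposed the research idea.
Cite this article:
Gao Guangshang, Zhang Zhixiong. Survey on Entity Resolution over Relational Databases [J]. New Technology of Library and Information Service, 2015, 31(7-8): 37-47. DOI: 10.11925/infotech.1003-3513.2015.07.06.
Link to this article: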
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.07.06


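Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn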