Survey on Entity Resolution over Relational Databases

New Technology of Library and Information Service, 2015, Vol. 31, Issue 7-8: 37-47

doi:10.11925/infotech.1003-3513.2015.07.06
https://doi.org/10.11925/infotech.1003-3513.2015.07.06

Section: Survey and Review

Gao Guangshang^1,2, Zhang Zhixiong^1

1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2 University of Chinese Academy of Sciences, Beijing 100049, China


Abstract

[Objective] To analyze the current state of research on Entity Resolution (ER) over relational databases and its future research directions. [Methods] ER is studied systematically from two angles: accuracy and efficiency. Work on accuracy is grouped into incremental methods, statistical methods, and methods that exploit related information. Work on efficiency is grouped into blocking, string similarity, and other methods. [Results] Maximizing the accuracy and the efficiency of resolution is the main goal of ER research, but the dynamic evolution of data sources, their heterogeneity, and inexact string matching still pose significant challenges. [Limitations] Only the accuracy and efficiency of the ER process are discussed; the characteristics and limitations of the ER models themselves receive less attention. [Conclusions] This paper gives a more comprehensive picture of the ER process over relational databases, the current state of research, and future research directions.
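To make the two efficiency-side ideas named in the abstract concrete, here is a minimal Python sketch that combines blocking with a string-similarity test. It is an illustration under stated assumptions, not a method taken from the survey: the blocking key (first character of the name), the similarity measure (Jaccard over character bigrams), the threshold of 0.5, and the function names `bigrams`, `jaccard`, and `resolve` are all hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations


def bigrams(text):
    """Return the set of character bigrams of a lowercased, stripped string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0  # two strings without bigrams are treated as identical
    return len(ga & gb) / len(ga | gb)


def resolve(records, threshold=0.5):
    """Return sorted pairs of record ids judged to denote the same entity."""
    # Blocking (assumed key: first character of the name). Only records
    # that share a block are compared, which avoids the full pairwise scan.
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[name.strip().lower()[:1]].append(rec_id)

    matches = []
    for ids in blocks.values():
        for i, j in combinations(sorted(ids), 2):
            # Inexact string matching inside each block.
            if jaccard(records[i], records[j]) >= threshold:
                matches.append((i, j))
    return sorted(matches)


records = {
    1: "Wang Xiaoming",
    2: "Wang Xiao-ming",
    3: "Li Na",
    4: "Wu Xiaoling",
}
print(resolve(records))  # [(1, 2)]
```

Records 1, 2, and 4 land in the same block, so three comparisons are made instead of the six a full pairwise scan would need; only the pair (1, 2) clears the threshold. A coarse key such as this one trades recall for speed, since true duplicates that differ in their first character are never compared.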

Received: 2014-12-09 Published: 2015-08-25

CLC number: TP393

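Funding:

This work is one of the research outcomes of the National Key Technology R&D Program project of the 12th Five-Year Plan, "Construction of a Sharing Platform for Scientific and Technological Knowledge Organization Systems" (Grant No. 2011BAH10B03).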

Corresponding author: Gao Guangshang, ORCID: 0000-0003-4140-1735, E-mail: gaoguangshang@mail.las.ac.cn

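Author contributions: Gao Guangshang: carried out the research, conducted the literature survey and analysis, wrote the paper, and revised the final version; Zhang Zhixiong: proposed the research idea.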

Cite this article:

Gao Guangshang, Zhang Zhixiong. Survey on Entity Resolution over Relational Databases [J]. New Technology of Library and Information Service, 2015, 31(7-8): 37-47.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.07.06 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I7-8/37


