Please wait a minute...
Advanced Search
数据分析与知识发现  2019, Vol. 3 Issue (5): 27-40     https://doi.org/10.11925/infotech.2096-3467.2018.1388
  综述评介 本期目录 | 过刊浏览 | 高级检索 |
关于实体解析基本方法的研究和述评*
高广尚()
桂林理工大学商学院 桂林 541004
Reviewing Basic Methods of Entity Resolution
Guangshang Gao()
Business School, Guilin University of Technology, Guilin 541004, China
全文: PDF (851 KB)   HTML ( 6
输出: BibTeX | EndNote (RIS)      
摘要 

【目的】探讨实体解析理论中经典的实体解析方法及逻辑思路。【文献范围】在Google Scholar和CNKI中分别以检索词“Entity Resolution”、“Collective Analysis”、“Crowdsourced”、“Active Learning”、“Privacy-Preserving”和“实体解析”进行文献检索, 再结合主题筛选, 精读并使用追溯法获得实体解析研究的代表性文献共86篇。【方法】针对每种实体解析方法, 归纳分析该方法的基本思想, 并通过图示直观地呈现其中的解析过程; 重点分析梳理方法实现过程中, 现有研究所采用的关键策略、算法或技术等。【结果】实体解析是数据质量管理的基本操作, 也是发现数据价值的关键步骤。【局限】未深入分析各实体解析方法的评价指标和应用情况。【结论】尽管现有实体解析方法能在一定程度上满足大部分应用的需求, 但在大数据环境下其仍然面临着数据混杂性、隐私保护和分布式环境等方面的挑战。

服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
高广尚
关键词 实体解析协同分析众包主动学习隐私保护    
Abstract

[Objective] This paper discusses the classical entity resolution methods and logical thinking in entity resolution theory. [Coverage] Google Scholar and CNKI were respectively used to search literatures with the keywords “Entity Resolution”, “Collective Analysis”, “Crowdsourced”, “Active Learning”, “Privacy-Preserving” and “Entity Resolution” in Chinese. I then obtained a total of 86 representative literatures in conjunction with topic screening, intensive reading and retrospective method. [Methods] For each entity resolution method, the paper first summarizes and analyzes the basic idea of the method, and presents the resolution process through illustration, and then focuses on analyzing the key strategies, algorithms or techniques adopted by the existing research in the process of implementation of the method. [Results] Entity resolution is the basic operation of data quality management, and the key step to find the value of data. [Limitations] There is no in-depth analysis of the evaluation indicators and application of each entity resolution method. [Conclusions] Although existing entity resolution methods can meet the requirements of most applications to some extent, they still face challenges in data heterogeneity, privacy protection and distributed environment in the big data environment.

Key wordsEntity Resolution    Collective Analysis    Crowdsourced    Active Learning    Privacy-Preserving
收稿日期: 2018-12-07      出版日期: 2019-07-03
基金资助:*本文系国家自然科学基金项目“面向数据演化的增量实体解析方法研究”(项目编号:71761008)和广西高校人文社会科学重点研究基地基金项目“面向企业数据治理的数据质量改善研究”(项目编号: 16YB010)的研究成果之一
引用本文:   
高广尚. 关于实体解析基本方法的研究和述评*[J]. 数据分析与知识发现, 2019, 3(5): 27-40.
Guangshang Gao. Reviewing Basic Methods of Entity Resolution. Data Analysis and Knowledge Discovery, 2019, 3(5): 27-40.
链接本文:  
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1388      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I5/27
[1] 孟小峰, 杜治娟. 大数据融合研究: 问题与挑战[J]. 计算机研究与发展, 2016, 53(2): 231-246.
[1] (Meng Xiaofeng, Du Zhijuan.Research on the Big Data Fusion: Issues and Challenges[J]. Journal of Computer Research and Development, 2016, 53(2): 231-246.)
[2] 李建中, 王宏志, 高宏. 大数据可用性的研究进展[J]. 软件学报, 2016, 27(7): 1605-1625.
[2] (Li Jianzhong, Wang Hongzhi, Gao Hong.State-of-the-Art of Research on Big Data Usability[J]. Journal of Software, 2016, 27(7): 1605-1625.)
[3] Dong X L, Srivastava D.Big Data Integration[C]// Proceedings of the 29th International Conference on Data Engineering. 2013: 1245-1248.
[4] Getoor L, Machanavajjhala A.Entity Resolution: Theory, Practice & Open Challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2018-2019.
[5] Dong X L, Rekatsinas T.Data Integration and Machine Learning: A Natural Synergy[C]// Proceedings of the 2018 International Conference on Management of Data. 2018: 1645-1650.
[6] Newcombe H B, Kennedy J M, Axford S J, et al.Automatic Linkage of Vital Records[J]. Science, 1959, 130(3381): 954-959.
[7] Talburt J R.Entity Resolution and Information Quality[M]. Elsevier, 2011.
[8] Fellegi I P, Sunter A B.A Theory for Record Linkage[J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[9] Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Metrics for Matching Names and Records[A]// KDD Workshop on Data Cleaning & Object Consolidation[M]. 2003, 4(2): 73-78.
[10] Naumann F, Herschel M.An Introduction to Duplicate Detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
[11] Doan A, Halevy A, Ives Z.Principles of Data Integration[M]. Elsevier, 2012.
[12] Levenshtein V I.Binary Codes Capable of Correcting Deletions, Insertions, and Reversals[J]. Doklady Akademii Nauk SSSR, 1965, 163(4): 845-848.
[13] Jaro M A.Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida[J]. Journal of the American Statistical Association, 1989, 84(406): 414-420.
[14] Winkler W E, Thibaudeau Y.An Application of the Felligi Sunter Model of Record Linkage to the 1990 US Decennial Census[R]. Washington: US Bureau of the Census, 1991.
[15] On B W, Koudas N, Lee D, et al.Group Linkage[C]// Proceedings of the 23rd International Conference on Data Engineering. 2007: 496-505.
[16] Bilenko M, Mooney R J, Cohen W W, et al.Adaptive Name Matching in Information Integration[J]. IEEE Intelligent Systems, 2003, 18(5): 16-23.
[17] Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Distance Metrics for Name-Matching Tasks[C]// Proceedings of the 2003 International Joint Conference on Artificial Intelligence. 2003: 73-78.
[18] Monge A E, Elkan C.The Field Matching Problem: Algorithms and Applications[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 1996: 267-270.
[19] Whang S E, Garcia-Molina H.Developments in Generic Entity Resolution[A]// Bulletin of the Technology Committee on Data Engineering[M]. IEEE Computer Society, 2011.
[20] Benjelloun O, Garcia-Molina H, Kawai H, et al.Generic Entity Resolution in the Serf Project[R]. Stanford InfoLab, 2006.
[21] Benjelloun O, Garcia-Molina H, Menestrina D, et al.Swoosh: A Generic Approach to Entity Resolution[J]. The VLDB Journal, 2009, 18(1): 255-276.
[22] Kawai H, Garcia-Molina H, Benjelloun O, et al.P-Swoosh: Parallel Algorithm for Generic Entity Resolution[R]. 2006.
[23] Kim H S, Lee D.Parallel Linkage[C]// Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007: 283-292.
[24] Kawai H, Garcia-Molina H, Benjelloun O, et al.Bufoosh: Buffering Algorithms for Generic Entity Resolution[R]. 2006.
[25] Benjelloun O, Garcia-Molina H, Gong H, et al.D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution[C]// Proceedings of the 27th International Conference on Distributed Computing Systems. IEEE, 2007.
[26] Bhattacharya I, Getoor L.Collective Entity Resolution in Relational Data[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): 5-18.
[27] Malin B, Airoldi E, Carley K M.A Network Analysis Model for Disambiguation of Names in Lists[J]. Computational & Mathematical Organization Theory, 2005, 11(2): 119-139.
[28] 高学东, 黄月. 异质对象协同实体解析的联合聚类算法[J]. 系统工程理论与实践, 2015, 35(4): 997-1004.
[28] (Gao Xuedong, Huang Yue.Co-Clustering Algorithm for Collective Entity Resolution of Multi-Typed Objects[J]. Systems Engineering - Theory & Practice, 2015, 35(4): 997-1004.)
[29] Dong X, Halevy A, Madhavan J.Reference Reconciliation in Complex Information Spaces[C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 2005: 85-96.
[30] 孙琛琛, 申德荣, 寇月, 等. 面向关联数据的联合式实体识别方法[J]. 计算机学报, 2015, 38(9): 1739-1754.
[30] (Sun Chenchen, Shen Derong, Kou Yue, et al.A Related Data Oriented Joint Entity Resolution Approach[J]. Chinese Journal of Computers, 2015, 38(9): 1739-1754.)
[31] Kalashnikov D V.Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph[J]. ACM Transactions on Database Systems, 2006, 31(2): 716-767.
[32] 冯剑红, 李国良, 冯建华. 众包技术研究综述[J]. 计算机学报, 2015, 38(9): 1713-1726.
[32] (Feng Jianhong, Li Guoliang, Feng Jianhua.A Survey on Crowdsourcing[J]. Chinese Journal of Computers, 2015, 38(9): 1713-1726.)
[33] Lee J, Cho H, Park J W, et al.Hybrid Entity Clustering Using Crowds and Data[J]. The VLDB Journal, 2013, 22(5): 711-726.
[34] Wang J, Kraska T, Franklin M J, et al.CrowdER: Crowdsourcing Entity Resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11): 1483-1494.
[35] Wang S, Xiao X, Lee C H.Crowd-Based Deduplication: An Adaptive Approach[C]// Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 2015: 1263-1277.
[36] Demartini G, Difallah D E, Cudré-Mauroux P, et al.ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking[C]// Proceedings of the 21st International Conference on World Wide Web. 2012: 469-478.
[37] Liu X, Lu M, Ooi B C, et al.CDAS: A Crowdsourcing Data Analytics System[J]. Proceedings of the VLDB Endowment, 2012, 5(10): 1040-1051.
[38] Settles B.Active Learning Literature Survey[R]. University of Wisconsinmadison, 2009.
[39] 杨文柱, 田潇潇, 王思乐, 等. 主动学习算法研究进展[J]. 河北大学学报:自然科学版, 2017, 37(2): 216-224.
[39] (Yang Wenzhu, Tian Xiaoxiao, Wang Sile, et al.Recent Advances in Active Learning Algorithms[J]. Journal of Hebei University: Natural Science Edition, 2017, 37(2): 216-224.)
[40] Sarawagi S, Bhamidipaty A.Interactive Deduplication Using Active Learning[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 269-278.
[41] Tejada S, Knoblock C A, Minton S.Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 350-359.
[42] Bellare K, Iyengar S, Parameswaran A G, et al.Active Sampling for Entity Matching[C]// Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012: 1131-1139.
[43] Arasu A, Kaushik R.On Active Learning of Record Matching Packages[C]// Proceeding of the 10th ACM SIGMOD International Conference on Management of Data. 2010: 783-794.
[44] Qian K, Popa L, Sen P.Active Learning for Large-Scale Entity Resolution[C]// Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 2017: 1379-1388.
[45] Fisher J, Christen P, Wang Q.Active Learning Based Entity Resolution Using Markov Logic[C]// Proceedings of the 2016 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2016: 338-349.
[46] Fu Y, Zhu X, Li B.A Survey on Instance Selection for Active Learning[J]. Knowledge and Information Systems, 2013, 35(2): 249-283.
[47] Ramadan B.Indexing Techniques for Real-Time Entity Resolution[D]. Canberra: The Australian National University, 2016.
[48] Ramadan B, Christen P, Liang H, et al. Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution[J]. Journal of Data and Information Quality (JDIQ), 2015, 6(4): Article No.15.
[49] Bayardo R J, Ma Y, Srikant R.Scaling up All Pairs Similarity Search[C]// Proceedings of the 16th International Conference on World Wide Web. 2007: 131-140.
[50] Broder A Z, Carmel D, Herscovici M, et al.Efficient Query Evaluation Using a Two-Level Retrieval Process[C]// Proceedings of the 12th International Conference on Information and Knowledge Management. 2003: 426-434.
[51] Zobel J, Moffat A.Inverted Files for Text Search Engines[J]. ACM Computing Surveys(CSUR), 2006, 38(2): 6.
[52] Christen P.A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9): 1537-1555.
[53] Christen P, Gayler R, Hawking D.Similarity-Aware Indexing for Real-Time Entity Resolution[C]// Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009: 1565-1568.
[54] Ramadan B, Christen P, Liang H, et al.Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution[C]// Proceedings of the 2013 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2013: 47-58.
[55] Ramadan B, Christen P.Unsupervised Blocking Key Selection for Real-Time Entity Resolution[C]// Proceedings of the 2015 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2015: 574-585.
[56] Papadakis G, Koutrika G, Palpanas T, et al.Meta-Blocking: Taking Entity Resolutionto the Next Level[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(8): 1946-1960.
[57] Efthymiou V, Papadakis G, Papastefanatos G, et al.Parallel Meta-Blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data[C]// Proceedings of the 2015 IEEE International Conference on Big Data. 2015: 411-420.
[58] Dragut E C, Ouzzani M, Elmagarmid A K.Query-Time Record Linkage and Fusion over Web Databases[C]// Proceedings of the 31st International Conference on Data Engineering. IEEE, 2015: 42-53.
[59] Gruenheid A, Dong X L, Srivastava D.Incremental Record Linkage[J]. Proceedings of the VLDB Endowment, 2014, 7(9): 697-708.
[60] Whang S E, Marmaros D, Garcia-Molina H.Pay-as-you-go Entity Resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5): 1111-1124.
[61] Hall R, Fienberg S E.Privacy-Preserving Record Linkage[C]// Proceedings of the 2013 International Conference on Privacy in Statistical Databases. 2010: 269-283.
[62] 韩姝敏, 申德荣, 聂铁铮, 等. 一种基于隐私保护下的多方记录链接方法[J]. 软件学报, 2017, 28(9): 2281-2292.
[62] (Han Shumin, Shen Derong, Nie Tiezheng, et al.Multi-Party Privacy-Preserving Record Linkage Apprpoach[J]. Journal of Software, 2017, 28(9): 2281-2292.)
[63] Kuzu M, Kantarcioglu M, Inan A, et al.Efficient Privacy-Aware Record Integration[C]// Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013: 167-178.
[64] Sweeney L.K-Anonymity: A Model for Protecting Privacy[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570.
[65] Sweeney L.Achieving K-Anonymity Privacy Protection Using Generalization and Suppression[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 571-588.
[66] Kantarcioglu M, Jiang W, Malin B.A Privacy-Preserving Framework for Integrating Person-Specific Databases[C]// Proceedings of the 2008 International Conference on Privacy in Statistical Databases. Springer, 2008: 298-314.
[67] Dwork C.Differential Privacy: A Survey of Results[C]// Proceedings of the 5th International Conference on Theory and Applications of Models of Computation. 2008: 1-19.
[68] Inan A, Kantarcioglu M, Ghinita G, et al.Private Record Matching Using Differential Privacy[C]// Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010: 123-134.
[69] 张啸剑, 孟小峰. 面向数据发布和分析的差分隐私保护[J]. 计算机学报, 2014, 37(4): 927-949.
[69] (Zhang Xiaojian, Meng Xiaofeng.Differential Privacy in Data Publication and Analysis[J]. Chinese Journal of Computers, 2014, 37(4): 927-949.)
[70] 孟小峰, 张啸剑. 大数据隐私管理[J]. 计算机研究与发展, 2015, 52(2): 265-281.
[70] (Meng Xiaofeng, Zhang Xiaojian.Big Data Privacy Management[J]. Journal of Computer Research and Development, 2015, 52(2): 265-281.)
[71] Schnell R, Bachteler T, Reiher J.Privacy-Preserving Record Linkage Using Bloom Filters[J]. BMC Medical Informatics and Decision Making, 2009, 9(1): 41.
[72] Bloom B H.Space/Time Trade-offs in Hash Coding with Allowable Errors[J]. Communications of the ACM, 1970, 13(7): 422-426.
[73] Jain N, Dahlin M, Tewari R.Using Bloom Filters to Refine Web Search Results[C]// Proceedings of the 8th International Workshop on the Web and Databases. 2005: 25-30.
[74] Durham E A, Kantarcioglu M, Xue Y, et al.Composite Bloom Filters for Secure Record Linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2956-2968.
[75] Vatsalan D, Christen P.Scalable Privacy-Preserving Record Linkage for Multiple Databases[C]// Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014: 1795-1798.
[76] Randall S M, Ferrante A M, Boyd J H, et al.Privacy- Preserving Record Linkage on Large Real World Datasets[J]. Journal of Biomedical Informatics, 2014, 50: 205-212.
[77] Jurczyk P, Xiong L.Towards Privacy-Preserving Integration of Distributed Heterogeneous Data[C]// Proceedings of the 2nd PhD Workshop on Information and Knowledge Management. ACM, 2008: 65-72.
[78] Sheikh R, Mishra D K, Kumar B.Secure Multiparty Computation: From Millionaires Problem to Anonymizer[J]. Information Security Journal: A Global Perspective, 2011, 20(1): 25-33.
[79] Goldwasser S, Micali S, Rackoff C.The Knowledge Complexity of Interactive Proof Systems[J]. SIAM Journal on Computing, 1989, 18(1): 186-208.
[80] Quisquater J, Quisquater M, Quisquater M, et al.How to Explain Zero-Knowledge Protocols to Your Children[C]// Proceedings of the 1989 Conference on the Theory and Application of Cryptology. 1989: 628-631.
[81] Lindell Y, Pinkas B.Secure Multiparty Computation for Privacy-Preserving Data Mining[J]. Journal of Privacy & Confidentiality, 2009, 25(2): 761-766.
[82] 维克托·迈尔-舍恩伯格, 肯尼思·库克耶. 大数据时代:生活、工作与思维的大变革[M]. 杭州: 浙江人民出版社, 2013.
[82] (Viktor Mayer-Schonberger, Kenneth Cukier.Big Data: A Revolution That will Transform How We Live, Work, and Think[M]. Hangzhou: Zhejiang People’s Publishing House, 2013.)
[83] Kooli N, Allesiardo R, Pigneul E.Deep Learning Based Approach for Entity Resolution in Databases[C]// Proceedings of the 2018 Asian Conference on Intelligent Information and Database Systems. 2018: 3-12.
[84] Schmidhuber J.Deep Learning in Neural Networks: An Overview[J]. Neural Networks, 2015, 61: 85-117.
[85] Vatsalan D, Sehili Z, Christen P, et al.Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges[A]// Zomaya A, Sakr S. Handbook of Big Data Technologies[M]. 2017: 851-895.
[86] Altowim Y, Mehrotra S.Parallel Progressive Approach to Entity Resolution Using MapReduce[C]// Proceedings of the 33rd International Conference on Data Engineering. 2017: 909-920.
[1] 赵宇翔,练靖雯. 数字人文视域下文化遗产众包研究综述*[J]. 数据分析与知识发现, 2021, 5(1): 36-55.
[2] 王树义,刘赛,马峥. 基于深度迁移学习的微博图像隐私分类研究*[J]. 数据分析与知识发现, 2020, 4(10): 80-92.
[3] 陶玥,余丽,张润杰. 科技文献中短语级主题抽取的主动学习方法研究*[J]. 数据分析与知识发现, 2020, 4(10): 134-143.
[4] 陈果,许天祥. 基于主动学习的科技论文句子功能识别研究 *[J]. 数据分析与知识发现, 2019, 3(8): 53-61.
[5] 黄菡,王宏宇,王晓光. 结合主动学习的条件随机场模型用于法律术语的自动识别*[J]. 数据分析与知识发现, 2019, 3(6): 66-74.
[6] 周成, 魏红芹. 基于随机森林属性约简的众包竞赛参与者识别体系研究*[J]. 数据分析与知识发现, 2018, 2(7): 46-54.
[7] 张亭亭, 赵宇翔, 朱庆华. 众包社区中基于敏感性分析的用户偏好挖掘模型及实验*[J]. 数据分析与知识发现, 2018, 2(5): 23-31.
[8] 贺惠新,刘丽娟. 主动学习的科技文献研究对象标引体系研究*[J]. 现代图书情报技术, 2016, 32(3): 67-73.
[9] 高广尚, 张智雄. 关系数据库中实体解析研究综述[J]. 现代图书情报技术, 2015, 31(7-8): 37-47.
[10] 毕秋敏, 李明, 曾志勇. 一种主动学习和协同训练相结合的半监督微博情感分类方法[J]. 现代图书情报技术, 2015, 31(1): 38-44.
[11] 李兵, 徐玮佳, 张婧璇. 利用微博数据的产品评价方法研究——以“安卓系统”评价为例[J]. 现代图书情报技术, 2014, 30(4): 92-98.
[12] 胡新明, 罗建军, 夏火松. 基于商品领域知识的交互式推荐系统[J]. 现代图书情报技术, 2014, 30(10): 56-62.
[13] 王平水. 基于聚类的匿名化隐私保护技术研究[J]. 现代图书情报技术, 2010, 26(11): 53-58.
[14] 李书宁. 图书馆专用公共网络浏览器介绍[J]. 现代图书情报技术, 2005, 21(3): 20-22.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn