Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue 9: 1-9     https://doi.org/10.11925/infotech.2096-3467.2021.0179
Research Paper
Research on Knowledge Base Error Detection Method Based on Confidence Learning
Li Wenna1,2, Zhang Zhixiong1,2,3
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
Abstract

[Objective] This paper explores a knowledge base error detection method based on confidence learning, aiming to address the problem of noisy data in knowledge bases. [Methods] We used the TransE model to represent knowledge base triples as vectors and a multi-layer perceptron to detect erroneous triples. We then cleaned the sample set with confidence learning and reduced the influence of noisy data on the model through multiple rounds of iterative training. [Results] On the DBpedia dataset, the proposed method reached a best F1 value of 0.7364, outperforming the control method. [Limitations] The noisy data in the experimental dataset was artificially generated, so its distribution differs somewhat from that of real-world noise, and the method's generality on larger knowledge bases remains to be verified. [Conclusions] The proposed method reduces the influence of noisy data through confidence learning and thereby performs well on the knowledge base error detection task.

Keywords: Knowledge Base; Error Detection; Confidence Learning
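As a rough illustration of the first step named in the abstract, the sketch below computes the TransE plausibility score, the distance ||h + r - t||, for a triple and builds a feature vector that a multi-layer perceptron could consume. The embedding values are made up, and concatenating the head, relation and tail vectors as classifier input is an assumption; the abstract does not say how the paper builds its features.

```python
import numpy as np


def transe_score(h, r, t, norm=1):
    """TransE distance ||h + r - t||; a smaller value means a more plausible triple."""
    return float(np.linalg.norm(h + r - t, ord=norm))


def triple_features(h, r, t):
    """Hypothetical classifier input: head, relation and tail vectors concatenated."""
    return np.concatenate([h, r, t])


# Toy 3-dimensional embeddings (made-up values, not taken from the paper).
head = np.array([0.2, 0.1, 0.0])
relation = np.array([0.3, -0.1, 0.5])
good_tail = np.array([0.5, 0.0, 0.5])    # equals head + relation
bad_tail = np.array([-0.4, 0.9, -0.3])   # far from head + relation

good_score = transe_score(head, relation, good_tail)
bad_score = transe_score(head, relation, bad_tail)
print("plausible triple score:", good_score)
print("implausible triple score:", bad_score)
print("feature vector length:", triple_features(head, relation, good_tail).shape[0])
```

In a trained model the embeddings would come from optimising TransE over the whole knowledge base; the point here is only that a triple whose tail lies far from head + relation receives a larger distance.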
Received: 2021-01-23      Published: 2021-10-15
CLC Number (ZTFLH):  TP393
Funding: This work is one of the research outcomes of the Special Project on Literature and Information Capacity Building of the Chinese Academy of Sciences (2019WQZX0017)
Corresponding author: Zhang Zhixiong     E-mail: zhangzhx@mail.las.ac.cn
Cite this article:
Li Wenna, Zhang Zhixiong. Research on Knowledge Base Error Detection Method Based on Confidence Learning. Data Analysis and Knowledge Discovery, 2021, 5(9): 1-9.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0179      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I9/1
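Fig. 1  Framework of the knowledge base error detection method based on confidence learning
Fig. 2  Workflow of the confidence learning module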
$C_{\tilde{y},y}$    $y=0$        $y=1$
$\tilde{y}=0$        $C_{0,0}$    $C_{0,1}$
$\tilde{y}=1$        $C_{1,0}$    $C_{1,1}$
Table 1  Confusion matrix of model predictions
$Q_{\tilde{y},y}$    $y=0$        $y=1$
$\tilde{y}=0$        $Q_{0,0}$    $Q_{0,1}$
$\tilde{y}=1$        $Q_{1,0}$    $Q_{1,1}$
Table 2  Joint probability distribution matrix
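Tables 1 and 2 can be illustrated with a minimal NumPy sketch that follows the general confident learning recipe of Northcutt et al.: the threshold for each class is the mean predicted probability among examples carrying that label, C counts the examples that clear a threshold, and Q is C calibrated to the observed label counts and normalised to sum to 1. Whether the paper computes C and Q in exactly this way is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np


def confident_joint(noisy_labels, pred_probs):
    """C[i, j]: number of examples with given label i confidently predicted as class j."""
    noisy_labels = np.asarray(noisy_labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]
    # Threshold t_j: mean predicted probability of class j over the examples
    # whose given label is j (assumes every class appears at least once).
    thresholds = np.array(
        [pred_probs[noisy_labels == j, j].mean() for j in range(n_classes)]
    )
    C = np.zeros((n_classes, n_classes), dtype=int)
    for label, probs in zip(noisy_labels, pred_probs):
        confident = np.flatnonzero(probs >= thresholds)
        if confident.size == 0:
            continue  # no class clears its threshold, so the example is not counted
        # If several classes clear their thresholds, keep the most probable one.
        j = confident[np.argmax(probs[confident])]
        C[label, j] += 1
    return C


def joint_distribution(C, noisy_labels):
    """Rescale each row of C to the observed label count, then normalise to sum to 1."""
    noisy_labels = np.asarray(noisy_labels)
    label_counts = np.bincount(noisy_labels, minlength=C.shape[0]).astype(float)
    row_sums = C.sum(axis=1, keepdims=True).astype(float)
    row_sums[row_sums == 0] = 1.0  # guard against empty rows
    calibrated = C / row_sums * label_counts[:, None]
    return calibrated / calibrated.sum()


# Toy data: label 0 = correct triple, label 1 = erroneous triple (made-up values).
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
         [0.1, 0.9], [0.3, 0.7], [0.7, 0.3]]

C = confident_joint(labels, probs)
Q = joint_distribution(C, labels)
print(C)
print(Q)
```

In this formulation the off-diagonal cells, $Q_{0,1}$ and $Q_{1,0}$, estimate the share of samples whose given label disagrees with the confidently predicted one; these are the candidates a cleaning step would prune before the next training round.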
Fig. 3  Construction workflow of the experimental dataset
Dataset    TransE                           C-TransE
           Precision   Recall   F1          Precision   Recall   F1 (change vs. TransE)
E1         0.7879      0.7210   0.7038      0.7978      0.7475   0.7364 (+4.63%)
E2         0.7931      0.7195   0.7007      0.7908      0.7360   0.7230 (+3.18%)
E5         0.7864      0.7010   0.6769      0.7852      0.7310   0.7176 (+6.01%)
E10        0.7716      0.6565   0.6158      0.7584      0.6925   0.6716 (+9.06%)
E15        0.7459      0.5520   0.4420      0.7561      0.6795   0.6536 (+47.87%)
E20        0.2500      0.5000   0.3333      0.7312      0.6615   0.6339 (+90.18%)
Table 3  Results of the controlled experiment on datasets with different noise ratios
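The percentages in parentheses in Table 3 are consistent with the relative F1 change of C-TransE over TransE, that is, (F1 of C-TransE minus F1 of TransE) divided by F1 of TransE. The short check below recomputes them from the rounded table values; because the inputs are rounded to four decimals, the results agree with the printed figures only to within about 0.01 percentage points. The variable and function names are illustrative.

```python
# F1 values copied from Table 3:
# dataset -> (TransE F1, C-TransE F1, reported change in percent)
table3_f1 = {
    "E1": (0.7038, 0.7364, 4.63),
    "E2": (0.7007, 0.7230, 3.18),
    "E5": (0.6769, 0.7176, 6.01),
    "E10": (0.6158, 0.6716, 9.06),
    "E15": (0.4420, 0.6536, 47.87),
    "E20": (0.3333, 0.6339, 90.18),
}


def relative_gain(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


for name, (transe_f1, ctranse_f1, reported) in table3_f1.items():
    gain = relative_gain(transe_f1, ctranse_f1)
    print(f"{name}: recomputed {gain:.2f}%, reported {reported:.2f}%")
```

The gap between the two models widens as the noise ratio rises from E1 to E20, which is the pattern the abstract summarises as confidence learning reducing the influence of noisy data.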
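Fig. 4  Comparison of model performance on datasets with different noise ratios
Fig. 5  Comparison of model stability on datasets with different noise ratios
Fig. 6  Comparison of manual annotation results for the top 100 detected errors on the real DBpedia dataset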
Head entity                              Relation              Tail entity            Error type
Bertram Kelly                            significant project   Isle of Man            Relation error
Chandigarh                               government type       Government of India    Entity error
George Latham (footballer)               team                  Newtown A.F.C.         Outdated data
Northwest Airlines                       lounge                Northwest Airlines     Entity error
Hammersmith                              borough               Fulham                 Entity error
South African Military Health Service    garrison              Pretoria               Entity error
Stuart Boardley                          team                  Long Melford F.C.      Entity error
Jong Ajax                                chairman              AFC Ajax               Relation error
Philadelphia Union                       chairman              Philadelphia Union     Entity error
Burt Bacharach                           instrument            McGill University      Entity error
Table 4  Erroneous triples detected in the real DBpedia dataset