Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 52-60     https://doi.org/10.11925/infotech.2096-3467.2022.0286
Research Paper
Multi-Truth Discovery Method Based on Attribute Fusion*
Yang Haolin1, Dong Yongquan1,2, Chen Huafeng1, Zhang Guoxi1
1School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221008, China
2Xuzhou Engineering Research Center of Cloud Computing, Xuzhou 221100, China

Abstract

[Objective] Most existing multi-truth discovery methods focus only on the multi-truth attribute itself and ignore the influence of auxiliary attributes. This paper incorporates auxiliary attributes to improve the effectiveness of multi-truth discovery. [Methods] First, we use auxiliary attributes to calculate the expertise and consensus degree of each data source. Then, we combine these with the activity degree of the multi-truth attribute values to obtain each source's degree of support for the conflicting data. Next, we call existing truth discovery methods to obtain pseudo-labels for the truths. Finally, we use a neural network to capture the complex relationships between the sources and the conflicting data, and infer all true values. [Results] Compared with the second-best method, the proposed method improves the F1 value by 2.25% on the book dataset and by 5.42% on the movie dataset. [Limitations] The proposed method fuses only auxiliary attributes that reflect object features; the impact of other auxiliary attributes on multi-truth discovery has not yet been explored. [Conclusions] Fusing multi-truth attributes with auxiliary attributes improves the accuracy of multi-truth discovery.

Key words: Multi-Truth Discovery; Data Conflicts; Information Quality; Multi-Truth Attribute; Auxiliary Attribute
Received: 2022-04-10      Published: 2023-01-13
CLC Number: TP311
Funding: * National Natural Science Foundation of China (61872168); Graduate Research Innovation Project of Jiangsu Normal University (2021XKT1381)
Corresponding author: Dong Yongquan     E-mail: tomdyq@163.com
Cite this article:
Yang Haolin, Dong Yongquan, Chen Huafeng, Zhang Guoxi. Multi-Truth Discovery Method Based on Attribute Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(11): 52-60.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0286      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I11/52
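| Source | Cast | Runtime (min) | Genre |
| --- | --- | --- | --- |
| IMDB | Daniel Radcliffe; Emma Watson; Rupert Grint | 152 | Fantasy, Adventure |
| FilmCrave | Daniel Radcliffe | 158 | Fantasy |
| Good Films | Johnny Depp; Emma Watson; Daniel Radcliffe | 155 | Fantasy, Adventure |
| Movie Insider | J. K. Rowling | 142 | Fantasy |

Table 1  Information on the movie Harry Potter provided by four websites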
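
The cast column of Table 1 is a small instance of the multi-truth problem: several values can be true at once, and the sources disagree. As a rough illustration, the sketch below applies a simple voting rule to those cast lists. The function name `vote` and the `threshold` parameter are hypothetical, and this is only one common form of voting, not necessarily the Majority Voting configuration used in the paper's experiments.

```python
from collections import Counter

# Cast lists claimed by each source in Table 1 (names in their usual spelling).
claims = {
    "IMDB": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint"],
    "FilmCrave": ["Daniel Radcliffe"],
    "Good Films": ["Johnny Depp", "Emma Watson", "Daniel Radcliffe"],
    "Movie Insider": ["J. K. Rowling"],
}


def vote(claims, threshold=0.5):
    """Return the values claimed by at least `threshold` of the sources.

    `threshold` is an illustrative parameter, not a value from the paper.
    """
    # Count each value once per source, even if a source repeats it.
    counts = Counter(v for values in claims.values() for v in set(values))
    n_sources = len(claims)
    return sorted(v for v, c in counts.items() if c / n_sources >= threshold)


print(vote(claims))                  # values backed by at least half of the sources
print(vote(claims, threshold=0.75))  # a stricter cut-off keeps fewer values
```

With these four sources, the result depends heavily on the cut-off, and a true cast member listed by only one source is dropped at either setting. Plain voting treats every source as equally reliable, which is the kind of weakness that source-quality signals are meant to address.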
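| Movie website | Comedy | Fantasy | Documentary | Sci-Fi |
| --- | --- | --- | --- | --- |
| IMDB | 18,279 | 16,013 | 31,750 | 29,056 |
| FilmCrave | 14,000 | 1,501 | 5,523 | 6,781 |
| Good Films | 20,708 | 5,136 | 11,551 | 17,408 |
| Movie Insider | 51,253 | 8,082 | 33,044 | 30,003 |
| Total | 104,240 | 22,650 | 81,868 | 83,248 |

Table 2  Number of movies of each genre provided by the movie websites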
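
Table 2 shows that the websites cover genres unevenly, which is the kind of auxiliary-attribute signal that a source expertise score could draw on. The page does not give the paper's expertise formula, so the sketch below uses a stand-in: the fraction of a source's listed movies that fall in each genre. The function name `genre_share` and this normalization are assumptions made for illustration only.

```python
# Movie counts per source and genre, copied from Table 2 (Total row omitted).
counts = {
    "IMDB": {"Comedy": 18279, "Fantasy": 16013, "Documentary": 31750, "Sci-Fi": 29056},
    "FilmCrave": {"Comedy": 14000, "Fantasy": 1501, "Documentary": 5523, "Sci-Fi": 6781},
    "Good Films": {"Comedy": 20708, "Fantasy": 5136, "Documentary": 11551, "Sci-Fi": 17408},
    "Movie Insider": {"Comedy": 51253, "Fantasy": 8082, "Documentary": 33044, "Sci-Fi": 30003},
}


def genre_share(counts):
    """Hypothetical expertise proxy: share of each genre within a source's catalogue."""
    shares = {}
    for source, by_genre in counts.items():
        total = sum(by_genre.values())
        shares[source] = {genre: n / total for genre, n in by_genre.items()}
    return shares


shares = genre_share(counts)
for source, by_genre in shares.items():
    strongest = max(by_genre, key=by_genre.get)
    print(f"{source}: strongest genre = {strongest} ({by_genre[strongest]:.3f})")
```

Under this proxy, a source's claims about an object would be weighted by its share for that object's genre. Other normalizations, such as a source's share of all movies within a genre, are equally plausible readings.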
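Fig.1  Workflow of the multi-truth discovery method based on attribute fusion
Fig.2  Structure of the multi-truth discovery model based on attribute fusion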
| Method | Book: Recall | Book: Precision | Book: F1 | Movie: Recall | Movie: Precision | Movie: F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Majority Voting | 0.7121 | 0.8700 | 0.7831 | 0.5776 | 0.8348 | 0.6815 |
| TruthFinder | 0.8183 | 0.8133 | 0.8158 | 0.7705 | 0.9239 | 0.8403 |
| LTM | 0.9218 | 0.7700 | 0.8391 | 0.7800 | 0.8559 | 0.8094 |
| DART | 0.9731 | 0.5755 | 0.7232 | 0.9262 | 0.7838 | 0.8487 |
| AFMTD | 0.8896 | 0.8286 | 0.8580 | 0.9128 | 0.8774 | 0.8947 |

Table 3  Performance comparison of the algorithms
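
The F1 values in Table 3 can be checked against the standard harmonic-mean definition, and the gains quoted in the abstract can be reproduced from them. The sketch below assumes that those gains are relative improvements over the second-best F1 on each dataset (LTM for books, DART for movies); that reading is inferred from the numbers rather than stated on this page. The helper names `f1` and `relative_gain` are illustrative.

```python
def f1(recall, precision):
    """F1 as the harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# AFMTD on the book dataset: recall and precision from Table 3.
print(round(f1(0.8896, 0.8286), 4))

# AFMTD F1 versus the second-best F1 on each dataset (values from Table 3).
print(round(relative_gain(0.8580, 0.8391), 2))  # book dataset, against LTM
print(round(relative_gain(0.8947, 0.8487), 2))  # movie dataset, against DART
```

Under this assumption the two gains come out at about 2.25% and 5.42%, matching the abstract. An absolute difference in F1 would instead give roughly 1.9 and 4.6 percentage points.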
Fig.3  Effect of varying the threshold
Fig.4  Ablation study