Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (12): 99-112     https://doi.org/10.11925/infotech.2096-3467.2022.0127
Research Paper
Identifying Untrusted Weibo Users Based on Improved Dempster-Shafer Evidence Theory*
Xu Jianmin (1), Wang Kailin (1), Wu Shufang (2, corresponding author)
(1) College of Cyberspace Security and Computer, Hebei University, Baoding 071002, China
(2) College of Management, Hebei University, Baoding 071002, China
Abstract

[Objective] To identify untrusted Sina Weibo (microblog) users in the presence of subjective uncertainty, using an improved Dempster-Shafer (D-S) evidence theory. [Methods] The D-S evidence theory is improved on the basis of evidence distance. Under the improved theory, the credibility of each user's historical posts is converted into evidence, and the evidence is fused to generate the user's trust interval. A decision tree algorithm then identifies untrusted users from the trust interval. [Results] Compared with currently well-regarded methods for identifying untrusted users, the proposed method reduces time consumption by up to 287.4 seconds, raises the F1 value by up to 31.9 percentage points, and attains the best chi-square value in the consistency test. [Limitations] Only the subjective uncertainty arising from time decay and evidence conflict is considered; the influence of cognitive differences on subjectivity is not. [Conclusions] Identifying untrusted Weibo users with the improved D-S evidence theory improves identification performance.

Keywords: Microblog; Untrusted Users; Subjective Uncertainty; Dempster-Shafer Evidence Theory
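
The abstract states that the D-S theory is improved "based on evidence distance" but this page does not reproduce the formulas. As a rough illustration only, the sketch below computes the Jousselme distance (reference [31]) between two bodies of evidence over a two-element frame, and turns pairwise distances into normalized credibility weights. The weighting scheme, the names `FOCALS`, `jousselme_distance`, and `credibility_weights`, and the example masses are assumptions made for this sketch, not the authors' method.

```python
import math

# Focal elements over the frame {trusted, untrusted}; labels are illustrative.
T, U = "trusted", "untrusted"
FOCALS = [frozenset({T}), frozenset({U}), frozenset({T, U})]

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty sets."""
    return len(a & b) / len(a | b)

def jousselme_distance(m1, m2):
    """Jousselme distance between two mass functions (dicts: focal set -> mass)."""
    diff = [m1.get(f, 0.0) - m2.get(f, 0.0) for f in FOCALS]
    quad = sum(
        diff[i] * diff[j] * jaccard(FOCALS[i], FOCALS[j])
        for i in range(len(FOCALS))
        for j in range(len(FOCALS))
    )
    return math.sqrt(max(0.5 * quad, 0.0))

def credibility_weights(bodies):
    """Assumed scheme: weight each body by its total similarity (1 - distance) to the others."""
    support = [
        sum(1.0 - jousselme_distance(a, b) for j, b in enumerate(bodies) if j != i)
        for i, a in enumerate(bodies)
    ]
    total = sum(support)
    if total == 0:
        return [1.0 / len(bodies)] * len(bodies)
    return [s / total for s in support]

m_trusted = {frozenset({T}): 1.0}
m_untrusted = {frozenset({U}): 1.0}
print(jousselme_distance(m_trusted, m_untrusted))  # 1.0 (maximal disagreement)
```

Under this assumed weighting, a body of evidence that disagrees with all the others receives a small weight, which is one common way to soften conflict before fusion.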
Received: 2022-02-17      Published: 2023-02-03
CLC Number (ZTFLH): G203; TP182
Funding: *General Program of the National Social Science Fund of China (17BTQ068); Major Research Project in Humanities and Social Sciences of Hebei Province (ZD202102)
Corresponding author: Wu Shufang, ORCID: 0000-0002-9885-6944     E-mail: shufang_44@126.com
Cite this article:
Xu Jianmin, Wang Kailin, Wu Shufang. Identifying Untrusted Weibo Users Based on Improved Dempster-Shafer Evidence Theory. Data Analysis and Knowledge Discovery, 2022, 6(12): 99-112.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0127      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I12/99
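Fig.1  Research framework for identifying untrusted Weibo users based on the improved D-S evidence theory
Table 1  Examples of untrusted posts
Table 2  Fragment of the emoji mapping table
Table 3  Examples of emoji symbol conversion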
          A0 = {θ0}   A1 = {θ1}   A2 = {θ0, θ1}   A3 = ∅
m1        1.0         0.0         0.0             0.0
m2        0.0         1.0         0.0             0.0
M_ideal   0.0         0.0         1.0             0.0
M         undefined   undefined   undefined       0.0
Table 4  Example of fusing conflicting evidence
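
Table 4 can be reproduced with the classical Dempster combination rule: when m1 puts all mass on A0 and m2 puts all mass on A1, the conflict coefficient K equals 1, the normalizing factor 1 - K is zero, and the fused masses in the M row have no defined value. The sketch below is a minimal implementation of the classical rule only, not of the paper's improved rule; the function name and the second example are illustrative assumptions.

```python
A0 = frozenset({"theta0"})
A1 = frozenset({"theta1"})
A2 = frozenset({"theta0", "theta1"})

def dempster_combine(m1, m2, tol=1e-12):
    """Classical Dempster rule.

    Returns (fused masses, conflict K). The fused masses are None when
    K == 1, because the normalizing factor 1 - K is then zero.
    """
    fused = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                fused[inter] = fused.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass assigned to the empty set
    if conflict >= 1.0 - tol:
        return None, conflict
    return {f: w / (1.0 - conflict) for f, w in fused.items()}, conflict

# The two rows of Table 4: total conflict, so no fused result exists.
m1 = {A0: 1.0, A1: 0.0, A2: 0.0}
m2 = {A0: 0.0, A1: 1.0, A2: 0.0}
print(dempster_combine(m1, m2))  # (None, 1.0)
```

With partially conflicting inputs, for example `{A0: 0.6, A2: 0.4}` and `{A0: 0.5, A1: 0.3, A2: 0.2}`, the same function returns a normalized result with K = 0.18.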
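Proposition Aj                          Meaning
A0 = {user trusted}                     The Weibo user is trusted
A1 = {user untrusted}                   The Weibo user is untrusted
A2 = {user trusted, user untrusted}     The user's trustworthiness cannot be determined
A3 = ∅                                  No trustworthiness judgment has been made
Table 5  Propositions in the frame of discernment Θ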
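
Given masses over the propositions of Table 5, a trust interval can be read off as the pair of belief and plausibility for the proposition "user trusted". The sketch below assumes the standard D-S definitions, [Bel(A0), Pl(A0)]; this page does not show the paper's exact construction, and the mass values are made up for illustration.

```python
TRUSTED = frozenset({"trusted"})
UNTRUSTED = frozenset({"untrusted"})
EITHER = frozenset({"trusted", "untrusted"})

def trust_interval(mass, hypothesis=TRUSTED):
    """Return (belief, plausibility) of `hypothesis` under the mass function."""
    # Belief: total mass of non-empty focal sets contained in the hypothesis.
    belief = sum(w for f, w in mass.items() if f and f <= hypothesis)
    # Plausibility: total mass of focal sets that intersect the hypothesis.
    plausibility = sum(w for f, w in mass.items() if f & hypothesis)
    return belief, plausibility

# Hypothetical fused masses for one user (not taken from the paper).
mass = {TRUSTED: 0.6, UNTRUSTED: 0.1, EITHER: 0.3}
bel, pl = trust_interval(mass)
print(f"[{bel:.1f}, {pl:.1f}]")  # [0.6, 0.9]
```

The width of the interval, Pl - Bel, equals the mass left on A2, so it measures how much of the evidence could not decide between trusted and untrusted.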
Fig.2  Distribution of untrusted user samples
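Method       Description
DS-DT        The method proposed in this paper: untrusted user identification based on the improved D-S evidence theory
E-N[4]       Identifies fake users using minimum-entropy discretization and the naive Bayes algorithm
DDTLS[16]    Uses two-layer sampling active learning to assist fake user detection
Truser[11]   Mines untrusted users using two-stage ISODATA clustering
A-D[15]      Mines sensitive nodes that maliciously incite radical sentiment, based on sentiment orientation
Table 6  Comparison methods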
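Credit rating                     Very low   Fairly low   Average    Fairly good   Excellent   Total
Number identified as untrusted    u0         u1           u2         u3            u4          u
Actual number untrusted           v0         v1           v2         v3            v4          v
Total                             u0 + v0    u1 + v1      u2 + v2    u3 + v3       u4 + v4     u + v
Table 7  Contingency table of Weibo user credit rating versus trustworthiness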
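
A contingency table shaped like Table 7 (one row of identified counts, one row of actual counts, five credit-rating columns) can be scored with a chi-square statistic. The sketch below uses the standard Pearson chi-square for a contingency table; whether the paper uses exactly this form is an assumption, since the formula is not shown on this page, and the counts are invented for illustration.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip empty columns to avoid division by zero
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts per credit rating (very low ... excellent).
identified = [40, 30, 15, 10, 5]
actual = [40, 30, 15, 10, 5]
print(pearson_chi_square([identified, actual]))  # 0.0 when the two rows agree exactly
```

Under this reading, a smaller statistic means the identified distribution tracks the actual one more closely, which matches Table 10, where the lowest value is reported as the best.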
Method    Test set 1   Test set 2   Test set 3   Test set 4   Test set 5
DS-DT     422.7        407.3        389.8        428.5        411.3
E-N       453.5        431.0        428.6        447.1        426.1
DDTLS     687.7        673.1        677.2        703.8        692.0
Truser    653.9        650.4        632.3        677.4        655.0
A-D       604.4        581.7        573.0        611.8        596.1
Table 8  Identification time (seconds) of the 5 methods on the 5 test sets
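
The "up to 287.4 seconds" figure in the abstract can be checked directly against Table 8. The short script below copies the table values and searches for the largest gap between DS-DT and any other method on the same test set; the variable and function names are illustrative.

```python
# Identification time per test set, copied from Table 8 (seconds, per the abstract).
TIMES = {
    "DS-DT":  [422.7, 407.3, 389.8, 428.5, 411.3],
    "E-N":    [453.5, 431.0, 428.6, 447.1, 426.1],
    "DDTLS":  [687.7, 673.1, 677.2, 703.8, 692.0],
    "Truser": [653.9, 650.4, 632.3, 677.4, 655.0],
    "A-D":    [604.4, 581.7, 573.0, 611.8, 596.1],
}

def largest_saving(times, ours="DS-DT"):
    """Return (saving, method, test_set) for the largest time gap to `ours`."""
    candidates = [
        (round(t - times[ours][i], 1), name, i + 1)
        for name, series in times.items() if name != ours
        for i, t in enumerate(series)
    ]
    return max(candidates)

print(largest_saving(TIMES))  # (287.4, 'DDTLS', 3)
```

The maximum comes from DDTLS on test set 3 (677.2 - 389.8); against E-N, the closest competitor on time, the saving ranges from 14.8 to 38.8 seconds.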
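Fig.3  F1 values of the 5 methods for identifying fraud, pornography, abusive-language, and script-type users
Fig.4  Recall of the 5 methods for identifying fraud, pornography, abusive-language, and script-type users
Fig.5  Precision of the 5 methods for identifying fraud, pornography, abusive-language, and script-type users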
Method    F1      Recall   Precision
DS-DT     0.812   0.738    0.902
E-N       0.679   0.665    0.693
DDTLS     0.804   0.716    0.917
Truser    0.608   0.531    0.711
A-D       0.493   0.408    0.622
Table 9  F1 value, recall, and precision
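
The F1 column of Table 9 is consistent with the usual harmonic mean of precision and recall, and the gap between DS-DT and A-D gives the 31.9 percentage points quoted in the abstract. The check below uses only the numbers in the table; the names are illustrative.

```python
# method: (reported F1, recall, precision), copied from Table 9
RESULTS = {
    "DS-DT":  (0.812, 0.738, 0.902),
    "E-N":    (0.679, 0.665, 0.693),
    "DDTLS":  (0.804, 0.716, 0.917),
    "Truser": (0.608, 0.531, 0.711),
    "A-D":    (0.493, 0.408, 0.622),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recompute F1 for every method and compare with the reported value.
recomputed = {m: round(f1_score(p, r), 3) for m, (_, r, p) in RESULTS.items()}
print(recomputed["DS-DT"])  # 0.812

# Largest F1 improvement of DS-DT over a baseline, in percentage points.
gap = round((RESULTS["DS-DT"][0] - RESULTS["A-D"][0]) * 100, 1)
print(gap)  # 31.9
```

Note that DDTLS has slightly higher precision (0.917 versus 0.902), so the advantage of DS-DT in F1 over DDTLS comes from its higher recall.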
Method    χ²
DS-DT     533.65
E-N       688.15
DDTLS     563.46
Truser    579.21
A-D       756.87
Table 10  χ² values of the 5 identification methods
[1] Cyberspace Administration of China. Regulations on Ecological Governance of Network Information Content[EB/OL]. [2022-10-31]. http://www.cac.gov.cn/2019-12/20/c_1578375159509309.htm.
[2] Yu Z D, Yu H Q. Untrusted User Detection in Microblogs[C]// Proceedings of the 13th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 2014: 558-564.
[3] Dempster A P. Upper and Lower Probabilities Induced by a Multivalued Mapping[J]. The Annals of Mathematical Statistics, 1967, 38(2): 325-339.
doi: 10.1214/aoms/1177698950
[4] Erşahin B, Aktaş Ö, Kılınç D, et al. Twitter Fake Account Detection[C]// Proceedings of the 2017 International Conference on Computer Science and Engineering(UBMK). IEEE, 2017: 388-392.
[5] Wu Y H, Fang Y Z, Shang S K, et al. A Novel Framework for Detecting Social Bots with Deep Neural Networks and Active Learning[J]. Knowledge-Based Systems, 2021, 211: 106525.
doi: 10.1016/j.knosys.2020.106525
[6] Liang Xiaohe, Tian Ruya, Wu Lei, et al. Microblog Similarity Based on Super Network and Its Application in Microblog Public Opinion Topic Detection[J]. Library and Information Service, 2020, 64(11): 77-86.
doi: 10.13266/j.issn.0252-3116.2020.11.009
[7] Mccord M, Chuah M. Spam Detection on Twitter Using Traditional Classifiers[C]// Proceedings of the 8th International Conference on Autonomic and Trusted Computing. Springer, 2011: 175-186.
[8] Chen Huimin, Jin Sichen, Lin Wei, et al. Quantitative Analysis on the Communication of COVID-19 Related Social Media Rumors[J]. Journal of Computer Research and Development, 2021, 58(7): 1366-1384.
[9] Barbon S Jr, Campos G F C, Tavares G M, et al. Detection of Human, Legitimate Bot, and Malicious Bot in Online Social Networks Based on Wavelets[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(1s): Article No.26.
[10] Jia Junjie, Duan Chaoqiang. A Shilling Attack Detection Algorithm Based on Score Dispersion[J]. Computer Engineering & Science, 2022, 44(3): 554-562.
[11] He Peng, Wu Hao, Zeng Cheng, et al. Truser: An Approach to Service Recommendation Based on Trusted Users[J]. Chinese Journal of Computers, 2019, 42(4): 851-863.
[12] Alsmadi I, O'Brien M J. How Many Bots in Russian Troll Tweets?[J]. Information Processing & Management, 2020, 57(6): 102303.
doi: 10.1016/j.ipm.2020.102303
[13] Gupta A, Lamba H, Kumaraguru P. $1.00 per RT #BostonMarathon #PrayForBoston:Analyzing Fake Content on Twitter[C]// Proceedings of the 2013 APWG eCrime Researchers Summit. IEEE, 2013: 1-12.
[14] Kagan D M, Elovici Y, Fire M. Generic Anomalous Vertices Detection Utilizing a Link Prediction Algorithm[J]. Social Network Analysis and Mining, 2018, 8(1): 1-13.
doi: 10.1007/s13278-017-0479-5
[15] Wang Dan, Zhang Haitao, Liu Yashu, et al. Sentiment Analysis and Ideological Guidance of Key Nodes in Micro-Blog Public Opinion[J]. Library and Information Service, 2019, 63(4): 15-22.
doi: 10.13266/j.issn.0252-3116.2019.04.002
[16] Tan Kan, Gao Min, Li Wentao, et al. Two-Layer Sampling Active Learning Algorithm for Social Spammer Detection[J]. Acta Automatica Sinica, 2017, 43(3): 448-461.
[17] Shafer G A. A Mathematical Theory of Evidence[J]. Technometrics, 1978, 20(1): 106.
[18] Zadeh L A. A Simple View of the Dempster-Shafer Theory of Evidence and Its Implication for the Rule of Combination[J]. AI Magazine, 1986, 7(2):85-90.
[19] Murphy C K. Combining Belief Functions When Evidence Conflicts[J]. Decision Support Systems, 2000, 29(1): 1-9.
doi: 10.1016/S0167-9236(99)00084-6
[20] Yager R R. On the Dempster-Shafer Framework and New Combination Rules[J]. Information Sciences, 1987, 41(2): 93-137.
doi: 10.1016/0020-0255(87)90007-7
[21] Xu Peng, Lin Sen. Internet Traffic Classification Using C4.5 Decision Tree[J]. Journal of Software, 2009, 20(10): 2692-2704.
doi: 10.3724/SP.J.1001.2009.03444
[22] Shen Wang, Dai Wang, Gao Xueqian, et al. Research on Credibility Evaluation Method of Social Network Users Based on Multigraph: Perspective on Cyberbullying and Privacy Disclosure[J]. Journal of Modern Information, 2020, 40(8): 27-37.
doi: 10.3969/j.issn.1008-0821.2020.08.004
[23] Ming Yiyang, Liu Xiaojie. Sensitive Information Detection Based on Phrase-Level Sentiment Analysis[J]. Journal of Sichuan University (Natural Science Edition), 2019, 56(6): 1042-1048.
[24] Fu Cong, Yu Dunhui, Zhang Lingli. Study on Identification Method for Change Form of Chinese Sensitive Words[J]. Application Research of Computers, 2019, 36(4): 988-991.
[25] Jkiss. GitHub - jkiss/sensitive-words: Lexicon of Sensitive Words Commonly Used on the Internet[DS/OL]. (2018-12-04). [2022-04-29]. https://github.com/jkiss/sensitive-words.
[26] Harris Z S. Distributional Structure[J]. WORD, 1954, 10(2-3): 146-162.
doi: 10.1080/00437956.1954.11659520
[27] Ma Chao. Study on the Categories and Structure of Health Rumor Denials Community: From the Perspective of Rumor Cooperative Governance[J]. Journal of Intelligence, 2019, 38(1): 96-105.
[28] Sun Chenchen, Shen Derong, Shan Jing, et al. WSR: A Semantic Relatedness Measure Based on Wikipedia Structure[J]. Chinese Journal of Computers, 2012, 35(11): 2361-2370.
doi: 10.3724/SP.J.1016.2012.02361
[29] Sun Quan, Ye Xiuqing, Gu Weikang. A New Combination Rules of Evidence Theory[J]. Acta Electronica Sinica, 2000, 28(8): 117-119.
[30] Lu Wenxing, Liang Changyong, Ding Yong. A Method Determining the Objective Weights of Experts Based on Evidence Distance[J]. Chinese Journal of Management Science, 2008, 16(6): 95-99.
[31] Jousselme A L, Grenier D, Bossé É. A New Distance Between Two Bodies of Evidence[J]. Information Fusion, 2001, 2(2): 91-101.
doi: 10.1016/S1566-2535(01)00026-4
[32] Bi Wenhao, Zhang An, Li Chong. Weighted Evidence Combination Method Based on New Evidence Conflict Measurement Approach[J]. Control and Decision, 2016, 31(1): 73-78.
[33] Wu Jianyun, Xu Mingzhu. Video Personalized Recommendation Based on User Profile and Video Interest Tags[J]. Information Science, 2021, 39(1): 128-134.
[34] Li Ye, Wang Yagang, Xu Xiaoming. Research on Convergence and Conflict Treatment in Evidence Fusion[J]. Systems Engineering and Electronics, 2012, 34(6): 1113-1119.
[35] Wu Bao, Chi Renyong. A Trusted Measurement Method for Social Network Users That Integrates Sentiment Analysis and User Popularity[J]. Journal of Systems Science and Mathematical Sciences, 2021, 41(4): 1091-1107.
doi: 10.12341/jssms20251
[36] Lai Maosheng, Wang Lin, Li Yuning. Survey and Analysis of the Frontiers in Information Science[J]. Library and Information Service, 2008, 52(3): 6-10.
[37] Li K H, Huang Z, Cheng Y C, et al. A Maximal Figure-of-Merit Learning Approach to Maximizing Mean Average Precision with Deep Neural Network Based Classifiers[C]// Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014: 4503-4507.
[38] Wang Chong, Liu Liming. Setting Method of Goodness of Fit Test Statistics[J]. Statistics & Decision, 2010(5): 154-156.
[39] Yang Yu. Evaluation and Analysis of Weighting Methods in Multi Index Comprehensive Evaluation[J]. Statistics & Decision, 2006(13): 17-19.