Detecting Topics of Group Chats with Multiple Strategies<sup>*</sup>

Data Analysis and Knowledge Discovery

2021, Vol. 5, Issue (5): 1-9

https://doi.org/10.11925/infotech.2096-3467.2020.0718

Research Paper


Wu Xu¹,²,³, Chen Chunxu¹,²

¹Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
²School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
³Beijing University of Posts and Telecommunications Library, Beijing 100876, China

Abstract:

[Objective] This paper aims to detect topics in continuous group chats that mix various types of messages, to better address topic entanglement in group chats, and to reduce the influence of sparse text features on clustering. [Methods] We propose a multi-strategy technique for detecting group chat topics. It resolves topic crossover by constructing topic sequences and improves clustering by using message attributes such as user, time, and type. [Results] On the plain-text data of three group chat samples, the F value of our method was 2.9%, 6.1%, and 3.0% higher than that of the comparison algorithm, and it ran about 27.6%, 32.1%, and 47.1% faster, respectively. The method also handled mixed-type data that traditional algorithms cannot process, where its performance was about 29.4%, 27.1%, and 22.5% higher, respectively, than on the corresponding plain-text data. [Limitations] The method underuses the text features of group chat messages and relies on too many thresholds. [Conclusions] The proposed method improves group chat topic detection to some extent, broadens the range of message types that topic detection can handle, and improves the efficiency of public opinion analysis.
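To make the idea of using user, time, and message-type attributes alongside sparse text features more concrete, the following is a minimal illustrative sketch. It is not the authors' algorithm and does not reproduce their topic-sequence construction: the `Message` and `Topic` classes, the Jaccard text similarity, the weights, the 120-second time window, and the 0.3 threshold are all hypothetical choices made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    user: str
    time: float                      # seconds since the start of the log
    kind: str                        # "text", "image", "sticker", ...
    tokens: frozenset = frozenset()  # empty for non-text messages


@dataclass
class Topic:
    messages: list = field(default_factory=list)

    @property
    def vocabulary(self):
        """Union of the tokens of all messages already in the topic."""
        vocab = set()
        for m in self.messages:
            vocab |= m.tokens
        return vocab


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def score(msg, topic, time_window=120.0):
    """Hypothetical affinity between a message and an existing topic."""
    gap = msg.time - topic.messages[-1].time
    time_score = max(0.0, 1.0 - gap / time_window)
    user_score = 1.0 if any(m.user == msg.user for m in topic.messages) else 0.0
    if msg.kind == "text" and msg.tokens:
        text_score = jaccard(msg.tokens, topic.vocabulary)
        return 0.6 * text_score + 0.3 * time_score + 0.1 * user_score
    # Non-text messages carry no text features, so only time and user count.
    return 0.7 * time_score + 0.3 * user_score


def detect_topics(messages, threshold=0.3):
    """Single pass: attach each message to its best topic or open a new one."""
    topics = []
    for msg in sorted(messages, key=lambda m: m.time):
        best, best_score = None, 0.0
        for topic in topics:
            s = score(msg, topic)
            if s > best_score:
                best, best_score = topic, s
        if best is not None and best_score >= threshold:
            best.messages.append(msg)
        else:
            topics.append(Topic([msg]))
    return topics


# Two interleaved conversations, one of which contains an image message.
log = [
    Message("A", 0, "text", frozenset({"lunch", "noodles"})),
    Message("B", 10, "text", frozenset({"deadline", "report"})),
    Message("C", 20, "text", frozenset({"noodles", "spicy"})),
    Message("A", 25, "image"),
    Message("B", 60, "text", frozenset({"report", "draft"})),
]
topics = detect_topics(log)
print([[m.user for m in t.messages] for t in topics])
# [['A', 'C', 'A'], ['B', 'B']]
```

In this toy log the image message has no tokens, yet it is still attached to the lunch conversation through its sender and timestamp, which is the kind of mixed-type case a purely text-based clustering cannot score. The number of tunable constants in even this small sketch also hints at why the abstract lists the number of thresholds as a limitation.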

Keywords: Group Chat Message, Topic Detection, Short Text

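Received: 2020-07-22 Published: 2021-05-27

CLC number (ZTFLH): TP391

Funding: *This work is supported by the National Key R&D Program of China (2017YFC0820603), the National Natural Science Foundation of China (62072488), and the Beijing Natural Science Foundation (4202064).

Corresponding author: Wu Xu, E-mail: wux@bupt.edu.cn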

Cite this article:

Wu Xu, Chen Chunxu. Detecting Topics of Group Chats with Multiple Strategies[J]. Data Analysis and Knowledge Discovery, 2021, 5(5): 1-9.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0718 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I5/1

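Fig. 1 Topic sequence

Table 1 Information on the group chat datasets

Table 2 Experimental results on the datasets with meaningless messages removed

Fig. 2 Running time of the two methods

Table 3 Experimental results on the datasets with meaningless messages retained

Fig. 3 Sweeping T_f under different values of T_t

Fig. 4 Sweeping H_t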

