Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 1-9    DOI: 10.11925/infotech.2096-3467.2020.0718
Detecting Topics of Group Chats with Multiple Strategies
Wu Xu (1,2,3), Chen Chunxu (1,2)
1. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
2. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
3. Beijing University of Posts and Telecommunications Library, Beijing 100876, China
Abstract  

[Objective] This paper detects topics in continuous group chats that contain various types of messages, aiming to address the topic entanglement issue of group chats and to reduce the influence of sparse text features on clustering. [Methods] We propose a multi-strategy detection model for group chat topics. The model resolves topic crossover with topic sequences and improves clustering results with data on users, time, and message types. [Results] We evaluated the model on the plain texts of three group chats. Its F values were 2.9%, 6.1% and 3.0% higher than those of the existing algorithms, and it ran about 27.6%, 32.1% and 47.1% faster, respectively. The method also processed mixed types of data that traditional algorithms cannot handle, with speed improvements of about 29.4%, 27.1% and 22.5%, respectively. [Limitations] The method does not fully utilize the text features of group chat messages, and the algorithm requires too many thresholds. [Conclusions] The proposed method can identify group chat topics and improve the efficiency of public opinion analysis.
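
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch rather than the authors' model. It shows one way a single-pass clustering step could combine text similarity with user and time cues so that text-free messages (for example images) can still join a topic. The names `Message`, `Topic`, `score` and `detect_topics`, the weights 0.6/0.2/0.2, the threshold 0.35 and the 600-second gap are all assumptions made for illustration, and whitespace tokenization stands in for proper Chinese word segmentation.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    user: str
    time: float  # seconds since the start of the chat
    text: str    # empty string for non-text messages such as images


@dataclass
class Topic:
    messages: list = field(default_factory=list)


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity; 0.0 if either text has no tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def score(msg: Message, topic: Topic, max_gap: float) -> float:
    """Hypothetical combined score of text, user and time evidence."""
    gap = msg.time - topic.messages[-1].time
    if gap > max_gap:
        return 0.0  # the topic is treated as closed
    text_sim = max(jaccard(msg.text, m.text) for m in topic.messages)
    same_user = 1.0 if any(m.user == msg.user for m in topic.messages) else 0.0
    recency = 1.0 - gap / max_gap
    # Assumed weights, chosen only for this illustration.
    return 0.6 * text_sim + 0.2 * same_user + 0.2 * recency


def detect_topics(messages, threshold=0.35, max_gap=600.0):
    """Assign each message, in time order, to the best open topic or a new one."""
    topics = []
    for msg in sorted(messages, key=lambda m: m.time):
        best, best_score = None, 0.0
        for topic in topics:
            s = score(msg, topic, max_gap)
            if s > best_score:
                best, best_score = topic, s
        if best is not None and best_score >= threshold:
            best.messages.append(msg)
        else:
            topics.append(Topic([msg]))
    return topics
```

In this sketch two interleaved conversations stay separate because each message is compared against every open topic, not only the most recent one, and an image sent by a user who is already active in a topic can be attached to it through the user and time terms alone.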

Key words: Group Chat Message; Topic Detection; Short Text
Received: 22 July 2020      Published: 27 May 2021
Chinese Library Classification (ZTFLH): TP391
Fund: The work is supported by the National Key Research and Development Plan (2017YFC0820603); the National Natural Science Foundation of China (62072488); the Natural Science Foundation of Beijing, China (4202064)
Corresponding Authors: Wu Xu     E-mail: wux@bupt.edu.cn

Cite this article:

Wu Xu, Chen Chunxu. Detecting Topics of Group Chats with Multiple Strategies. Data Analysis and Knowledge Discovery, 2021, 5(5): 1-9.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0718     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I5/1

Topic Sequence
Dataset   Messages   Participating users   Topics   Avg. Chinese characters   Share of nonsense messages
D         102,083    823                   NA       NA                        NA
D1        9,024      117                   1,413    8.28                      21.80%
D2        3,690      148                   855      7.35                      24.44%
D3        636        68                    303      26.49                     29.40%
Information of Dataset
Algorithm   Metric   D1      D2      D3
SPTSWKV     Ttime    490     530     950
SPTSWKV     Tc       0.10    0.10    0.15
SPTSWKV     F        0.645   0.594   0.698
SPTSAI      Tt       0.20    0.70    0.65
SPTSAI      Tf       0.75    0.30    0.05
SPTSAI      Ht       5       4       7
SPTSAI      F        0.664   0.630   0.719
Experimental Results on Data Sets Without Nonsense Messages
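
The F-value gains quoted in the abstract (2.9%, 6.1% and 3.0%) can be checked against the F rows of this table. The short sketch below assumes that SPTSAI denotes the proposed model and SPTSWKV the existing algorithm, which this page does not state explicitly, and that the gains are relative improvements; the helper name `relative_gain` is introduced here only for illustration.

```python
# F values copied from the table on data sets without nonsense messages.
f_baseline = {"D1": 0.645, "D2": 0.594, "D3": 0.698}  # SPTSWKV (assumed baseline)
f_proposed = {"D1": 0.664, "D2": 0.630, "D3": 0.719}  # SPTSAI (assumed proposed model)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gains = {
    name: round(relative_gain(f_proposed[name], f_baseline[name]), 1)
    for name in f_baseline
}
print(gains)  # {'D1': 2.9, 'D2': 6.1, 'D3': 3.0}
```

Under these assumptions the computed values agree with the percentages reported in the abstract.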
Running Time of Two Methods
Algorithm   Metric   D1      D2      D3
SPTSAI      Tt       0.65    0.75    0.20
SPTSAI      Tf       0.15    0.15    0.70
SPTSAI      Ht       2       4       4
SPTSAI      F        0.859   0.801   0.881
Experimental Results on Data Sets with Nonsense Messages
Traverse Tf with Different Tt
Traverse Ht