Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 1-9    DOI: 10.11925/infotech.2096-3467.2020.0718
Detecting Topics of Group Chats with Multiple Strategies
Wu Xu1,2,3, Chen Chunxu1,2
1Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
2School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
3Beijing University of Posts and Telecommunications Library, Beijing 100876, China
[Objective] This paper tries to detect topics in continuous group chats containing various types of messages, aiming to address the topic entanglement issue of group chats and to reduce the influence of sparse text features on clustering. [Methods] We proposed a detection model for group chat topics based on multiple strategies. This model addresses the topic crossover issue with topic sequences, and improves clustering results with data on users, time, and types of messages. [Results] We examined our model with plain texts of three group chats. The new method’s F value was 2.9%, 6.1% and 3.0% higher than those of the existing algorithms, and it ran about 27.6%, 32.1% and 47.1% faster. The method also processed mixed types of data that cannot be handled by traditional algorithms, and in that setting the speed was improved by about 29.4%, 27.1%, and 22.5% respectively. [Limitations] We do not fully utilize the text features of group chat messages, and the algorithm requires too many thresholds to be set. [Conclusions] The proposed method could identify group chat topics and improve the efficiency of public opinion analysis.

Key words: Group Chat Message; Topic Detection; Short Text
Received: 22 July 2020      Published: 27 May 2021
ZTFLH:  TP391  
Fund: The work is supported by the National Key Research and Development Plan (2017YFC0820603); the National Natural Science Foundation of China (62072488); the Natural Science Foundation of Beijing, China (4202064)
Wu Xu

Wu Xu,Chen Chunxu. Detecting Topics of Group Chats with Multiple Strategies. Data Analysis and Knowledge Discovery, 2021, 5(5): 1-9.

Topic Sequence
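Dataset | Messages | Participating [header truncated] | Topics | Average [header truncated] | [header missing]
D | 102,083 | 823 | NA | NA | NA
D1 | 9,024 | 117 | 1,413 | 8.28 | 21.80%
D2 | 3,690 | 148 | 855 | 7.35 | 24.44%
D3 | 636 | 68 | 303 | 26.49 | 29.40%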
Information of Dataset
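Algorithm | Metric | D1 | D2 | D3
SPTSWKV | Ttime | 490 | 530 | 950
SPTSWKV | Tc | 0.10 | 0.10 | 0.15
SPTSWKV | F | 0.645 | 0.594 | 0.698
SPTSAI | Tt | 0.20 | 0.70 | 0.65
SPTSAI | Tf | 0.75 | 0.30 | 0.05
SPTSAI | Ht | 5 | 4 | 7
SPTSAI | F | 0.664 | 0.630 | 0.719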
Experimental Results on Data Sets Without Nonsense Messages
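The F values reported above for the data sets without nonsense messages can be related to the percentages quoted in the abstract with a short calculation. The sketch below assumes that the abstract's "higher" figures are relative gains, (F_new - F_old) / F_old, and that SPTSWKV is the existing algorithm while SPTSAI is the proposed one; neither assumption is stated explicitly in this excerpt, and the function and variable names are illustrative only.

```python
# F values copied from the results for data sets without nonsense messages.
# Assumption: SPTSWKV is the existing baseline and SPTSAI the proposed method.
F_SPTSWKV = {"D1": 0.645, "D2": 0.594, "D3": 0.698}
F_SPTSAI = {"D1": 0.664, "D2": 0.630, "D3": 0.719}


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent.

    Assumed definition; the excerpt does not say how its percentages
    were computed.
    """
    return (improved - baseline) / baseline * 100.0


# Relative F gain per data set, rounded to one decimal place.
gains = {
    name: round(relative_gain_percent(F_SPTSWKV[name], F_SPTSAI[name]), 1)
    for name in F_SPTSWKV
}

for name, gain in gains.items():
    print(f"{name}: {gain}%")
```

Under these assumptions the computed gains for D1, D2 and D3 are 2.9%, 6.1% and 3.0%, which agrees with the F-value improvements stated in the abstract.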
Running Time of Two Methods
Algorithm | Metric | D1 | D2 | D3
SPTSAI | Tt | 0.65 | 0.75 | 0.20
SPTSAI | Tf | 0.15 | 0.15 | 0.70
SPTSAI | Ht | 2 | 4 | 4
SPTSAI | F | 0.859 | 0.801 | 0.881
Experimental Results on Data Sets with Nonsense Messages
Traverse Tf with Different Tt
Traverse Ht
