Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 19-27    DOI: 10.11925/infotech.2096-3467.2017.02.03
Original Article
Identifying Hot Topics from Mobile Complaint Texts
Fang Xiaofei1, Huang Xiaoxi1, Wang Rongbo1, Chen Zhiqun1, Wang Xiaohua1,2
1Department of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2China Jiliang University, Hangzhou 310018, China
Abstract  

[Objective] This paper aims to extract valuable information from a large volume of complaint texts using Chinese information processing techniques. [Methods] First, we analyzed the characteristics of the complaint texts and clustered them with the k-means algorithm. Second, we extracted topics from the texts in each cluster with the LDA model, and calculated the weight of each word in every topic as well as the mean of each topic's probability distribution over documents. Third, we analyzed the topics with the highest means and used their document support rates to identify the hot topics. [Results] The document support rates of the topics extracted in this study were three times higher than the average. [Limitations] We did not investigate the semantic relationships among the topics. [Conclusions] The LDA model is an effective method for detecting hot topics in mobile complaints, and the results point to directions for future studies.
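
The hot-topic step described in the abstract can be illustrated with a small sketch. It assumes a document-topic matrix `theta` (one row per document, one column per topic, as an LDA implementation would produce) and counts a document as supporting a topic when that topic is its most probable one. The abstract does not spell out the support rule, so that rule, the toy numbers, and the function names are all assumptions made for illustration.

```python
from typing import List


def mean_topic_probability(theta: List[List[float]]) -> List[float]:
    """Average each topic's probability over all documents."""
    n_docs = len(theta)
    n_topics = len(theta[0])
    return [sum(row[k] for row in theta) / n_docs for k in range(n_topics)]


def support_counts(theta: List[List[float]]) -> List[int]:
    """Count, per topic, the documents whose most probable topic it is.

    Assumption: "supporting" means "dominant topic of the document".
    """
    counts = [0] * len(theta[0])
    for row in theta:
        dominant = max(range(len(row)), key=row.__getitem__)
        counts[dominant] += 1
    return counts


# Toy document-topic matrix (4 documents, 3 topics); not data from the paper.
theta = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
]

means = mean_topic_probability(theta)
counts = support_counts(theta)
rates = [c / len(theta) for c in counts]  # document support rate per topic
```

Under this rule a topic with a high mean probability and a high support rate would be a hot-topic candidate; a thresholded rule (for example, probability above some cutoff) would be an equally plausible reading.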

Key words: Mobile Complaints; k-means; Topic Detection; LDA Model
Received: 10 November 2016      Published: 27 March 2017
ZTFLH:  TP391  

Cite this article:

Fang Xiaofei, Huang Xiaoxi, Wang Rongbo, Chen Zhiqun, Wang Xiaohua. Identifying Hot Topics from Mobile Complaint Texts. Data Analysis and Knowledge Discovery, 2017, 1(2): 19-27.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.03     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/19

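Model | Extension type | Implementation | Advantages | Limitations
LDA[1] | Used directly | - | No supervision required | Topic mining results are unsatisfactory
User-aggregated LDA[3] | Process extension | Text aggregation | Addresses the short-text problem | Models only at the microblog user level; requires manual intervention
Trained USER scheme[4] | Process extension | Text aggregation, stepwise solving | Addresses the short-text problem; simplifies derivation | Requires prior training and manual intervention; updating the model requires retraining
ATM[5] | Model extension | Text aggregation | Addresses the short-text problem | Topic modeling only at the microblog user level
Extended ATM[12] | Model extension | Text aggregation | Addresses the short-text problem | Post-level topics are few and unsatisfactory
Twitter-LDA[6, 13] | Model extension | Text aggregation, with a background model | Addresses the short-text and high-frequency-word problems | Each post can correspond to only one topic
Labeled-LDA[7, 14] | Model extension | Introduces label information | Improves topic interpretability | Requires texts with sufficient label information
MB-LDA[8] | Model extension | Introduces structural information | Addresses the short-text problem; improves topic interpretability | Mainly targets conversational and reposted Chinese microblogs
HLDA[9] | Model extension | Introduces features such as comment and repost counts | Improves topic interpretability | Mainly targets microblogs with high comment and repost counts
MA-LDA[10] | Model extension | Introduces temporal features | Addresses the short-text problem; improves topic interpretability | Mainly suited to microblogs that attract wide attention within a short time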
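Word | Frequency | Part of speech
短信费用 (SMS fee) | 1000 | n (noun)
欠费停机 (suspension for arrears) | 2000 | n
上网费用 (internet access fee) | 2000 | n
有线宽带 (wired broadband) | 2000 | n
畅玩游戏包 (gaming package) | 500 | n
爱动漫信息费 (Ai Dongman information fee) | 3000 | n
夜间流量 (night-time data traffic) | 28641 | n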
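
Entries like these are typically supplied to a segmenter as a custom dictionary. The sketch below assumes that use and formats a few rows in the plain-text layout that jieba (cited in the reference list) accepts for user dictionaries: one entry per line, with word, frequency, and part-of-speech tag separated by spaces. The helper name and the choice of rows are illustrative, and the page does not confirm that the authors built their dictionary this way.

```python
def to_userdict_line(word: str, freq: str, pos: str) -> str:
    """Format one dictionary entry as 'word freq pos'.

    Frequencies in the table use a space as thousands separator
    (for example '28 641'), so the spaces are removed first.
    """
    freq_int = int(freq.replace(" ", ""))
    return f"{word} {freq_int} {pos}"


# Three rows copied from the table above.
entries = [
    ("短信费用", "1 000", "n"),
    ("欠费停机", "2 000", "n"),
    ("夜间流量", "28 641", "n"),
]

lines = [to_userdict_line(*entry) for entry in entries]
userdict_text = "\n".join(lines)

# The text could be written to a UTF-8 file and loaded with
# jieba.load_userdict(path) before segmenting the complaint texts.
```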
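Original text | Request: The user called to say that they rarely go online with their mobile phone (18067938538) and asked why the phone exceeded its data allowance by (210.38) MB in (2015-03). The front desk explained that no data traffic is generated unless data is used, advised the user to turn off the mobile data switch when the phone is not in use, and noted that the user can log in to the online service hall to check the itemized data records. The user did not accept this explanation and stated that they would not pay for traffic they had not used. Please handle, thank you. Customer asset number: 1-14PTKKXU
Processed result | 手机 上网 月份 手机 超出 上网流量 前台 解释 上网流量 前台 建议 手机 数据流量 开关 关闭 登录 网厅 查询 上网 详单 前台 解释 用户不认可 流量费用 不予 承担
Processed result (English gloss) | phone / go online / month / phone / exceed / data traffic / front desk / explain / data traffic / front desk / advise / phone / mobile data / switch / turn off / log in / online service hall / query / go online / itemized record / front desk / explain / user does not accept / data charges / refuse / bear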
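
The difference between the original text and the processed result suggests a cleaning pass after segmentation that drops numbers, punctuation, and uninformative words. A minimal sketch of such a pass follows; it assumes the text has already been segmented into tokens, and the stop-word set here is a tiny hypothetical sample rather than the Harbin Institute of Technology list cited in the references.

```python
import re

# Hypothetical stop words for illustration only.
STOP_WORDS = {"自己", "的", "为什么", "在", "会"}

# Matches tokens made up solely of digits, punctuation, or underscores,
# such as phone numbers, dates like 2015-03, and brackets.
_NOISE = re.compile(r"^[\W\d_]+$")


def clean_tokens(tokens):
    """Keep tokens that are neither stop words nor pure digit/punctuation noise."""
    return [t for t in tokens if t not in STOP_WORDS and not _NOISE.match(t)]


# A short, already segmented fragment of the complaint above.
tokens = ["手机", "(", "18067938538", ")", "自己", "上网", ",",
          "2015-03", "月份", "的", "上网流量"]

cleaned = clean_tokens(tokens)
```

The processed result in the table also omits some content words, which suggests additional filtering (for example by part of speech) that this sketch does not attempt.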
Number of texts in a cluster | Number of clusters
[0, 50] | 20
[51, 100] | 62
[101, 200] | 88
[201, +∞) | 40
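
The counts in this table sum to 210 clusters. A distribution of this kind can be produced by binning the size of each cluster into the same intervals, as in the sketch below. The interval bounds are taken from the table; the sample sizes and names are made up for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of the first three intervals in the table.
UPPER_BOUNDS = [50, 100, 200]
LABELS = ["[0, 50]", "[51, 100]", "[101, 200]", "[201, +inf)"]


def size_histogram(cluster_sizes):
    """Count how many clusters fall into each size interval."""
    counts = Counter(LABELS[bisect_left(UPPER_BOUNDS, s)] for s in cluster_sizes)
    return {label: counts.get(label, 0) for label in LABELS}


# Toy cluster sizes, not the paper's data.
sizes = [12, 50, 51, 100, 150, 201, 999]
histogram = size_histogram(sizes)
```

`bisect_left` places a size equal to a bound (50, 100, or 200) in the interval that ends at that bound, which matches the closed upper ends used in the table.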
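Topic | Words and word probabilities
Topic 0 | 金额 (amount) 0.494; 号码 (number) 0.056; 宽带 (broadband) 0.042; 接到 (received) 0.028; 核实 (verify) 0.028; 收费 (charge) 0.015; 订单 (order) 0.015; 114 0.015; 显示 (display) 0.015; 退订 (unsubscribe) 0.015
Topic 1 | 违约金 (penalty fee) 0.228; 不认可 (not accepted) 0.122; 无 (none) 0.046; 对此 (regarding this) 0.046; 滞纳金 (late fee) 0.031; 卡里 (on the card) 0.031; 用户不认可 (user does not accept) 0.031; 费用 (fee) 0.031; 通知 (notice) 0.031; 收费 (charge) 0.016
Topic 2 | 违约金 (penalty fee) 0.092; 返利 (rebate) 0.077; 翼支付 (Bestpay) 0.062; 投诉 (complaint) 0.062; 成功 (success) 0.031; 翼支付加油 (Bestpay refuelling) 0.031; 支付 (payment) 0.031; 收到 (received) 0.031; 营业 (business) 0.031; 办理 (handle) 0.031
Topic 3 | 滞纳金 (late fee) 0.33; 欠费 (arrears) 0.101; 电话 (telephone) 0.044; 上门 (home visit) 0.030; 违约金 (penalty fee) 0.030; 加油 (refuelling) 0.030; 平台 (platform) 0.030; 前台 (front desk) 0.015; 怎么回事 (what happened) 0.015; 出票 (ticket issuing) 0.015
Topic 4 | 减免 (waiver) 0.248; 前台 (front desk) 0.087; 解释 (explain) 0.087; 交易 (transaction) 0.062; 强烈要求 (strongly demand) 0.038; 账户 (account) 0.025; 情况 (situation) 0.025; 营业厅 (service hall) 0.025; 无效 (invalid) 0.013; 号用 (segmentation fragment) 0.013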
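
Each topic here is a probability distribution over words, and several words recur across topics with different weights. As a rough sketch, a topic can be summarised by its highest-probability words; using the single top word as a label is an assumption made for illustration, not a rule stated on this page. The probabilities below are a subset copied from the table.

```python
# Partial topic-word distributions copied from the table above.
topics = {
    0: {"金额": 0.494, "号码": 0.056, "宽带": 0.042},
    1: {"违约金": 0.228, "不认可": 0.122, "滞纳金": 0.031},
    3: {"滞纳金": 0.33, "欠费": 0.101, "违约金": 0.030},
}


def top_words(distribution, n=2):
    """Return the n words with the highest probability in a topic."""
    return sorted(distribution, key=distribution.get, reverse=True)[:n]


# Label each topic with its single most probable word (illustrative choice).
labels = {topic_id: top_words(dist, 1)[0] for topic_id, dist in topics.items()}

# Words appearing in more than one of the listed topics.
shared = sorted(
    w for w in {w for d in topics.values() for w in d}
    if sum(w in d for d in topics.values()) > 1
)
```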
Item | Value
Number of topics | 299
Mean number of supporting documents | 1760.81
Maximum | 8076
Minimum | 3
Median | 1288
25th percentile | 424
50th percentile | 1288
70th percentile | 2628.5
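
Summary statistics of this kind can be computed from the per-topic supporting-document counts. The sketch below uses Python's `statistics` module plus a small linear-interpolation percentile helper; the interpolation method is an assumption, since the page does not say how the percentiles were obtained, and the input values are toy numbers rather than the 299 topics from the study.

```python
import statistics


def percentile(values, p):
    """Percentile with linear interpolation between ranks, p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Toy supporting-document counts, one per topic.
support = [5, 40, 120, 300, 900]

summary = {
    "topics": len(support),
    "mean": statistics.mean(support),
    "max": max(support),
    "min": min(support),
    "median": statistics.median(support),
    "p25": percentile(support, 25),
    "p50": percentile(support, 50),
    "p70": percentile(support, 70),
}
```

As in the table, the 50th percentile coincides with the median.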
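Hot topic | Supporting documents | Ordinary topic | Supporting documents
账单 (bill) | 8070 | 维修 (repair) | 1643
用户不认可 (user does not accept) | 6797 | 包月费 (monthly package fee) | 1058
副卡 (secondary card) | 6733 | 违约金 (penalty fee) | 526
短号 (short number) | 6408 | 上门移机 (on-site relocation) | 428
路由器 (router) | 6342 | 服务质量 (service quality) | 349
数据流量 (mobile data) | 5360 | 一号双机 (one number, two devices) | 249
国内上网 (domestic internet access) | 4848 | 翼支付 (Bestpay) | 240
线路 (line) | 4262 | 租用 (rental) | 205
补卡 (SIM card replacement) | 4341 | 手机信号 (mobile signal) | 55
宽带 (broadband) | 3225 | 彩铃 (ring-back tone) | 48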
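
The relation between this table and the "three times higher than the average" result in the abstract can be checked with a short calculation. The counts are copied from the table and the overall mean (1760.81) from the statistics table; reading the abstract's claim as "mean of the ten hot topics divided by the overall mean" is an interpretation, not something the page states.

```python
from statistics import mean

# Supporting-document counts copied from the table above.
hot = {"账单": 8070, "用户不认可": 6797, "副卡": 6733, "短号": 6408,
       "路由器": 6342, "数据流量": 5360, "国内上网": 4848, "线路": 4262,
       "补卡": 4341, "宽带": 3225}
ordinary = {"维修": 1643, "包月费": 1058, "违约金": 526, "上门移机": 428,
            "服务质量": 349, "一号双机": 249, "翼支付": 240, "租用": 205,
            "手机信号": 55, "彩铃": 48}

# Mean over all 299 topics, as reported in the statistics table.
OVERALL_MEAN = 1760.81

hot_mean = mean(list(hot.values()))   # 5638.6
ratio = hot_mean / OVERALL_MEAN       # roughly 3.2

# Every listed hot topic lies above the overall mean,
# and every listed ordinary topic lies below it.
separated = min(hot.values()) > OVERALL_MEAN > max(ordinary.values())
```

The ratio of about 3.2 is consistent with the abstract's figure, and the overall mean cleanly separates the two columns of the table.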
[1] Blei D M, Lafferty J D. Dynamic Topic Models[C]//Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, 2006: 113-120.
[2] Zhang Peijing, Song Lei. Overview on Topic Modeling of Microblogs Text Based on LDA[J]. Library and Information Service, 2012, 56(24): 120-126. (in Chinese)
[3] Weng J, Lim E P, Jiang J, et al. TwitterRank: Finding Topic-sensitive Influential Twitterers[C]//Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. ACM, 2010: 261-270.
[4] Hong L, Davison B D. Empirical Study of Topic Modeling in Twitter[C]//Proceedings of the 1st Workshop on Social Media Analytics. ACM, 2010: 80-88.
[5] Rosen-Zvi M, Griffiths T, Steyvers M, et al. The Author-Topic Model for Authors and Documents[C]//Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004: 487-494.
[6] Zhao W X, Jiang J, Weng J, et al. Comparing Twitter and Traditional Media Using Topic Models[C]//Proceedings of the 33rd European Conference on Information Retrieval. Springer Berlin Heidelberg, 2011: 338-349.
[7] Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-labeled Corpora[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009: 248-256.
[8] Zhang Chenyi, Sun Jianling, Ding Yiqun. Topic Mining for Microblog Based on MB-LDA Model[J]. Journal of Computer Research and Development, 2011, 48(10): 1795-1802. (in Chinese)
[9] Tang Xiaobo, Xiang Kun. Hotspot Mining Based on LDA Model and Microblog Heat[J]. Library and Information Service, 2014, 58(5): 58-63. (in Chinese) doi: 10.13266/j.issn.0252-3116.2014.05.010
[10] Zhu Ying. Hot Topic Extraction from Microblogs[D]. Chongqing: Southwest University, 2014. (in Chinese)
[11] Wu Wankun, Wu Qinglie, Gu Jinjiang. Hot Topic Extraction from E-commerce Microblog Based on EM-LDA Integrated Model[J]. New Technology of Library and Information Service, 2015(11): 33-40. (in Chinese)
[12] Rosen-Zvi M, Chemudugunta C, Griffiths T, et al. Learning Author-topic Models from Text Corpora[J]. ACM Transactions on Information Systems, 2010, 28(1): Article No.4. doi: 10.1145/1658377.1658381
[13] Zhao W X, Jiang J, He J, et al. Topical Key Phrase Extraction from Twitter[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. 2011: 379-388.
[14] Ramage D, Dumais S T, Liebling D J. Characterizing Microblogs with Topic Models[C]//Proceedings of the 4th International Conference on Weblogs and Social Media. 2010.
[15] Wu Suhui, Cheng Ying, Zheng Yanning, et al. Survey on K-means Algorithm[J]. New Technology of Library and Information Service, 2011(5): 28-35. (in Chinese)
[16] Zhu Chengwen, Li Bing, Hu Kui. Algorithm of Parameter Estimation of HMM via Gibbs Sampling[J]. Computer Engineering and Applications, 2012, 48(18): 57-60. (in Chinese) doi: 10.3778/j.issn.1002-8331.2012.18.012
[17] Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016, 32(9): 42-50. (in Chinese)
[18] Xu Jiajun, Yang Yang, Yao Tianfang, et al. LDA Based Hot Topic Detection and Tracking for the Forum[J]. Journal of Chinese Information Processing, 2016, 30(1): 43-50. (in Chinese)
[19] Zhang Liangjun, Wang Lu, Tan Liyun, et al. Python Practice of Data Analysis and Mining[M]. Machinery Industry Press, 2015. (in Chinese)
[20] jieba[CP/OL]. [2016-11-23].
[21] Stop Word Dictionary by Harbin Institute of Technology[OL]. [2016-11-23]. (in Chinese)
[22] JGibbLDA: A Java Implementation of Latent Dirichlet Allocation (LDA) Using Gibbs Sampling for Parameter Estimation and Inference[CP/OL]. [2016-11-23].