Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 173-181    DOI: 10.11925/infotech.2096-3467.2019.0643
Finding Geographic Locations of Popular Online Topics
Liu Yuwen1,2, Wang Kai1
1School of Health Management, Bengbu Medical College, Bengbu 233030, China
2College of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
Abstract  

[Objective] This paper analyzes the geographic distribution of popular online topics to support decision-making in public opinion management and social governance. [Methods] First, we introduced the location parameters of comments into the LDA model and proposed a region-oriented topic recognition model (RO-LDA). Second, we used this model to label texts, topics, locations, and vocabulary with location tags. Third, we constructed the text-topic, topic-word, and topic-location matrices. Finally, we identified trending topics and their geographic distributions from the topic-word and topic-location distributions. [Results] We evaluated the proposed model on a real data set. Its F-value reached 80.05%, which is higher than that of the existing models. [Limitations] The location tags were set manually, which affected the accuracy of region recognition. [Conclusions] The proposed method can effectively identify the geographic features of trending topics.

Key words: Region; Network Topic; Hot Events; RO-LDA Model; Topic Recognition
Received: 11 June 2019      Published: 26 April 2020
ZTFLH:  G210.7  
Corresponding Authors: Yuwen Liu     E-mail: lywzyfy@163.com

Cite this article:

Liu Yuwen,Wang Kai. Finding Geographic Locations of Popular Online Topics. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 173-181.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0643     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/173

Graphical Representation of LDA Model[10]
Graphical Representation of RO-LDA Model
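Variable  Description
α         Hyperparameter of the text-topic matrix A
β         Hyperparameter of the topic-word matrix B
η         Hyperparameter of the (topic, region)-location matrix H
A         Text-topic matrix
B         Topic-word matrix
H         (Topic, region)-location matrix
z         Topic
l         (Topic, region) pair
w         Word in a text
r         Location of a word
K         Number of topics
N         Total number of words in the corpus
M         Number of texts in the corpus
G         Number of location tags in a text
Introduction to the Variables in RO-LDA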
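The variables listed above suggest an LDA-style generative process with an additional location draw. The following is only a rough sketch under stated assumptions, not the authors' exact model: it collapses the (topic, region) pair l into the topic index alone, draws one location tag per word, and uses symmetric Dirichlet priors. The names `generate_corpus`, `V` (vocabulary size), `R` (number of distinct location tags), and `words_per_text` are hypothetical and do not come from the paper.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one symmetric Dirichlet vector by normalising Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(M, K, V, R, words_per_text, alpha, beta, eta, seed=0):
    """Generate (topic, word, location) triples for M texts.

    Assumption: the location r depends only on the topic z, which
    simplifies the (topic, region) pair l used in the paper.
    """
    rng = random.Random(seed)
    A = [sample_dirichlet(alpha, K, rng) for _ in range(M)]  # text-topic matrix
    B = [sample_dirichlet(beta, V, rng) for _ in range(K)]   # topic-word matrix
    H = [sample_dirichlet(eta, R, rng) for _ in range(K)]    # topic-location matrix (simplified)
    corpus = []
    for d in range(M):
        text = []
        for _ in range(words_per_text):
            z = sample_index(A[d], rng)  # topic of this word
            w = sample_index(B[z], rng)  # word drawn from the topic
            r = sample_index(H[z], rng)  # location tag drawn from the topic
            text.append((z, w, r))
        corpus.append(text)
    return A, B, H, corpus


A, B, H, corpus = generate_corpus(
    M=3, K=2, V=5, R=4, words_per_text=6, alpha=0.5, beta=0.5, eta=0.5
)
print(corpus[0])
```

Inference would run in the opposite direction, estimating A, B, and H from observed words and location tags; that step is not sketched here.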
Each topic is listed with its feature words and generation probabilities, followed by its locations and generation probabilities.

Topic 1
Feature words: hazardous 0.00095; waste 0.00086; garbage 0.00083; fine 0.00075; Chengdu 0.00072; 100,000 0.00071; sorting 0.00069; regulation 0.00069; collection point 0.00066; mixed in 0.00065; daily life 0.00063; new rule 0.00063; organization 0.00061; individual 0.00061; May 0.00059
Locations: r134 0.032; r141 0.030; r136 0.030; r137 0.029; r143 0.029; r156 0.027; r144 0.025; r139 0.025; r152 0.023; r146 0.022; r138 0.022; r141 0.021; r145 0.020; r135 0.020; r142 0.019; r146 0.017; r149 0.017; r148 0.015

Topic 2
Feature words: motor vehicle 0.00103; traffic 0.00098; violation 0.00096; behavior 0.00091; Tianjin 0.00086; item 0.00086; report 0.00085; reward 0.00077; impact 0.00073; 200,000 0.00069; driving 0.00068; accident 0.00062; road 0.00062; per case 0.00057; safety 0.00055
Locations: r196 0.027; r200 0.025; r201 0.025; r205 0.023; r195 0.020; r203 0.020; r194 0.017; r210 0.017; r216 0.016; r212 0.016; r208 0.016; r197 0.015; r199 0.015; r215 0.013; r221 0.012; r190 0.012; r207 0.011; r211 0.011

Topic 3
Feature words: ride-hailing 0.00062; traffic 0.00061; safety 0.00057; road 0.00056; regulations 0.00052; platform 0.00052; penalty 0.00051; order dispatch 0.00049; Nanjing 0.00049; face 0.00046; company 0.00044; governance 0.00043; passenger 0.00043; legal 0.00041; supervision 0.00041
Locations: r108 0.031; r103 0.028; r112 0.028; r105 0.027; r115 0.025; r116 0.023; r101 0.022; r120 0.022; r117 0.020; r113 0.019; r100 0.019; r120 0.018; r98 0.015; r108 0.015; r102 0.012; r122 0.012; r111 0.010; r106 0.010

Topic 4
Feature words: hospital 0.00051; Grade-3A 0.00051; order 0.00049; acute illness 0.00049; first come, first served 0.00046; emergency department 0.00046; grading 0.00045; Beijing 0.00044; professional 0.00039; seeking treatment 0.00038; priority 0.00038; critically ill 0.00036; patient 0.00033; medical staff 0.00033; change 0.00032
Locations: r220 0.022; r219 0.022; r217 0.018; r218 0.018; r225 0.017; r223 0.017; r230 0.016; r237 0.015; r231 0.015; r229 0.014; r222 0.014; r225 0.012; r232 0.012; r226 0.011; r228 0.011; r235 0.011; r233 0.011; r227 0.010

Topic 5
Feature words: primary school 0.00151; Shangrao 0.00144; killing 0.00136; knife 0.00128; head teacher 0.00119; Liu Shuai 0.00111; blood 0.00104; He Chen 0.00102; teacher 0.00101; Wang Moujian 0.00101; fifth 0.00098; Chinese (school subject) 0.00096; restroom 0.00085; doctor 0.00077; principal 0.00068
Locations: r88 0.019; r87 0.019; r85 0.018; r92 0.018; r83 0.018; r77 0.018; r134 0.017; r219 0.016; r75 0.016; r70 0.016; r8 0.015; r97 0.015; r152 0.015; r146 0.014; r160 0.014; r141 0.014; r2 0.013; r179 0.013

Topic 6
Feature words: insurance 0.00085; pension 0.00085; urban 0.00083; employee 0.00083; Ministry of Human Resources and Social Security 0.0081; ratio 0.0080; contribution 0.00080; medical expenses 0.00077; organization 0.00075; reduce 0.00072; social security 0.00068; unemployment 0.00067; adjustment 0.00061; work injury 0.00058; policy 0.00057
Locations: r220 0.020; r134 0.018; r196 0.018; r108 0.017; r223 0.017; r231 0.016; r146 0.016; r70 0.016; r77 0.016; r219 0.015; r205 0.015; r108 0.015; r37 0.015; r6 0.015; r194 0.015; r207 0.014; r69 0.014; r118 0.014

Topic 7
Feature words: La Liga 0.00078; Wu Lei 0.00077; Spain 0.00073; off-the-ball movement 0.00071; whistle 0.0071; hope 0.00068; starting lineup 0.00066; football 0.00065; king of football 0.00065; one-on-one 0.0063; China 0.00060; European competition 0.00060; isolated 0.00059; speed 0.00059; substitution 0.0056
Locations: r2 0.025; r8 0.025; r219 0.025; r223 0.023; r141 0.023; r71 0.023; r78 0.023; r169 0.022; r38 0.022; r227 0.022; r188 0.022; r192 0.022; r49 0.021; r201 0.021; r105 0.021; r83 0.012; r152 0.020; r78 0.019

Topic 8
Feature words: May Day 0.00131; packed 0.00130; tourism 0.00127; hotel 0.00126; West Lake 0.00122; Beijing 0.00121; passenger flow 0.00117; airplane 0.00112; Ctrip 0.00111; Huangshan 0.00108; peak 0.00108; outbound travel 0.00099; scenic area 0.00092; tourist 0.00087; crowded 0.00085
Locations: r86 0.023; r219 0.023; r16 0.022; r25 0.022; r133 0.022; r217 0.022; r156 0.021; r193 0.021; r158 0.021; r112 0.021; r104 0.021; r51 0.020; r28 0.020; r163 0.020; r179 0.020; r199 0.019; r46 0.019; r229 0.017
Recognition Results About Feature Words and Positions of Topics
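Reading the strongest locations of a topic off a topic-location distribution amounts to a top-k selection. The sketch below illustrates this with a few location probabilities copied from Topics 1 and 2 in the table above; the helper name `top_k` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
def top_k(distribution, k):
    """Return the k highest-probability (label, probability) pairs."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# A subset of location probabilities copied from the table (Topics 1 and 2).
topic_locations = {
    1: {"r144": 0.025, "r134": 0.032, "r156": 0.027, "r137": 0.029},
    2: {"r195": 0.020, "r200": 0.025, "r196": 0.027, "r205": 0.023},
}

for topic, distribution in topic_locations.items():
    print(topic, top_k(distribution, 2))
```

The same helper applies unchanged to a topic-word distribution, since both are mappings from a label to a probability.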
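No.  Location ID  Location name          Topic strength
1    r134         Jinjiang District      0.11
2    r141         Qingyang District      0.10
3    r136         Jinniu District        0.10
4    r137         Wuhou District         0.09
5    r143         Chenghua District      0.08
6    r156         Longquanyi District    0.07
7    r156         Qingbaijiang District  0.06
8    r140         Xindu District         0.05
9    r144         Wenjiang District      0.05
10   r139         Shuangliu District     0.05
11   r152         Jintang County         0.04
12   r146         Pi County              0.04
13   r138         Dayi County            0.04
14   r141         Pujiang County         0.03
15   r145         Xinjin County          0.03
16   r135         Guanghan City          0.02
17   r149         Jianyang City          0.01
18   r148         Chongzhou City         0.01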
Position Mapping Results of Topic 1 and Its Strength in Position
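No.  Location ID  Location name        Topic strength
1    r88          Xinzhou District     0.08
2    r87          Guangfeng District   0.08
3    r85          Shangrao County      0.08
4    r92          Nanchang County      0.06
5    r83          Qingshanhu District  0.06
6    r77          Pudong New Area      0.05
7    r134         Chaoyang District    0.05
8    r219         Haidian District     0.05
9    r75          Minhang District     0.05
10   r70          Xihu District        0.05
11   r8           Baiyun District      0.05
12   r97          Shushan District     0.05
13   r152         Jinshui District     0.05
14   r146         Huangpi District     0.04
15   r160         Wanzhou District     0.04
16   r141         Gulou District       0.04
17   r2           Futian District      0.04
18   r179         Zhangqiu District    0.04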
Position Mapping Results of Topic 5 and Its Strength in Position
Topics Strength Comparison
Density for Regional Topics and Wide Topics
Metric     TF-IDF  LDA    CNN-TTM  WTM    RO-LDA
Precision  74.65   75.32  73.98    77.17  82.15
Recall     75.73   78.41  78.62    81.58  78.06
F-value    75.19   76.83  76.23    79.31  80.05
Performances of Models(%)
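The F-values in the table can be checked against the other two rows. The sketch below assumes that the first row is precision and that the F-value is the balanced F1 measure (the harmonic mean of precision and recall); the function name `f_value` is illustrative. Under these assumptions, the recomputed values agree with the reported F-values to two decimal places.

```python
def f_value(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from the table above.
scores = {
    "TF-IDF": (74.65, 75.73),
    "LDA": (75.32, 78.41),
    "CNN-TTM": (73.98, 78.62),
    "WTM": (77.17, 81.58),
    "RO-LDA": (82.15, 78.06),
}

for model, (precision, recall) in scores.items():
    print(model, round(f_value(precision, recall), 2))
```

RO-LDA has the highest precision and F-value in the table, while WTM has the highest recall.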
[1] Momtazi S. Unsupervised Latent Dirichlet Allocation for Supervised Question Classification[J]. Information Processing and Management, 2018, 54(3): 380-393.
[2] Xu Yuemei, Lv Sining, Cai Lianqiao, et al. Analyzing News Topic Evolution with Convolutional Neural Networks and Topic2Vec[J]. Data Analysis and Knowledge Discovery, 2018, 2(9): 31-41. (in Chinese)
[3] Chen L, Zhang H Z, Jose J M, et al. Topic Detection and Tracking on Heterogeneous Information[J]. Journal of Intelligent Information Systems, 2018, 51(1): 115-137.
[4] Fu Peng, Lin Zheng, Yuan Fengcheng, et al. Convolutional Neural Network and User Information Based Model for Microblog Topic Tracking[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(1): 73-80. (in Chinese)
[5] Zhou Yadong, Liu Xiaoming, Du Youtian, et al. A Method for Identifying the Evolutionary Focuses of Online Social Topics[J]. Chinese Journal of Computers, 2015, 38(2): 261-271. (in Chinese)
[6] He Yue, Zhu Can, Zhu Tingting, et al. Research on the Emotional Tendency of Hot Topics in Micro-blogs[J]. Information Studies: Theory & Application, 2018, 41(7): 155-160. (in Chinese)
[7] Liao Haihan, Wang Yuefen, Guan Peng. Topic Mining and Viewpoint Recognition of Different Communicators in the Transmission Cycle of Micro-blog Public Opinion[J]. Library and Information Service, 2018, 62(19): 77-85. (in Chinese)
[8] Yu Chong, Li Jing, Sun Xudong, et al. Social Media Topic Recognition Based on Word Embedding and Probabilistic Topic Model[J]. Computer Engineering, 2017, 43(12): 184-191. (in Chinese)
[9] Fang Xiaofei, Huang Xiaoxi, Wang Rongbo, et al. Identifying Hot Topics from Mobile Complaint Texts[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 19-27. (in Chinese)
[10] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[11] Li Ping, Zhang Luyao, Cao Xia, et al. Hybrid Context Recommendation Algorithm Based on Latent Topic[J]. Journal of Electronics and Information Technology, 2018, 40(4): 957-963. (in Chinese)
[12] Zou Y P, Ouyang J H, Li X M. Supervised Topic Models with Weighted Words: Multi-Label Document Classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(4): 513-523.
[13] Li Weihao, Cao Jin, Li Hui. Privacy Self-Correlation Privacy-Preserving Scheme in LBS[J]. Journal on Communications, 2019, 40(5): 57-66. (in Chinese)
[14] Xian Xuefeng, Cui Zhiming, Zhao Pengpeng, et al. Location-awareness Publication Subscription System Based on Topic Model[J]. Computer Science, 2018, 45(3): 167-172. (in Chinese)
[15] Twinandilla S, Adhy S, Surarso B, et al. Multi-Document Summarization Using K-Means and Latent Dirichlet Allocation (LDA)-Significance Sentences[J]. Procedia Computer Science, 2018, 135: 663-670.
[16] Chen M L, Wang Q, Li X L. Patch-based Topic Model for Group Detection[J]. Science China Information Sciences, 2017, 60: Article No. 113101.