Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (9): 59-65    DOI: 10.11925/infotech.2096-3467.2018.0273
Clustering Policy Texts Based on LDA Topic Model
Zhang Tao1, Ma Haiqun2
1Information and Network Center, Heilongjiang University, Harbin 150080, China
2Research Center of Information Resource Management, Heilongjiang University, Harbin 150080, China
Abstract  

[Objective] This study aims to improve the clustering of policy texts using the LDA topic model. [Methods] First, we preprocessed the policy texts with the LDA model to generate the training data set. Then, we used a weighted algorithm to determine the optimal number of topics and clustered the policy texts accordingly. [Results] The G value of the weighted clustering results peaked at k = 4. The clustering results were consistent with the manual classification and achieved higher purity and F values, which indicates that the proposed method is effective. [Limitations] The result of each processing step affects the accuracy of the final policy text clustering. [Conclusions] The proposed method can inform the making of new policies, the evaluation of current policies, and mechanisms for two-way interaction.
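
The abstract reports purity and F values but does not give their formulas. As a minimal sketch, assuming the standard definitions (purity as the share of documents that fall in the majority class of their cluster, and the clustering F value as the class-size-weighted best F1 over clusters), the two measures could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of items that belong to the majority label of their cluster."""
    by_cluster = {}
    for clu, lab in zip(clusters, labels):
        by_cluster.setdefault(clu, Counter())[lab] += 1
    majority_total = sum(max(cnt.values()) for cnt in by_cluster.values())
    return majority_total / len(labels)


def f_measure(clusters, labels):
    """Class-size-weighted best F1 of each manual class over all clusters."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    label_sizes = Counter(labels)
    overlap = Counter(zip(labels, clusters))
    total = 0.0
    for lab, lab_size in label_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            hit = overlap[(lab, clu)]
            if hit == 0:
                continue
            precision = hit / clu_size
            recall = hit / lab_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (lab_size / n) * best
    return total


# Hypothetical toy data: cluster ids from the model, labels from manual classification.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["a", "a", "b", "b", "b", "c", "c", "c"]
print(purity(toy_clusters, toy_labels))
print(f_measure(toy_clusters, toy_labels))
```

Both measures lie in [0, 1], and higher values mean closer agreement with the manual classification. The G value mentioned in the abstract is not defined on this page, so it is not reproduced here.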

Keywords: Policy Text; LDA; Topic Model; Text Clustering
Received: 12 March 2018      Published: 25 October 2018
CLC Number: TP391

Cite this article:

Zhang Tao, Ma Haiqun. Clustering Policy Texts Based on LDA Topic Model. Data Analysis and Knowledge Discovery, 2018, 2(9): 59-65.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0273     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I9/59

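Topic 1 | Topic 2 | Topic 3 | Topic 4
tourism (旅游) 0.0213613 | information security (信息安全) 0.0204932 | tradition (传统) 0.0323346 | sports (体育) 0.0316222
city (城市) 0.0099187 | data (数据) 0.0168339 | craft (工艺) 0.0225906 | development (发展) 0.0255651
development (发展) 0.0087745 | information (信息) 0.0133837 | plan (计划) 0.0172757 | work (工作) 0.0160342
leisure (休闲) 0.0082023 | technology (技术) 0.0126518 | cultural relics (文物) 0.0168327 | construction (建设) 0.0097989
attention (注意) 0.0076302 | big data (大数据) 0.0116063 | specialty (专业) 0.0148396 | sports industry (体育产业) 0.0096209
consumption (消费) 0.0066767 | network (网络) 0.0116063 | inheritor (传承人) 0.0135109 | society (社会) 0.0075722
process (过程) 0.0061045 | data security (数据安全) 0.0101426 | craftsmanship (技艺) 0.0124036 | nation (国家) 0.0071268
adaptation (适应) 0.0059138 | enterprise (企业) 0.0086788 | celadon (青瓷) 0.0115178 | school (学校) 0.0070377
environment (环境) 0.0057231 | system (系统) 0.0085743 | art (艺术) 0.0104105 | carrying out activities (开展活动) 0.0063251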
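
To illustrate how per-topic word weights like those in the table above can be used, the sketch below scores a segmented document against each topic by summing the table weights of its tokens and picks the highest-scoring topic. This is a simplified keyword-weight heuristic for illustration only, not LDA inference and not the weighted algorithm used in the paper. Only the first three rows of the table are copied in; the function names and the sample document are hypothetical.

```python
# Word weights copied from the first three rows of the topic table above.
# English glosses: 旅游 tourism, 城市 city, 发展 development, 信息安全 information
# security, 数据 data, 信息 information, 传统 tradition, 工艺 craft, 计划 plan,
# 体育 sports, 工作 work.
TOPIC_WORDS = {
    "Topic 1": {"旅游": 0.0213613, "城市": 0.0099187, "发展": 0.0087745},
    "Topic 2": {"信息安全": 0.0204932, "数据": 0.0168339, "信息": 0.0133837},
    "Topic 3": {"传统": 0.0323346, "工艺": 0.0225906, "计划": 0.0172757},
    "Topic 4": {"体育": 0.0316222, "发展": 0.0255651, "工作": 0.0160342},
}


def score_topics(tokens, topic_words):
    """Sum the table weight of every token occurrence, separately per topic."""
    return {
        topic: sum(weights.get(tok, 0.0) for tok in tokens)
        for topic, weights in topic_words.items()
    }


def assign_topic(tokens, topic_words):
    """Return the highest-scoring topic, or None if no token matches any topic."""
    scores = score_topics(tokens, topic_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical, already-segmented document (sports, development, work, city).
sample_doc = ["体育", "发展", "工作", "城市"]
print(assign_topic(sample_doc, TOPIC_WORDS))
```

Note that a word such as 发展 (development) appears in more than one topic with different weights, which is why the score is accumulated per topic rather than by a single word-to-topic lookup.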