Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (9): 59-65    DOI: 10.11925/infotech.2096-3467.2018.0273
Clustering Policy Texts Based on LDA Topic Model
Tao Zhang1, Haiqun Ma2
1Information and Network Center, Heilongjiang University, Harbin 150080, China
2Research Center of Information Resource Management, Heilongjiang University, Harbin 150080, China
Abstract  

[Objective] This study aims to improve the clustering of policy texts with the help of the LDA topic model. [Methods] First, we pre-processed the policy texts with the LDA model to generate the training data set. Then, we used a weighted algorithm to determine the optimal number of topics and clustered the policy texts. [Results] The G value of the weighted clustering results peaked when k was 4. The clustering results were consistent with the manual classification and achieved higher purity and F values, which indicates that the proposed method is effective. [Limitations] The result of each processing step affects the accuracy of the final policy text clustering. [Conclusions] The proposed method could inform the making of new policies, the evaluation of current policies, and mechanisms for two-way interaction.
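
The abstract reports purity and F values measured against a manual classification but does not define them. As a rough illustration only, the sketch below computes both using the definitions commonly applied to clustering evaluation: purity as the share of documents in the majority class of their cluster, and F as the class-size-weighted best F1 between each manual class and any cluster. These definitions, the function names, and the toy labels are all assumptions, not details taken from the paper.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        per_cluster.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted best F1 between each manual class and any cluster."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, so F1 is zero
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Hypothetical toy data: manual classes vs. cluster ids for 8 documents.
manual = ["A", "A", "A", "B", "B", "C", "C", "C"]
clusters = [0, 0, 1, 1, 1, 2, 2, 2]
print(purity(manual, clusters))
print(clustering_f_measure(manual, clusters))
```

Both scores lie in [0, 1], and higher values mean closer agreement with the manual classes. The G value mentioned in the abstract is not covered here because its weighting scheme is not given in this summary.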

Key words: Policy Text, LDA, Topic Model, Text Clustering
Received: 12 March 2018      Published: 25 October 2018

Cite this article:

Tao Zhang, Haiqun Ma. Clustering Policy Texts Based on LDA Topic Model. Data Analysis and Knowledge Discovery, 2018, 2(9): 59-65.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0273     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I9/59
