Clustering Policy Texts Based on LDA Topic Model
Zhang Tao1, Ma Haiqun2
1 Information and Network Center, Heilongjiang University, Harbin 150080, China
2 Research Center of Information Resource Management, Heilongjiang University, Harbin 150080, China

Abstract [Objective] This research aims to improve the effectiveness of policy text clustering with the help of the LDA topic model. [Methods] First, we pre-processed the policy texts with the LDA model to generate the training data set. Then we used a weighted algorithm to determine the optimal number of topics and clustered the policy texts accordingly. [Results] The G value of the weighted clustering results peaked at k = 4. The resulting clusters were consistent with the manual classification and achieved higher purity and F values, which indicates that the proposed method is effective. [Limitations] The result of each processing step affects the accuracy of the final policy text clustering. [Conclusions] The proposed method could inform the making of new policies, the evaluation of current policies, and the mechanism of two-way interactions.
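
The abstract evaluates the clusters against a manual classification using purity and F values but does not state the formulas. As a minimal sketch, assuming the commonly used definitions (purity as the share of documents in the majority manual class of their cluster, and a class-size-weighted best-match F value), the two scores could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority manual class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, []).append(true_label)
    # For each cluster, count the documents of its most frequent manual class.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def clustering_f_value(labels_true, labels_pred):
    """Class-size-weighted F value: best F1 over all clusters for each manual class.

    Assumed definition; the paper's exact formula may differ.
    """
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for cluster_id, cluster_size in cluster_sizes.items():
            shared = overlap[(cls, cluster_id)]  # Counter returns 0 if absent
            if shared == 0:
                continue
            precision = shared / cluster_size
            recall = shared / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += cls_size / n * best
    return total


# Toy data (hypothetical): six documents, three manual classes, three clusters.
manual = ["a", "a", "a", "b", "b", "c"]
clustered = [0, 0, 1, 1, 1, 2]
purity_score = purity(manual, clustered)
f_score = clustering_f_value(manual, clustered)
```

Both scores lie in [0, 1], and a clustering that reproduces the manual classification exactly scores 1 on each.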
Received: 12 March 2018
Published: 25 October 2018