1. Information and Network Center, Heilongjiang University, Harbin 150080, China
2. Research Center of Information Resource Management, Heilongjiang University, Harbin 150080, China
[Objective] This study aims to improve the clustering of policy texts with the help of the LDA topic model. [Methods] First, we pre-processed the policy texts with the LDA model to generate the training data set. Then, we used a weighted algorithm to determine the optimal number of topics and clustered the policy texts accordingly. [Results] The G value of the weighted clustering results peaked when k was 4. The clustering results were consistent with the manual classification and achieved higher purity and F values, which indicates that the proposed method is effective. [Limitations] The result of each processing step affects the accuracy of the final policy text clustering. [Conclusions] The proposed method could inform the making of new policies, the evaluation of current policies, and the mechanism of two-way interactions.
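The abstract reports purity and F values measured against a manual classification but does not define them. The sketch below shows one common way to compute both, and it may differ from the definitions used in the study. The function names `purity` and `clustering_f_measure` are hypothetical, and the class-weighted best-match F-measure is an assumed formulation, not one confirmed by the text.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Share of documents that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


def clustering_f_measure(cluster_ids, class_labels):
    """Class-weighted F-measure (assumed definition): for each manual class,
    take the best F1 over all clusters, then weight by class size."""
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(class_labels)
    class_sizes = Counter(class_labels)
    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(class_labels, cluster_ids))
    total = 0.0
    for label, n_label in class_sizes.items():
        best = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            n_both = joint.get((label, cluster), 0)
            if n_both == 0:
                continue  # no overlap, so F1 is zero for this pair
            precision = n_both / n_cluster
            recall = n_both / n_label
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_label / n) * best
    return total


if __name__ == "__main__":
    # Toy data only: six documents, two clusters, two manual classes.
    clusters = [0, 0, 0, 1, 1, 1]
    manual = ["a", "a", "b", "b", "b", "b"]
    print("purity:", purity(clusters, manual))
    print("F-measure:", clustering_f_measure(clusters, manual))
```

Both scores lie between 0 and 1 and equal 1 when the clusters reproduce the manual classes exactly. Purity alone rises as the number of clusters grows, which is one reason to report it together with an F value.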