Data Analysis and Knowledge Discovery  2016, Vol. 32 Issue (12): 17-26    DOI: 10.11925/infotech.1003-3513.2016.12.03
Original Article
A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature
Hong Ma1, Yongming Cai2
1School of Transportation Law, Shandong Jiaotong University, Jinan 250357, China
2Business School, University of Jinan, Jinan 250022, China
Abstract  

[Objective] This paper aims to improve topic extraction from Chinese literature by combining the LDA model with co-word network analysis. [Methods] First, we added the keywords to the word segmentation dictionary used for the abstracts, which improved semantic recognition in topic analysis. Second, we proposed a Latent Dirichlet Allocation Model with Co-word Analysis (CA-LDA), which uses co-word network topology parameters (i.e., Betweenness Centrality) as weights to control the generated topic distribution. Finally, we extracted the words with both high connectivity (Betweenness Centrality) and high frequency. [Results] The CA-LDA model retrieved high-frequency and high-connectivity words simultaneously, both of which are important for subject analysis. With the help of co-word analysis, the proposed algorithm could also identify technical terms that act as key nodes in the network. [Limitations] The K value (number of topics) was obtained by cross validation with perplexity, so it was difficult to classify the document topics when K was large. More research is needed to deal with this issue. [Conclusions] The proposed model effectively analyzes the topics of Chinese literature on transportation law and could also process literature data from other fields automatically.

Key words: Latent Dirichlet Allocation Model with Co-word Analysis      Co-words      Network topology parameters      Stochastic gradient descent      Key word in transportation law literature
Received: 01 August 2016      Published: 22 January 2017
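
The co-word step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' CA-LDA implementation: it only builds a co-word network from per-document keyword lists, computes unnormalized Betweenness Centrality with Brandes' algorithm, and ranks words by a combined score. The function names, the sample keyword lists, and the scoring rule `frequency * (1 + betweenness)` are all assumptions made for illustration; the abstract does not state how the two quantities are combined.

```python
from collections import Counter, deque
from itertools import combinations


def build_coword_graph(docs):
    """Link two words whenever they occur in the same document."""
    graph = {}
    for words in docs:
        unique = sorted(set(words))
        for w in unique:
            graph.setdefault(w, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph
    (breadth-first variant of Brandes' algorithm)."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}   # -1 marks "not reached yet"
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair of endpoints was counted from both ends.
    return {v: c / 2 for v, c in bc.items()}


def rank_words(docs):
    """Rank words by a hypothetical score: frequency * (1 + betweenness)."""
    freq = Counter(w for words in docs for w in words)
    bc = betweenness_centrality(build_coword_graph(docs))
    return sorted(freq, key=lambda w: (-(freq[w] * (1 + bc[w])), w))


# Hypothetical keyword lists, one per abstract.
docs = [
    ["liability", "carrier", "contract"],
    ["carrier", "insurance"],
    ["insurance", "compensation"],
]
print(rank_words(docs)[:2])
```

In this toy network "carrier" and "insurance" are the only words that lie on shortest paths between other words, so they rank first. In a real pipeline the keyword lists would come from segmented abstracts, and the centrality values would feed into the topic model rather than a standalone ranking.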

Cite this article:

Hong Ma, Yongming Cai. A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature. Data Analysis and Knowledge Discovery, 2016, 32(12): 17-26.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.12.03     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I12/17
