Data Analysis and Knowledge Discovery  2016, Vol. 32 Issue (12): 17-26    DOI: 10.11925/infotech.1003-3513.2016.12.03
Research Paper
A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature (2000-2016)*
Hong Ma (马红)1, Yongming Cai (蔡永明)2
1 School of Transportation Law, Shandong Jiaotong University, Jinan 250357, China
2 Business School, University of Jinan, Jinan 250022, China
Abstract

[Objective] This paper combines the probabilistic topic extraction of the traditional LDA model with co-word network analysis, which reveals the linkage structure among the words of a literature corpus, in order to reduce the interference of high-frequency words that come from only a few documents and to improve topic cohesion. [Methods] For topic analysis of the abstracts of transportation law literature, we added the documents' keywords to the word segmentation dictionary as compound terms, which improved semantic recognition. We then proposed the CA-LDA model (Latent Dirichlet Allocation Model with Co-word Analysis), which adds co-word network analysis to the traditional LDA model: a topology parameter of the co-word network (betweenness centrality) serves as a weight that controls the assignment of words to topics, so that words with both high co-occurrence (betweenness) and high frequency are extracted first. [Results] The CA-LDA model retrieved high-frequency words that co-occur across many documents, and the resulting list of key words is more meaningful for topic analysis. The results reflect not only word-frequency probabilities but also the hub words identified from word associations, which supports a deeper understanding of the research hotspots in the field. [Limitations] The number of topics K was obtained by cross-validation with perplexity as the criterion. If K is too large in practice, it becomes difficult to classify and organize the document topics, and future work needs to process the results further to consolidate topics. [Conclusions] Applied to the analysis of hotspot topics in transportation law research, the model achieved good results on large-scale literature data. The approach could be extended to the automated processing of large-scale literature data in other fields.
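
The abstract names perplexity as the criterion for choosing K but does not state the formula. As an assumption about what was used, the standard held-out perplexity for topic models over M test documents is:

```latex
\operatorname{perplexity}(D_{\text{test}})
  = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Here $\mathbf{w}_d$ denotes the words of document $d$ and $N_d$ its length. Lower values indicate a better fit. In a cross-validation setting, the model would be fitted on the training folds for each candidate K and the perplexity averaged over the held-out folds.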

Key words: Latent Dirichlet Allocation Model with Co-word Analysis (CA-LDA); Co-words; Network topology parameters; Stochastic gradient descent; Key words in transportation law literature
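
The idea of favoring words that are both frequent and central in the co-word network can be illustrated with a small sketch. It is not the authors' CA-LDA implementation: the abstract does not give the weighting formula or say how the weight enters the LDA topic assignment, so the combination used here (frequency times one plus normalized betweenness), the function names, and the toy documents are all assumptions made for illustration. The sketch only ranks words and does not fit a topic model.

```python
from collections import Counter, deque
from itertools import combinations


def build_coword_graph(docs):
    """Link two words whenever they appear in the same document."""
    graph = {}
    for words in docs:
        unique = sorted(set(words))
        for w in unique:
            graph.setdefault(w, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


def centrality_weighted_scores(docs):
    """Assumed weighting: frequency * (1 + betweenness / max betweenness)."""
    freq = Counter(w for words in docs for w in words)
    bc = betweenness_centrality(build_coword_graph(docs))
    top = max(bc.values(), default=0.0) or 1.0
    return {w: freq[w] * (1.0 + bc[w] / top) for w in freq}


# Toy corpus (hypothetical): "insurance" is the most frequent word,
# but mostly because one document repeats it.
docs = [
    ["traffic accident", "liability", "insurance"],
    ["liability", "compensation"],
    ["compensation", "road safety"],
    ["insurance", "insurance"],
]
scores = centrality_weighted_scores(docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])  # ['liability', 'compensation', 'insurance']
```

By raw frequency "insurance" would rank first (3 occurrences), yet it bridges no other pair of words, so the weighted score places the hub words "liability" and "compensation" ahead of it. This mirrors the stated goal of damping high-frequency words that stem from only a few documents.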
Received: 2016-08-01
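Funding: *This work is one of the research outcomes of the Shandong Provincial Social Science Planning Project "Research on the Vulnerability of Shandong Province's Infrastructure Systems Based on Complex Network Theory" (Grant No. 14CGLJ03), the Shandong Provincial Graduate Teaching Innovation Project "Research on an Open Ecosystem for Improving Graduate Students' Academic Literacy Based on Online Learning" (Grant No. SDYC15045), and the Jinan Philosophy and Social Science Planning Project "Survey and Management Research on the Operation of Online Ride-hailing Taxis in Jinan" (Grant No. JNSK16C26).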
Cite this article:
Hong Ma, Yongming Cai. A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature[J]. Data Analysis and Knowledge Discovery, 2016, 32(12): 17-26. DOI: 10.11925/infotech.1003-3513.2016.12.03.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.12.03