New Technology of Library and Information Service  2016, Vol. 32 Issue (1): 48-54    DOI: 10.11925/infotech.1003-3513.2016.01.08
Original Article
An Improved Topic Model Integrating Extra-Features
Ruyi Yang, Dongsu Liu, Hui Li
School of Economics and Management, Xidian University, Xi’an 710126, China
Abstract  

[Objective] To reveal the relationships among the contents, topics, and authors of documents, this paper presents the Dynamic Author Topic (DAT) model, which extends the LDA model. [Context] Extracting features from large-scale text collections is an important task for informatics researchers. [Methods] First, NIPS conference papers are collected as the data set and preprocessed. The data set is then divided into time slices by publication date, and the slices form a first-order Markov chain. Perplexity is used to determine the number of topics. Finally, Gibbs sampling is used to estimate the author-topic and topic-word distributions in each time slice. [Results] The experiments show that each document can be represented by topic-word and author-topic probability distributions, and that the evolution of authors and topics can be observed along the time dimension. [Conclusions] The DAT model integrates content with extra features efficiently and supports text mining.
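
The perplexity step mentioned under [Methods] can be illustrated with a minimal sketch. It assumes the standard held-out perplexity used for LDA-style models, exp(-sum of log P(w|d) / number of tokens); the exact formula used in the paper is not given in this abstract. The function names `perplexity` and `pick_num_topics` and the argument layout are hypothetical, chosen only for illustration.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a corpus under an LDA-style topic model (assumed formula).

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k)
    """
    log_likelihood = 0.0
    token_count = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalize over topics to get P(w | d).
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            token_count += 1
    return math.exp(-log_likelihood / token_count)


def pick_num_topics(perplexity_by_k):
    """Return the candidate topic count with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

In practice one would train a model for each candidate topic count, evaluate `perplexity` on held-out documents, and pass the resulting dictionary to `pick_num_topics`. A lower value means the model assigns higher probability to the held-out words.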

Key words: LDA model; DAT model; Text mining; Gibbs sampling
Received: 17 July 2015      Published: 04 February 2016

Cite this article:

Ruyi Yang, Dongsu Liu, Hui Li. An Improved Topic Model Integrating Extra-Features. New Technology of Library and Information Service, 2016, 32(1): 48-54.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.01.08     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I1/48
