New Technology of Library and Information Service  2016, Vol. 32 Issue (1): 48-54    DOI: 10.11925/infotech.1003-3513.2016.01.08
Research Paper
An Improved Topic Model Integrating Extra-Features*
Ruyi Yang, Dongsu Liu, Hui Li
School of Economics and Management, Xidian University, Xi’an 710126, China
Abstract

[Objective] Building on the LDA model, this paper integrates time and author features and proposes the Dynamic Author Topic (DAT) model to better reveal the relationships among document content, topics, and authors. [Context] Feature extraction and semantic mining from massive text collections have become important work for information researchers. [Methods] NIPS conference papers are collected as the data set and preprocessed, then divided into time slices by publication year so that the slices form a first-order Markov chain. Perplexity is used to determine the optimal number of topics, and Gibbs sampling is used within each time slice to estimate the author-topic and topic-word probability distributions. [Results] Experiments show that the model represents documents as author-topic and topic-word probability distributions, and that changes in topic intensity and author interests can be observed along the time dimension. [Conclusions] The DAT model effectively integrates document content with external features to accomplish text mining.

Key words: LDA model; DAT model; Text mining; Gibbs sampling
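
The per-slice estimation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the DAT update equations or say how adjacent time slices are coupled, so the code below is an assumption: it runs the standard collapsed Gibbs sampler of the author-topic model (Rosen-Zvi et al., 2004) on a single time slice and then computes perplexity for several candidate topic numbers. The function names, hyperparameter values (`alpha`, `beta`) and the toy corpus are hypothetical and are not taken from the paper.

```python
import numpy as np

def author_topic_gibbs(docs, doc_authors, n_authors, n_topics, vocab_size,
                       alpha=0.5, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for an author-topic model on one time slice.

    docs: list of word-id lists; doc_authors: list of author-id lists.
    Returns (theta, phi): author-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    n_at = np.zeros((n_authors, n_topics))   # author-topic counts
    n_tw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_t = np.zeros(n_topics)                 # tokens assigned to each topic
    assign = []                              # (author, topic) per token
    # Random initialisation of every token's author and topic.
    for words, authors in zip(docs, doc_authors):
        doc_assign = []
        for w in words:
            a = authors[rng.integers(len(authors))]
            t = rng.integers(n_topics)
            n_at[a, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
            doc_assign.append((a, t))
        assign.append(doc_assign)
    for _ in range(n_iter):
        for d, (words, authors) in enumerate(zip(docs, doc_authors)):
            authors = np.asarray(authors)
            for i, w in enumerate(words):
                a, t = assign[d][i]
                # Remove the token's current assignment from the counts.
                n_at[a, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                # Joint conditional over (author, topic) pairs of this document.
                p_topic = (n_at[authors] + alpha) / (
                    n_at[authors].sum(axis=1, keepdims=True) + n_topics * alpha)
                p_word = (n_tw[:, w] + beta) / (n_t + vocab_size * beta)
                p = (p_topic * p_word).ravel()
                k = rng.choice(p.size, p=p / p.sum())
                a, t = authors[k // n_topics], k % n_topics
                n_at[a, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
                assign[d][i] = (a, t)
    theta = (n_at + alpha) / (n_at.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_tw + beta) / (n_t[:, None] + vocab_size * beta)
    return theta, phi

def perplexity(docs, doc_authors, theta, phi):
    """exp(-average log-likelihood per token); lower is better."""
    log_lik, n_tokens = 0.0, 0
    for words, authors in zip(docs, doc_authors):
        # p(w | d): average over the document's authors, then mix the topics.
        p_w = theta[authors].mean(axis=0) @ phi
        log_lik += np.log(p_w[words]).sum()
        n_tokens += len(words)
    return float(np.exp(-log_lik / n_tokens))

# Toy time slice (hypothetical data): two authors, six word ids.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5], [0, 3, 1, 4]]
doc_authors = [[0], [1], [0], [1], [0, 1]]
for k in (2, 3, 4):
    theta, phi = author_topic_gibbs(docs, doc_authors, 2, k, 6, n_iter=50)
    print(k, round(perplexity(docs, doc_authors, theta, phi), 3))
```

For brevity the perplexity here is evaluated on the same toy documents used for sampling; in practice it would normally be computed on held-out documents, and the candidate topic number with the lowest value would be kept. Running the sampler once per publication year yields one `theta` and one `phi` per slice, which is the kind of output from which topic intensity and author interest could be tracked over time.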
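Received: 2015-07-17
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China Young Scientists Fund project "Research on Construction Methods and Applications of Knowledge Bases Based on Trusted Semantic Wiki" (Grant No. 71203173).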
Cite this article:
Ruyi Yang, Dongsu Liu, Hui Li. An Improved Topic Model Integrating Extra-Features[J]. New Technology of Library and Information Service, 2016, 32(1): 48-54. DOI: 10.11925/infotech.1003-3513.2016.01.08.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.01.08