New Technology of Library and Information Service, 2016, Vol. 32, Issue (1): 48-54     https://doi.org/10.11925/infotech.1003-3513.2016.01.08
Research Paper
An Improved Topic Model Integrating Extra-Features*
Ruyi Yang, Dongsu Liu, Hui Li
School of Economics and Management, Xidian University, Xi'an 710126, China
Abstract

[Objective] This paper proposes the Dynamic Author Topic (DAT) model, which extends the LDA model with time and author features, to better reveal the relationships among document content, topics, and authors. [Context] Extracting features and mining semantics from large-scale text collections has become an important task for information science researchers. [Methods] NIPS conference papers are collected as the data set and preprocessed. The papers are then divided by publication year into time slices that form a first-order Markov chain. Perplexity is used to determine the optimal number of topics, and Gibbs sampling is used within each time slice to estimate the author-topic and topic-word probability distributions. [Results] Experiments show that the model represents documents as author-topic and topic-word probability distributions, and that changes in topic strength and in author interests can be observed along the time dimension. [Conclusions] The DAT model can effectively integrate document content with external features for text mining.

Key words: LDA model    DAT model    Text mining    Gibbs sampling
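
The pipeline summarised in the abstract (time slices by publication year, perplexity to choose the number of topics, Gibbs sampling for the author-topic and topic-word distributions) can be illustrated with a small sketch. This is not the authors' DAT implementation: the abstract does not give the sampling equations or say how adjacent slices are coupled, so the code below fits a plain author-topic model independently in each slice. The function names, the hyperparameters `alpha` and `beta`, the toy corpus, and the use of training-set perplexity are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def split_by_year(docs):
    """Group documents into time slices keyed by publication year, oldest first."""
    slices = defaultdict(list)
    for doc in docs:
        slices[doc["year"]].append(doc)
    return dict(sorted(slices.items()))


def fit_author_topic(docs, n_topics, alpha=0.5, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for a plain author-topic model on one time slice."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d["words"]}))}
    authors = {a: i for i, a in enumerate(sorted({a for d in docs for a in d["authors"]}))}
    K, V, A = n_topics, len(vocab), len(authors)
    n_ak = [[0] * K for _ in range(A)]   # tokens assigned to (author, topic)
    n_kw = [[0] * V for _ in range(K)]   # tokens assigned to (topic, word)
    n_a, n_k = [0] * A, [0] * K          # row totals of the two tables

    def update(a, k, w, delta):
        n_ak[a][k] += delta
        n_kw[k][w] += delta
        n_a[a] += delta
        n_k[k] += delta

    # Random initialisation: each token gets one of its document's authors and a topic.
    tokens = []
    for doc in docs:
        doc_authors = [authors[a] for a in doc["authors"]]
        for word in doc["words"]:
            tok = [vocab[word], doc_authors, rng.choice(doc_authors), rng.randrange(K)]
            update(tok[2], tok[3], tok[0], +1)
            tokens.append(tok)

    for _ in range(n_iter):
        for tok in tokens:
            w, doc_authors = tok[0], tok[1]
            update(tok[2], tok[3], w, -1)   # remove the token's current assignment
            pairs = [(a, k) for a in doc_authors for k in range(K)]
            weights = [
                (n_ak[a][k] + alpha) / (n_a[a] + K * alpha)
                * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                for a, k in pairs
            ]
            tok[2], tok[3] = rng.choices(pairs, weights=weights)[0]
            update(tok[2], tok[3], w, +1)   # record the resampled assignment

    theta = [[(n_ak[a][k] + alpha) / (n_a[a] + K * alpha) for k in range(K)] for a in range(A)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)] for k in range(K)]
    return {"theta": theta, "phi": phi, "vocab": vocab, "authors": authors}


def perplexity(docs, model):
    """exp(-mean log-likelihood per token), with authors uniform within a document."""
    theta, phi = model["theta"], model["phi"]
    log_lik, n_tokens = 0.0, 0
    for doc in docs:
        doc_authors = [model["authors"][a] for a in doc["authors"]]
        for word in doc["words"]:
            w = model["vocab"][word]
            p = sum(theta[a][k] * phi[k][w] for a in doc_authors for k in range(len(phi)))
            log_lik += math.log(p / len(doc_authors))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Toy corpus invented for illustration; the paper itself uses NIPS conference papers.
corpus = [
    {"year": 2000, "authors": ["A"], "words": ["kernel", "margin", "kernel"]},
    {"year": 2000, "authors": ["A", "B"], "words": ["bayes", "prior", "kernel"]},
    {"year": 2001, "authors": ["B"], "words": ["bayes", "prior", "posterior"]},
]
for year, slice_docs in split_by_year(corpus).items():
    # Keep the candidate topic count with the lowest perplexity in this slice.
    fits = {k: fit_author_topic(slice_docs, k) for k in (1, 2, 3)}
    best_k = min(fits, key=lambda k: perplexity(slice_docs, fits[k]))
    print(year, "best K:", best_k)
```

Each row of `theta` is one author's distribution over topics and each row of `phi` is one topic's distribution over words, so comparing an author's `theta` row across years gives a rough view of interest change. A faithful DAT implementation would additionally link consecutive slices through the first-order Markov chain mentioned in the abstract, and would normally measure perplexity on held-out documents.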
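Received: 2015-07-17      Published: 2016-02-04
Funding: *This paper is one of the research outcomes of the National Natural Science Foundation of China Young Scientists Fund project "Research on Construction Methods and Applications of Knowledge Bases Based on Trusted Semantic Wiki" (Grant No. 71203173).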
Cite this article:
Ruyi Yang, Dongsu Liu, Hui Li. An Improved Topic Model Integrating Extra-Features[J]. New Technology of Library and Information Service, 2016, 32(1): 48-54.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.01.08      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I1/48