Data Analysis and Knowledge Discovery, 2016, Vol. 32, Issue 12: 27-35     https://doi.org/10.11925/infotech.1003-3513.2016.12.04
Research Paper
Classifying Short Texts with Word Embedding and LDA Model
Qun Zhang, Hongjun Wang, Lunwen Wang
Electronic Engineering Institute of PLA, Hefei 230037, China

Abstract

[Objective] This paper proposes a short text classification method based on word embedding and the LDA topic model, aiming to address the weak topic focus and severe feature sparsity of short texts. [Methods] We modeled short text semantics at both the “word” and the “text” level. First, we trained word embeddings with Word2Vec and averaged them to obtain a short text vector at the “word” level. Second, we trained the LDA model with Gibbs sampling and expanded the features of each short text according to the maximum-topic-probability rule. Third, we weighted the expanded features by word-embedding similarity to obtain a short text vector at the “text” level. Finally, we concatenated the “word”-level and “text”-level vectors into an integrated short text representation and generated the classification with the k-Nearest Neighbors classifier. [Results] Compared with three single-model baselines (the vector space model, word embedding alone, and the LDA topic model alone), the precision, recall and F1 of the proposed method improved by at least 3.7%, 4.1% and 3.9%, respectively. [Limitations] The method was only examined with the k-Nearest Neighbors classifier; more research is needed on its performance with other classifiers such as Naive Bayes and support vector machines. [Conclusions] The proposed fused representation effectively alleviates the weak topic focus and feature sparsity of short texts and improves short text classification performance.
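The abstract describes the fused representation only at a high level. The Python sketch below illustrates one plausible reading of the pipeline, assuming three pre-computed inputs with hypothetical names: word_vecs (Word2Vec vectors per word), doc_topics (the LDA document-topic distribution, e.g. from Gibbs sampling) and topic_top_words (the top words of each topic). The max-cosine weighting of expanded features, the weighted averaging, and the use of scikit-learn's KNeighborsClassifier are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumed, pre-computed inputs (hypothetical names):
#   word_vecs       : dict mapping a word to its Word2Vec vector (np.ndarray of size dim)
#   doc_topics      : (n_docs, n_topics) LDA document-topic distribution
#   topic_top_words : list of length n_topics, each a list of that topic's top words

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def word_level_vector(tokens, word_vecs, dim):
    """'Word'-level vector: average of the Word2Vec vectors of the tokens."""
    vecs = [word_vecs[w] for w in tokens if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def text_level_vector(tokens, doc_topic_dist, topic_top_words, word_vecs, dim):
    """'Text'-level vector: expand the short text with the top words of its most
    probable LDA topic, weight each expanded word by its maximum embedding
    similarity to the original tokens, then take the weighted average
    (the weighting and averaging details are assumptions)."""
    best_topic = int(np.argmax(doc_topic_dist))          # maximum-topic-probability rule
    expanded = [w for w in topic_top_words[best_topic] if w in word_vecs]
    own = [word_vecs[w] for w in tokens if w in word_vecs]
    if not expanded or not own:
        return np.zeros(dim)
    vec, total = np.zeros(dim), 0.0
    for w in expanded:
        weight = max(cosine(word_vecs[w], v) for v in own)  # similarity-based feature weight
        vec += weight * word_vecs[w]
        total += weight
    return vec / total if total > 0 else vec

def fused_vector(tokens, doc_topic_dist, topic_top_words, word_vecs, dim):
    """Concatenate the word-level and text-level vectors (2 * dim dimensions)."""
    return np.concatenate([
        word_level_vector(tokens, word_vecs, dim),
        text_level_vector(tokens, doc_topic_dist, topic_top_words, word_vecs, dim),
    ])

# Classification with a nearest-neighbor classifier, as in the paper:
# X = np.vstack([fused_vector(t, doc_topics[i], topic_top_words, word_vecs, dim)
#                for i, t in enumerate(tokenized_docs)])
# clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
# y_pred = clf.predict(X[test_idx])
```

Any toolkit that yields the three assumed inputs can be substituted; the paper itself trains the embeddings with Word2Vec and the topic model with Gibbs sampling before this representation step.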

Keywords: Short text classification; Word embedding; Latent Dirichlet Allocation; k-Nearest Neighbors
Received: 2016-08-01      Published online: 2017-01-22
Funding: *This work is one of the outcomes of the National Natural Science Foundation of China project “Research on Constructive Machine Learning Methods for Dynamic Data Mining” (Grant No. 61273302).
Cite this article:
Qun Zhang, Hongjun Wang, Lunwen Wang. Classifying Short Texts with Word Embedding and LDA Model. Data Analysis and Knowledge Discovery, 2016, 32(12): 27-35.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.12.04      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I12/27