Data Analysis and Knowledge Discovery  2016, Vol. 32 Issue (12): 27-35    DOI: 10.11925/infotech.1003-3513.2016.12.04
Original Article
Classifying Short Texts with Word Embedding and LDA Model
Qun Zhang, Hongjun Wang, Lunwen Wang
Electronic Engineering Institute of PLA, Hefei 230037, China
Abstract  

[Objective] This paper proposes a short text classification method that combines word embedding with the LDA model, aiming to address the topic-focus and feature-sparsity issues. [Methods] First, we built short text semantic models at the “word” and “text” levels. Second, we trained word embeddings with Word2Vec and created a short text vector at the “word” level. Third, we trained the LDA model with Gibbs sampling, and then expanded the features of short texts according to the maximum LDA topic probability. Fourth, we weighted the expanded features by word embedding similarity to obtain the short text vector at the “text” level. Finally, we merged the “word” and “text” vectors into an integral short text vector and generated the classification with the k-Nearest Neighbors classifier. [Results] Compared to the traditional singleton-based methods, the precision, recall, and F1 score of the new method increased by 3.7%, 4.1%, and 3.9%, respectively. [Limitations] Our method was only examined with the k-Nearest Neighbors classifier. More research is needed to study its performance with other classifiers. [Conclusions] The proposed method could effectively improve the performance of short text classification systems.
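The pipeline described above can be sketched in a minimal, hypothetical form: a “word”-level vector is the mean of a text's word embeddings, a “text”-level vector is its LDA topic distribution, and the two are concatenated before kNN classification. The toy corpus, labels, and random stand-in embeddings below are illustrative only, and scikit-learn's variational LDA is used in place of the paper's Gibbs sampling and feature-expansion weighting.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier

# Toy short texts with two classes (0 = travel, 1 = sports); not from the paper.
corpus = ["cheap flight ticket deal", "football match score tonight",
          "discount hotel booking offer", "basketball league game result"]
labels = [0, 1, 0, 1]

# Random 4-dimensional vectors standing in for Word2Vec embeddings.
rng = np.random.default_rng(0)
vocab = sorted({w for doc in corpus for w in doc.split()})
emb = {w: rng.normal(size=4) for w in vocab}

def word_level_vector(doc):
    """'Word'-level representation: average of the word embeddings."""
    return np.mean([emb[w] for w in doc.split()], axis=0)

# 'Text'-level representation: LDA topic proportions over the corpus.
counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
text_vecs = lda.fit_transform(counts)

# Merge both levels into one short-text vector and classify with kNN.
word_vecs = np.vstack([word_level_vector(d) for d in corpus])
X = np.hstack([word_vecs, text_vecs])
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
pred = knn.predict(X)
```

The concatenation step is the key design choice: it lets the classifier see both fine-grained lexical semantics (embeddings) and coarse topical context (LDA), which is what compensates for the sparsity of short texts.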

Key words: Short text classification; Word embedding; Latent Dirichlet Allocation; k-Nearest Neighbors
Received: 01 August 2016      Published: 22 January 2017

Cite this article:

Qun Zhang, Hongjun Wang, Lunwen Wang. Classifying Short Texts with Word Embedding and LDA Model. Data Analysis and Knowledge Discovery, 2016, 32(12): 27-35.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.12.04     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I12/27

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn