Data Analysis and Knowledge Discovery, 2016, Vol. 32, Issue 12: 27-35. DOI: 10.11925/infotech.1003-3513.2016.12.04
Classifying Short Texts with Word Embedding and LDA Model
Qun Zhang, Hongjun Wang, Lunwen Wang
Electronic Engineering Institute of PLA, Hefei 230037, China
[Objective] This paper proposes a short text classification method based on word embedding and the LDA model, aiming to address the topic-focus and feature sparsity issues of short texts. [Methods] First, we built semantic models of short texts at the “word” and “text” levels. Second, we trained word embeddings with Word2Vec and created a short text vector at the “word” level. Third, we trained the LDA model with Gibbs sampling and expanded the features of each short text according to its maximum-probability LDA topic. Fourth, we weighted the expanded features by word embedding similarity to obtain a short text vector at the “text” level. Finally, we merged the “word” and “text” vectors into an integrated short text vector and classified the texts with the k-Nearest Neighbors classifier. [Results] Compared with the traditional singleton-based methods, the precision, recall, and F1 of the new method increased by 3.7%, 4.1%, and 3.9%, respectively. [Limitations] Our method was examined only with the k-Nearest Neighbors classifier. More research is needed to study its performance with other classifiers. [Conclusions] The proposed method could effectively improve the performance of short text classification systems.
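The pipeline described in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so every detail below is an assumption. The toy `EMBEDDINGS` table stands in for trained Word2Vec vectors, `TOPIC_WORDS` stands in for the top words of a Gibbs-sampled LDA model, topic inference is replaced by a crude word-overlap count, and all function names are hypothetical.

```python
import math
from collections import Counter

DIM = 2  # toy embedding dimension (assumption)

# Toy stand-ins (assumptions): real values would come from a trained
# Word2Vec model and an LDA model fitted with Gibbs sampling.
EMBEDDINGS = {
    "goal": [0.9, 0.1], "match": [0.8, 0.2], "team": [0.7, 0.3],
    "stock": [0.1, 0.9], "market": [0.2, 0.8], "price": [0.3, 0.7],
}
TOPIC_WORDS = {0: ["goal", "match", "team"], 1: ["stock", "market", "price"]}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_mean(vectors, weights):
    """Weighted average of vectors; zero vector if there is nothing to average."""
    total = sum(weights)
    if not vectors or total == 0:
        return [0.0] * DIM
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(DIM)]


def word_level_vector(tokens):
    """'Word'-level vector: plain average of the known word embeddings."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return weighted_mean(vecs, [1.0] * len(vecs))


def most_likely_topic(tokens):
    """Crude stand-in for LDA inference: topic whose top words overlap most."""
    return max(TOPIC_WORDS, key=lambda k: sum(t in TOPIC_WORDS[k] for t in tokens))


def text_level_vector(tokens, word_vec):
    """'Text'-level vector: topic words added as features, each weighted by
    its embedding similarity to the word-level vector."""
    expansion = [w for w in TOPIC_WORDS[most_likely_topic(tokens)]
                 if w not in tokens]
    weights = [max(cosine(EMBEDDINGS[w], word_vec), 0.0) for w in expansion]
    return weighted_mean([EMBEDDINGS[w] for w in expansion], weights)


def short_text_vector(tokens):
    """Merge the two levels by concatenation (one possible way to merge)."""
    word_vec = word_level_vector(tokens)
    return word_vec + text_level_vector(tokens, word_vec)


def knn_predict(tokens, train, k=3):
    """Majority vote among the k training texts most similar by cosine."""
    query = short_text_vector(tokens)
    ranked = sorted(train,
                    key=lambda item: cosine(query, short_text_vector(item[0])),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


TRAIN = [
    (["goal", "match"], "sports"), (["team", "goal"], "sports"),
    (["stock", "price"], "finance"), (["market", "stock"], "finance"),
]
```

Calling `knn_predict(["team", "match"], TRAIN)` classifies a two-word text using both its own word embeddings and the topic words it was expanded with. Concatenation is only one way to merge the two vectors; the abstract does not state which merge the authors used.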

Key words: Short text classification; Word embedding; Latent Dirichlet Allocation; k-Nearest Neighbors
Received: 01 August 2016      Published: 22 January 2017

Qun Zhang, Hongjun Wang, Lunwen Wang. Classifying Short Texts with Word Embedding and LDA Model. Data Analysis and Knowledge Discovery, 2016, 32(12): 27-35.

