%A Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong
%T Comparing Text Vector Generators for Weibo Short Text Classification
%0 Journal Article
%D 2018
%J Data Analysis and Knowledge Discovery
%R 10.11925/infotech.2096-3467.2018.0322
%P 41-50
%V 2
%N 8
%U {https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/article_4538.shtml}
%8 2018-08-25
%X

[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vectors for Sina Weibo posts, aiming to reduce computational cost and improve the efficiency of text classification. [Methods] First, we classified the posts using a 0-1 matrix with words as features and used the results as the baseline. Second, we used the Word2Vec algorithm to generate word vectors and built vector representations of the sentences from them in different ways. Third, we classified the Weibo posts using sentence vectors generated by the Sent2Vec algorithm. Finally, we comprehensively evaluated the advantages and disadvantages of the three methods. [Results] Both the Word2Vec and Sent2Vec algorithms reduced the number of text features significantly. We used 30,000 words as features and found that Word2Vec and Sent2Vec could reduce the number of features to fewer than 1,000. The classification accuracy of the Word2Vec algorithm was 75.14%, which was 3% lower than the baseline. The Sent2Vec algorithm was far less accurate than the other two methods, at only 63.08%. [Limitations] The corpus used in this paper needs to be expanded. We found that the Word2Vec algorithm did not have enough semantic information to compute word vectors, and that Sent2Vec gave poor classification results with Chinese sentence vectors. [Conclusions] The Word2Vec algorithm is suitable for classifying large-scale corpora, whereas words should be used as classification features when text is scarce.
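
The abstract does not give implementation details, so the following is only a minimal sketch of two of the representations named under [Methods]: a 0-1 word-occurrence matrix (the baseline features) and a sentence vector obtained by averaging word vectors, which is one common way to combine Word2Vec output and may not be the one the authors used. The function names, the English toy tokens, and the two-dimensional embeddings are all hypothetical; real word vectors would come from a trained Word2Vec model.

```python
from typing import Dict, List


def binary_matrix(docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Build a 0-1 matrix: entry [i][j] is 1 if vocab[j] occurs in docs[i]."""
    rows = []
    for doc in docs:
        present = set(doc)  # membership only; word counts are ignored
        rows.append([1 if word in present else 0 for word in vocab])
    return rows


def mean_vector(doc: List[str],
                embeddings: Dict[str, List[float]],
                dim: int) -> List[float]:
    """Average the vectors of the words in doc that have an embedding.

    Averaging is an assumption made for illustration; words without an
    embedding are skipped, and an all-zero vector is returned if none match.
    """
    known = [embeddings[word] for word in doc if word in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical, already-tokenized posts (real input would be segmented Chinese text).
docs = [["weather", "nice", "today"], ["stock", "market", "fell"]]
vocab = sorted({word for doc in docs for word in doc})

# Hypothetical 2-dimensional embeddings standing in for trained word vectors.
toy_embeddings = {"nice": [1.0, 0.0], "today": [0.0, 1.0]}

baseline_features = binary_matrix(docs, vocab)            # one column per vocabulary word
sentence_vector = mean_vector(docs[0], toy_embeddings, dim=2)  # one column per dimension
```

The contrast in feature count reported under [Results] follows from the shapes: the 0-1 matrix has one column per vocabulary word, while the averaged vector has one column per embedding dimension regardless of vocabulary size.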