Comparing Text Vector Generators for Weibo Short Text Classification
Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210023, China; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vectors for Sina Weibo text posts, aiming to lower the computational cost and raise the efficiency of text classification. [Methods] First, we classified the posts using a 0-1 matrix of word features and took the results as the baseline. Second, we used the Word2Vec algorithm to generate word vectors and composed them into sentence vectors in several different ways. Third, we classified the Weibo posts using sentence vectors generated by the Sent2Vec algorithm. Finally, we comprehensively evaluated the advantages and disadvantages of the three methods. [Results] Both Word2Vec and Sent2Vec reduced the number of text features significantly: starting from 30,000 words as features, both algorithms brought the feature count down to fewer than 1,000. The classification accuracy of the Word2Vec algorithm was 75.14%, which was 3% lower than the baseline. The accuracy of the Sent2Vec algorithm was far lower than that of the other two methods, at only 63.08%. [Limitations] The corpus used in this paper needs to be expanded. We found that the Word2Vec algorithm did not have enough semantic information to compute word vectors, and that Sent2Vec produced poor classification results with Chinese sentence vectors. [Conclusions] The Word2Vec algorithm is suitable for classifying large-scale corpora; when little text is available, words themselves should be used as classification features.
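To make the contrast between the first two feature representations concrete, the following is a minimal sketch, not the authors' implementation. It builds a 0-1 word-occurrence matrix and, separately, sentence vectors obtained by averaging word vectors. Averaging is only one common way to compose a sentence vector and is an assumption here, since the abstract does not state which compositions were tested. The function names, the toy documents, and the hand-written two-dimensional word vectors are all hypothetical; in practice the vectors would come from a trained Word2Vec model.

```python
def binary_matrix(docs, vocab):
    """Return a 0-1 document-term matrix as a list of rows.

    Entry (i, j) is 1 if vocab[j] occurs in docs[i], else 0.
    """
    index = {word: j for j, word in enumerate(vocab)}
    rows = []
    for tokens in docs:
        row = [0] * len(vocab)
        for token in tokens:
            if token in index:
                row[index[token]] = 1
        rows.append(row)
    return rows


def mean_sentence_vectors(docs, word_vectors, dim):
    """Average the word vectors of each document (assumed composition).

    Tokens without a vector are skipped; a document with no known
    tokens gets an all-zero vector.
    """
    rows = []
    for tokens in docs:
        known = [word_vectors[t] for t in tokens if t in word_vectors]
        if not known:
            rows.append([0.0] * dim)
            continue
        rows.append([sum(v[k] for v in known) / len(known) for k in range(dim)])
    return rows


# Hypothetical, already-segmented toy posts.
docs = [["weibo", "post", "short"], ["short", "text", "classification"]]
vocab = sorted({token for doc in docs for token in doc})

# Hypothetical 2-dimensional word vectors standing in for Word2Vec output.
toy_vectors = {
    "weibo": [1.0, 0.0],
    "post": [0.0, 1.0],
    "short": [1.0, 1.0],
    "text": [0.0, 0.0],
    "classification": [2.0, 2.0],
}

X_binary = binary_matrix(docs, vocab)                    # one column per vocabulary word
X_dense = mean_sentence_vectors(docs, toy_vectors, dim=2)  # one column per vector dimension
```

The width of `X_binary` grows with the vocabulary, while the width of `X_dense` is fixed by the chosen vector dimension, which is the kind of feature reduction the results describe.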
Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong. Comparing Text Vector Generators for Weibo Short Text Classification[J]. Data Analysis and Knowledge Discovery, 2018, 2(8): 41-50.