Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (8): 41-50    DOI: 10.11925/infotech.2096-3467.2018.0322
Comparing Text Vector Generators for Weibo Short Text Classification
Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vectors for Sina Weibo posts, aiming to reduce the computational cost and improve the efficiency of text classification. [Methods] First, we classified the posts using a 0-1 matrix of words as features and took the results as the baseline. Second, we used the Word2Vec algorithm to generate word vectors and composed them into sentence vectors in several different ways. Third, we classified the Weibo posts using sentence vectors generated by the Sent2Vec algorithm. Finally, we compared the advantages and disadvantages of the three methods. [Results] Both Word2Vec and Sent2Vec reduced the number of text features significantly: with about 30,000 words as the original features, they reduced the number of features to fewer than 1,000. The classification accuracy of Word2Vec was 75.14%, about 3 percentage points lower than the baseline. The accuracy of Sent2Vec was far lower than that of the other two methods, at only 63.08%. [Limitations] The corpus needs to be expanded. With the current corpus, Word2Vec did not have enough semantic information to compute word vectors well, and Sent2Vec produced poor classification results with Chinese sentence vectors. [Conclusions] Word2Vec is suitable for classifying large-scale corpora; when text is scarce, words themselves should be used as classification features.
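
As an illustration of the baseline representation, the following is a minimal sketch of a 0-1 word matrix, assuming the posts have already been segmented into words. The function names and the toy posts are hypothetical and are not taken from the paper; the classifier itself is not shown.

```python
def build_vocabulary(tokenized_posts):
    """Map each distinct word to a column index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_posts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_binary_matrix(tokenized_posts, vocab):
    """Return one 0-1 row per post: 1 if the word occurs in the post, else 0."""
    matrix = []
    for tokens in tokenized_posts:
        row = [0] * len(vocab)
        for token in tokens:
            index = vocab.get(token)
            if index is not None:
                row[index] = 1  # repeated words still yield 1, not a count
        matrix.append(row)
    return matrix


# Toy data (hypothetical): two already-segmented posts.
posts = [
    ["stock", "market", "rises"],
    ["market", "travel", "travel"],
]
vocab = build_vocabulary(posts)
matrix = to_binary_matrix(posts, vocab)
```

Each distinct word becomes one feature column, which is why a vocabulary of tens of thousands of words leads to an equally wide feature space.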

Keywords: Short Text Classification; Word2Vec; Colloquial Text; Word Vector Composition; Sentence Vector
Received: 23 March 2018      Published: 08 September 2018
ZTFLH:  TP393 G350  

Cite this article:

Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong. Comparing Text Vector Generators for Weibo Short Text Classification. Data Analysis and Knowledge Discovery, 2018, 2(8): 41-50.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0322     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I8/41

Category ID  Category  Count    Category ID  Category  Count
0            Finance   2,225    5            Emotion   2,052
1            Reading   2,177    6            Digital   2,114
2            Military  2,098    7            Campus    2,400
3            Travel    2,091    8            Health    2,056
4            Food      2,107    9            Games     2,327

Training set  Test set  Number of features  Accuracy
19,293        2,140     34,378              0.7832

Category  Count  Precision  Recall  F1
Finance   223    0.8894     0.8296  0.8585
Reading   217    0.7327     0.6820  0.7064
Military  207    0.9196     0.8841  0.9015
Travel    209    0.7744     0.7225  0.7476
Food      209    0.8318     0.8756  0.8531
Emotion   198    0.5428     0.8333  0.6574
Digital   209    0.8418     0.7129  0.7720
Campus    239    0.8421     0.8033  0.8222
Health    202    0.6852     0.6386  0.6611
Games     227    0.8761     0.8414  0.8584
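
The F1 column in the per-category table above can be checked against the other two score columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name is hypothetical, and the inputs are the values from the first row (0.8894, 0.8296) and the sixth row (0.5428, 0.8333) of that table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the first and sixth rows of the per-category table.
first_row_f1 = f1_score(0.8894, 0.8296)
sixth_row_f1 = f1_score(0.5428, 0.8333)

print(round(first_row_f1, 4))
print(round(sixth_row_f1, 4))
```

Under this definition the results agree with the reported F1 values to four decimal places, which suggests the third column of that table is per-category precision rather than overall accuracy.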

Dimension  Sum of all word vectors  Sum of de-duplicated word vectors  Average of de-duplicated word vectors
700        74.64%                   74.72%                             75.14%
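
The three ways of composing a sentence vector compared above (summing all word vectors, summing after de-duplication, and averaging after de-duplication) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 2-dimensional toy vectors, and the choice to skip out-of-vocabulary words are all hypothetical, whereas the table reports results for 700 dimensions.

```python
def sum_vectors(vectors, dim):
    """Element-wise sum of a list of equal-length vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for i, value in enumerate(vec):
            total[i] += value
    return total


def compose(tokens, word_vectors, dim, deduplicate=False, average=False):
    """Build a sentence vector from the word vectors of its tokens.

    Words missing from word_vectors are skipped (an assumption made here).
    """
    if deduplicate:
        tokens = list(dict.fromkeys(tokens))  # drop repeats, keep order
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    total = sum_vectors(vectors, dim)
    if average and vectors:
        return [value / len(vectors) for value in total]
    return total


# Toy 2-dimensional word vectors (hypothetical values).
word_vectors = {"good": [1.0, 0.0], "food": [0.0, 2.0]}
post = ["good", "good", "food"]

sum_all = compose(post, word_vectors, dim=2)
sum_unique = compose(post, word_vectors, dim=2, deduplicate=True)
mean_unique = compose(post, word_vectors, dim=2, deduplicate=True, average=True)
```

Whichever variant is used, the sentence vector has the same length as a single word vector, so the feature count is fixed by the embedding dimension rather than by the vocabulary size.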
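
Word             Distance    Word             Distance
郑州 (Zhengzhou)  0.9539      广州 (Guangzhou)  0.9244
青岛 (Qingdao)    0.9499      长沙 (Changsha)   0.9194
上海 (Shanghai)   0.9483      桃源 (Taoyuan)    0.9193
西安 (Xi'an)      0.9390      唐山 (Tangshan)   0.9143
河南 (Henan)      0.9254      南京 (Nanjing)    0.9137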
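
A nearest-word listing like the one above is commonly produced by ranking vocabulary words by cosine similarity to a query word's vector. The sketch below assumes that this is what the "distance" column measures, which this page does not confirm; the function names and toy vectors are hypothetical, and the query word behind the table is not given here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_words(query, word_vectors, top_n=5):
    """Rank all other words by cosine similarity to the query word."""
    query_vec = word_vectors[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in word_vectors.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy vectors (hypothetical): two similar "city" words and one unrelated word.
toy_vectors = {
    "city_a": [1.0, 0.0],
    "city_b": [0.8, 0.6],
    "river": [0.0, 1.0],
}
neighbours = nearest_words("city_a", toy_vectors, top_n=2)
```

Values close to 1 indicate words that occur in similar contexts, which would be consistent with the table being dominated by place names.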