Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (8): 41-50     https://doi.org/10.11925/infotech.2096-3467.2018.0322
Research Paper
Comparing Text Vector Generators for Weibo Short Text Classification*
Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vector representations of Sina Weibo posts, aiming for lower computational cost and better classification performance. [Methods] First, we classified the posts using a 0-1 matrix of the words they contain and took the result as the baseline. Second, we used Word2Vec to generate word vectors, composed them into sentence vectors in several ways, classified the posts, and compared the results with the baseline. Third, we classified the posts using sentence vectors generated directly by Sent2Vec. Finally, we evaluated the strengths and weaknesses of the three methods. [Results] Both Word2Vec and Sent2Vec compress the text features substantially: compared with using all of the more than 30,000 words as features, they reduce the number of features to fewer than 1,000. The classification accuracy of Word2Vec was 75.14%, about 3 percentage points below the baseline. Sent2Vec performed far worse than the other two methods, with an accuracy of only 63.08%. [Limitations] Because the corpus is limited, Word2Vec may lack sufficient semantic information when computing word vectors, which lowers their quality, and Sent2Vec produced poor classification results for sentence vectors generated from Chinese text. [Conclusions] Word2Vec is better suited to classifying large-scale corpora; when little text is available, words should be used directly as classification features.

Keywords: Short Text Classification; Word2Vec; Colloquial Text; Word Vector Composition; Sentence Vector
Received: 2018-03-23      Published: 2018-09-08
CLC Number (ZTFLH):  TP393 G350
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on the Measurement and Analysis of TSD and TDC for Academic Resources" (Grant No. 71503121) and the "Jiangsu Young Social Science Talents" training program.
Cite this article:
Li Xinlei, Wang Hao, Liu Xiaomin, Deng Sanhong. Comparing Text Vector Generators for Weibo Short Text Classification[J]. Data Analysis and Knowledge Discovery, 2018, 2(8): 41-50.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0322      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I8/41
  Research framework
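| Category ID | Category | Count |
|---|---|---|
| 0 | Finance | 2,225 |
| 1 | Reading | 2,177 |
| 2 | Military | 2,098 |
| 3 | Travel | 2,091 |
| 4 | Food | 2,107 |
| 5 | Emotion | 2,052 |
| 6 | Digital | 2,114 |
| 7 | Campus | 2,400 |
| 8 | Health | 2,056 |
| 9 | Games | 2,327 |
  Distribution of the collected popular Weibo data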
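| Training set | Test set | Number of features | Accuracy |
|---|---|---|---|
| 19,293 | 2,140 | 34,378 | 0.7832 |
  Classification results using a one-hot matrix with words as features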
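
The baseline above represents each post as a 0-1 vector over the vocabulary. A minimal sketch of that representation follows, assuming the posts are already segmented into words. The toy posts and the helper name `build_binary_matrix` are hypothetical and do not come from the paper.

```python
def build_binary_matrix(segmented_posts):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word j occurs in post i."""
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({word for post in segmented_posts for word in post})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for post in segmented_posts:
        row = [0] * len(vocabulary)
        for word in post:
            row[index[word]] = 1  # presence only: repeated words still give 1
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical, already-segmented toy posts (not data from the paper).
posts = [
    ["stock", "market", "rises"],
    ["new", "phone", "market"],
    ["stock", "stock", "falls"],
]
vocabulary, matrix = build_binary_matrix(posts)
```

Each row has one column per vocabulary word, so with the 34,378 features reported above every post would become a 34,378-dimensional vector. That width is what the Word2Vec and Sent2Vec representations are meant to reduce.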
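| Category | Count | Precision | Recall | F1 |
|---|---|---|---|---|
| Finance | 223 | 0.8894 | 0.8296 | 0.8585 |
| Reading | 217 | 0.7327 | 0.6820 | 0.7064 |
| Military | 207 | 0.9196 | 0.8841 | 0.9015 |
| Travel | 209 | 0.7744 | 0.7225 | 0.7476 |
| Food | 209 | 0.8318 | 0.8756 | 0.8531 |
| Emotion | 198 | 0.5428 | 0.8333 | 0.6574 |
| Digital | 209 | 0.8418 | 0.7129 | 0.7720 |
| Campus | 239 | 0.8421 | 0.8033 | 0.8222 |
| Health | 202 | 0.6852 | 0.6386 | 0.6611 |
| Games | 227 | 0.8761 | 0.8414 | 0.8584 |
  Classification metrics for the 10 categories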
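
The F1 column above is consistent with the usual harmonic mean of precision and recall. The short check below recomputes it for three rows of the table. The function name `f1_score` and the English category labels are illustrative choices, not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the table above.
reported = {
    "Finance": (0.8894, 0.8296, 0.8585),
    "Military": (0.9196, 0.8841, 0.9015),
    "Emotion": (0.5428, 0.8333, 0.6574),
}

# Recompute F1 from precision and recall, rounded like the table.
recomputed = {
    name: round(f1_score(p, r), 4) for name, (p, r, _) in reported.items()
}
```

For these three rows the recomputed values agree with the reported F1 to within rounding. The Emotion row shows why F1 is useful here: its recall is high (0.8333) but its precision is low (0.5428), and F1 reflects that imbalance.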
  Classification accuracy for dimensions from 1,000 to 2,500
  Classification accuracy for dimensions below 1,000
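| Dimensions | Sum of all word vectors | Sum of deduplicated word vectors | Mean of deduplicated word vectors |
|---|---|---|---|
| 700 | 74.64% | 74.72% | 75.14% |
  Classification results of different word vector composition methods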
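
The three composition methods compared above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy two-dimensional embeddings, the function names, and the choice to skip words without a vector are all hypothetical.

```python
def sum_vectors(vectors, dim):
    """Element-wise sum of a list of equal-length vectors."""
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def compose(words, embeddings, method):
    """Build a sentence vector from word vectors.

    method is one of:
      "sum_all"     - sum the vectors of all words, repeats included
      "sum_unique"  - sum the vectors after removing repeated words
      "mean_unique" - average the vectors after removing repeated words
    """
    if method not in ("sum_all", "sum_unique", "mean_unique"):
        raise ValueError(f"unknown method: {method}")
    dim = len(next(iter(embeddings.values())))
    # Assumption: words without a vector are skipped.
    known = [w for w in words if w in embeddings]
    if method != "sum_all":
        known = list(dict.fromkeys(known))  # deduplicate, keep first-seen order
    total = sum_vectors([embeddings[w] for w in known], dim)
    if method == "mean_unique" and known:
        return [value / len(known) for value in total]
    return total


# Hypothetical toy embeddings; the paper reports results for 700 dimensions.
toy_embeddings = {"good": [1.0, 2.0], "food": [3.0, 0.0]}
sentence = ["good", "good", "food", "unknown"]

sum_all = compose(sentence, toy_embeddings, "sum_all")
sum_unique = compose(sentence, toy_embeddings, "sum_unique")
mean_unique = compose(sentence, toy_embeddings, "mean_unique")
```

Whichever method is used, the sentence vector keeps the dimensionality of the word vectors, so the feature count is fixed by the embedding size rather than by the vocabulary size.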
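  Sentence vector classification results
  Classification results for different ways of obtaining the feature matrix
  Classification results on open data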
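| Word | Distance | Word | Distance |
|---|---|---|---|
| Zhengzhou (郑州) | 0.9539 | Guangzhou (广州) | 0.9244 |
| Qingdao (青岛) | 0.9499 | Changsha (长沙) | 0.9194 |
| Shanghai (上海) | 0.9483 | Taoyuan (桃源) | 0.9193 |
| Xi'an (西安) | 0.9390 | Tangshan (唐山) | 0.9143 |
| Henan (河南) | 0.9254 | Nanjing (南京) | 0.9137 |
  The 10 words most similar to "Beijing" (北京)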
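
A nearest-neighbour list like the one above is usually obtained by ranking every other word by its similarity to the query word's vector. The sketch below assumes cosine similarity, which the source does not confirm. The toy vectors and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_n=10):
    """Return the top_n (word, similarity) pairs closest to the query word."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query  # a word is trivially most similar to itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical toy vectors, not values from the paper.
toy_vectors = {
    "Beijing": [1.0, 0.0],
    "Shanghai": [0.9, 0.1],
    "Nanjing": [0.5, 0.5],
    "apple": [0.0, 1.0],
}
neighbours = most_similar("Beijing", toy_vectors, top_n=2)
```

In this toy example the two city vectors rank above the unrelated word, which is the pattern the table shows for real embeddings: the nearest neighbours of a city name are mostly other place names.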