New Technology of Library and Information Service  2013, Vol. 29 Issue (2): 30-35    DOI: 10.11925/infotech.1003-3513.2013.02.05
Knowledge Organization and Knowledge Management
An Algorithm of Short Text Classification Based on Semi-supervised Learning
Zhang Qian (张倩), Liu Huailiang (刘怀亮)
School of Economics and Management, Xidian University, Xi'an 710071, China
Abstract: Short texts contain few feature words and only weakly related information, and labeling large numbers of samples is a bottleneck, so traditional text classification methods cannot be applied to them directly with good results. This paper introduces semi-supervised learning into the text classification process and proposes a short text classification method based on it. The method expands short text features with an external web knowledge base, builds a semi-supervised classification model, and lets an initial classifier learn iteratively by self-training so that the unlabeled part of the training samples is fully used. This eases the labeling bottleneck and improves classifier performance. Comparative experiments show that the method improves short text classification results.
Key words: Semi-supervised learning; Text classification; Short text; Self-training
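The abstract names two steps, feature expansion from an external knowledge base and iterative self-training from an initial classifier, without giving their details here. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes a multinomial naive Bayes base classifier, a confidence threshold for accepting pseudo-labels, and a plain dictionary (`related_terms`) standing in for the knowledge base; all function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def expand(tokens, related_terms):
    """Append related terms from a knowledge-base lookup (here a plain dict, an assumption)."""
    return tokens + [t for tok in tokens for t in related_terms.get(tok, [])]


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return {label: posterior probability}, using Laplace smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled documents.

    labeled: list of (tokens, label) pairs; unlabeled: list of token lists.
    Returns the final model and the documents that were never accepted.
    """
    docs = [tokens for tokens, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    model = train_nb(docs, labels)  # initial classifier from labeled data only
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            proba = predict_proba(model, tokens)
            label, confidence = max(proba.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                docs.append(tokens)
                labels.append(label)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:  # nothing new was accepted, so stop iterating
            break
        model = train_nb(docs, labels)  # retrain on labeled plus pseudo-labeled data
    return model, pool


# Toy usage with made-up data
related = {"goal": ["football"], "stock": ["finance"]}
labeled = [(["football", "match"], "sports"), (["finance", "market"], "business")]
unlabeled = [expand(doc, related) for doc in (["goal"], ["stock"])]
model, leftover = self_train(labeled, unlabeled, threshold=0.6)
print(predict_proba(model, ["goal"]))
```

In this toy run the word "goal" never occurs in the labeled data, so the initial classifier cannot separate the classes on it; the expanded unlabeled documents are what link it to a class. The threshold trades coverage against the risk of reinforcing wrong pseudo-labels, and its value here is illustrative only.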
Received: 2013-01-27
CLC number: TP391.1
Corresponding author: Zhang Qian, E-mail: zqvictory2011@yeah.net
Cite this article:
Zhang Qian, Liu Huailiang. An Algorithm of Short Text Classification Based on Semi-supervised Learning[J]. New Technology of Library and Information Service, 2013, 29(2): 30-35. DOI: 10.11925/infotech.1003-3513.2013.02.05.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.02.05