New Technology of Library and Information Service  2013, Vol. 29, Issue 2: 30-35     https://doi.org/10.11925/infotech.1003-3513.2013.02.05
Knowledge Organization and Knowledge Management
An Algorithm of Short Text Classification Based on Semi-supervised Learning
Zhang Qian, Liu Huailiang
School of Economics and Management, Xidian University, Xi'an 710071, China
Abstract: Short texts contain few feature words and only weakly related information, and labeling large numbers of samples is a bottleneck, so traditional text classification methods cannot be applied to them directly with good results. This paper introduces semi-supervised learning into the text classification process and proposes a short text classification method based on it. The method expands short-text features using an external Web knowledge base, builds a semi-supervised classification model, and uses an initial classifier for iterative self-training so that the unlabeled portion of the training samples is fully exploited, thereby addressing the labeling bottleneck and improving classifier performance. Comparative experiments show that the method improves short text classification.
Keywords: Semi-supervised learning; Text classification; Short text; Self-training
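The iterative self-training step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the base classifier, the confidence threshold, or the stopping rule, so everything below is an assumption made for illustration: a small multinomial Naive Bayes classifier with add-one smoothing, a hypothetical `threshold` parameter, and a loop that stops when no unlabeled sample is confident enough. The feature-expansion step that uses the external knowledge base is not shown.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return {label: posterior probability} using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    peak = max(log_scores.values())
    exp_scores = {l: math.exp(s - peak) for l, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {l: v / norm for l, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    """Assumed self-training loop: pseudo-label confident samples, then retrain."""
    docs = [tokens for tokens, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    model = train_nb(docs, labels)  # initial classifier from labeled data only
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            probs = predict_proba(model, tokens)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                docs.append(tokens)
                labels.append(best)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:
            break  # nothing confident enough; stop iterating
        model = train_nb(docs, labels)
    return model, pool


# Toy data, invented for this example only.
labeled = [(["football", "match", "goal"], "sports"),
           (["stock", "market", "shares"], "finance")]
unlabeled = [["goal", "match", "team"], ["shares", "market", "bank"],
             ["team", "coach"], ["bank", "loan"]]
model, leftover = self_train(labeled, unlabeled)
print(predict_proba(model, ["coach"]))
```

In this toy run, the word "coach" never appears in the labeled data, so the initial classifier cannot separate the two classes on it. After the confident pseudo-labeled samples are folded into the training set, the retrained model favors "sports". A lower threshold absorbs unlabeled samples faster but risks reinforcing early mistakes, which is the usual trade-off in self-training.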
Received: 2013-01-27      Published: 2013-04-24
CLC number: TP391.1
Corresponding author: Zhang Qian     E-mail: zqvictory2011@yeah.net
Cite this article:
Zhang Qian, Liu Huailiang. An Algorithm of Short Text Classification Based on Semi-supervised Learning[J]. New Technology of Library and Information Service, 2013, 29(2): 30-35.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.02.05      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V29/I2/30