Please wait a minute...
New Technology of Library and Information Service  2013, Vol. 29 Issue (2): 30-35    DOI: 10.11925/infotech.1003-3513.2013.02.05
Current Issue | Archive | Adv Search |
An Algorithm of Short Text Classification Based on Semi-supervised Learning
Zhang Qian, Liu Huailiang
School of Economics and Management, Xidian University, Xi'an 710071, China
Download:
Export: BibTeX | EndNote (RIS)      
Abstract  According to the characteristics of short texts and the bottleneck problem of annotation in dealing with large numbers of unlabeled samples, traditional algorithms of text classification can not be used directly. This paper introduces a method of short text classification based on semi-supervised learning and builds a semi-supervised classification model. It is feasible to accomplish the self-training of the training samples and takes full advantages of the unlabeled parts of training texts by using the initial classifier. The bottleneck problem of annotation is solved and the good performance of classifier is shown. The contrast experiment shows that the algorithm of short text classification based on semi-supervised learning can get better classified effect.
Key wordsSemi-supervised learning      Text classification      Short text      Self-training     
Received: 27 January 2013      Published: 24 April 2013
:  TP391.1  

Cite this article:

Zhang Qian, Liu Huailiang. An Algorithm of Short Text Classification Based on Semi-supervised Learning. New Technology of Library and Information Service, 2013, 29(2): 30-35.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.02.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I2/30

[1] 蒲筱哥. Web自动文本分类技术研究综述[J]. 情报学报, 2009, 28(2): 233-241. (Pu Xiaoge. A Literature Review on Web Automated Text Categorization Technology[J]. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 233-241.)
[2] 苗夺谦, 卫志华. 中文文本信息处理的原理与应用[M]. 北京:清华大学出版社, 2007. (Miao Duoqian, Wei Zhihua. The Theory and Application for Chinese Text Information Processing[M]. Beijing: Tsinghua University Press, 2007. )
[3] 王细薇, 沈云琴. 中文短文本分类方法研究[J]. 现代计算机:专业版, 2010(7): 28-31. (Wang Xiwei, Shen Yunqin. Research on Chinese Short Text Classification Method[J]. Modern Computer, 2010(7): 28-31.)
[4] 范云杰, 刘怀亮. 基于维基百科的中文短文本分类研究[J]. 现代图书情报技术, 2012(3): 47-52. (Fan Yunjie, Liu Huailiang. Research on Chinese Short Text Classification Based on Wikipedia[J]. New Technology of Library and Information Service, 2012(3): 47-52.)
[5] Yih W T, Meek C. Improving Similarity Measures for Short Segments of Text[C]. In: Proceedings of the 22nd National Conference on Artificial Intelligence. 2007: 1489-1494.
[6] 林小俊, 张猛, 暴筱, 等. 基于概念网络的短文本分类方法[J]. 计算机工程, 2010, 36(21): 4-6. (Lin Xiaojun, Zhang Meng, Bao Xiao, et al. Short-text Classification Method Based on Concept Network[J]. Computer Engineering, 2010, 36(21): 4-6.)
[7] 蔡月红, 朱倩, 孙萍, 等. 基于属性选择的半监督短文本分类算法[J]. 计算机应用, 2010, 30(4): 1015-1018. (Cai Yuehong, Zhu Qian, Sun Ping, et al. Semi-supervised Short Text Categorization Based on Attribute Selection[J]. Journal of Computer Applications, 2010, 30(4): 1015-1018.)
[8] 宁亚辉, 樊兴华, 吴渝. 基于领域词语本体的短文本分类[J]. 计算机科学, 2009, 36(3): 142-145. (Ning Yahui, Fan Xinghua, Wu Yu. Short Text Classification Based on Domain Word Ontology[J]. Computer Science, 2009, 36(3): 142-145.)
[9] 王盛, 樊兴华, 陈现麟. 利用上下位关系的中文短文本分类[J]. 计算机应用, 2010, 30(3): 603-606. (Wang Sheng, Fan Xinghua, Chen Xianlin. Chinese Short Text Classification Based on Hyponymy Relation[J]. Journal of Computer Applications, 2010, 30(3): 603-606.)
[10] 白秋产, 金春霞. 概念属性扩展的短文本聚类算法[J]. 长春师范学院学报, 2011, 30(5): 29-33. (Bai Qiuchan, Jin Chunxia. Short Text Clustering Algorithm Based on Concept Feature Expansion[J]. Journal of Changchun Normal University, 2011, 30(5): 29-33.)
[11] 史伟, 王洪伟, 何绍义. 基于微博平台的公众情感分析[J]. 情报学报, 2012, 31(11): 1171-1178. (Shi Wei, Wang Hongwei, He Shaoyi. Study on Public Sentiment Based on Microblogging Platform[J]. Journal of the China Society for Scientific and Technical Information, 2012, 31(11): 1171-1178.)
[12] Banerjee S, Ramanathan K, Gupta A. Clustering Short Texts Using Wikipedia[C]. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2007: 787-788.
[13] Day N E. Estimating the Components of a Mixture of Normal Distributions[J]. Biometrika, 1969, 56(3): 463-474.
[14] Dempster A P, Laird N M, Rubin D B. Maximum Likelihood from Incomplete Data via the EM Algorithm[J]. Journal of the Royal Statistical Society: Series B, 1977, 39(1): 1-38.
[15] Shahshanani B M, Landgrebe D A. The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon[J]. IEEE Transactions on Geoscience and Remote Sensing, 1994, 32(5): 1087-1095.
[16] 秦飞. 基于半监督学习的文本分类研究[D]. 成都:西南交通大学, 2010. (Qin Fei. Research on Document Classification Algorithm Based on Semi-supervised Learning[D]. Chengdu: Southwest Jiaotong University, 2010.)
[17] Nigam K, McCallum A, Mitchell T. Semi-supervised Text Classification Using EM[A]//Semi-supervised Learning[M]. Boston:MIT Press, 2006.
[18] 侯翠琴,焦李成. 基于图的Co-Training网页分类[J]. 电子学报, 2009, 37(10): 2173-2180. (Hou Cuiqin, Jiao Licheng. Graph Based Co-Training Algorithm for Web Page Classification[J]. Acta Electronica Sinica, 2009, 37(10): 2173-2180.)
[19] 郑海清, 林琛, 牛军钰. 一种基于紧密度的半监督文本分类方法[J]. 中文信息学报, 2007, 21(3): 54-60. (Zheng Haiqing, Lin Chen, Niu Junyu. A Closeness-based Semi-supervised Text Classification Method[J]. Journal of Chinese Information Processing, 2007, 21(3): 54-60.)
[20] Vapnik V N. Statistical Learning Theory[M]. Wiley-Interscience, 1998.
[21] [JP3]Blum A, Chawla S. Learning from Labeled and Unlabeled Data Using Graph Mincuts[C]. In: Proceedings of the 18th International Conference on Machine Learning, Williamstown, USA. 2001: 19-26.
[22] Nigam K, McCallum A K, Thrun S, et al. Text Classification from Labeled and Unlabeled Documents Using EM[J]. Machine Learning, 2000, 39(2-3): 103-134.
[23] 张博锋,白冰,苏金树. 基于自训练EM算法的半监督文本分类[J]. 国防科技大学学报, 2007, 29(6): 65-69. (Zhang Bofeng, Bai Bing, Su Jinshu. Semi-supervised Text Classification Based on Self-training EM Algorithm[J]. Journal of National University of Defense Technology, 2007, 29(6): 65-69.)
[24] 陈才扣, 喻以明. 半监督邻近鉴别分析[C]. 见: 2010年第三届计算智能与工业应用国际学术研讨会, 2010:435-438. (Chen Caikou, Yu Yiming. Semi-supervised Neighborhood Discriminant Analysis[C]. In: Proceedings of the 3rd International Conference on Computational Intelligence and Industrial Application, 2010: 435-438.)
[25] Zhu X J. Semi-Supervised Learning Literature Survey [R/OL].[2013-01-13]. http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
[26] Zhou D, Zhang C S. Semi-supervised Learning Using Random Subspace Based Linear Embedding Repulsion Graph[C]. In: Proceedings of the 31st Chinese Control Conference. 2012: 3676-3680.
[27] Zhu X J, Goldberg A B. Introduction to Semi-Supervised Learning[M]. San Rafael, CA: Morgan and Claypool Publishers, 2009: 9-19.
[1] Chen Jie,Ma Jing,Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] Zhou Zeyu,Wang Hao,Zhao Zibo,Li Yueyan,Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information[J]. 数据分析与知识发现, 2021, 5(9): 31-41.
[3] Yu Bengong,Zhu Xiaojie,Zhang Ziwei. A Capsule Network Model for Text Classification with Multi-level Feature Extraction[J]. 数据分析与知识发现, 2021, 5(6): 93-102.
[4] Wu Xu,Chen Chunxu. Detecting Topics of Group Chats with Multiple Strategies[J]. 数据分析与知识发现, 2021, 5(5): 1-9.
[5] Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation[J]. 数据分析与知识发现, 2021, 5(5): 51-58.
[6] Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification with Feature Fusion[J]. 数据分析与知识发现, 2021, 5(10): 1-14.
[7] Tang Xiaobo,Gao Hexuan. Classification of Health Questions Based on Vector Extension of Keywords[J]. 数据分析与知识发现, 2020, 4(7): 66-75.
[8] Wang Sidi,Hu Guangwei,Yang Siyu,Shi Yun. Automatic Transferring Government Website E-Mails Based on Text Classification[J]. 数据分析与知识发现, 2020, 4(6): 51-59.
[9] Xu Yuemei,Liu Yunwen,Cai Lianqiao. Predicitng Retweets of Government Microblogs with Deep-combined Features[J]. 数据分析与知识发现, 2020, 4(2/3): 18-28.
[10] Xu Tongtong,Sun Huazhi,Ma Chunmei,Jiang Lifen,Liu Yichen. Classification Model for Few-shot Texts Based on Bi-directional Long-term Attention Features[J]. 数据分析与知识发现, 2020, 4(10): 113-123.
[11] Bengong Yu,Yumeng Cao,Yangnan Chen,Ying Yang. Classification of Short Texts Based on nLD-SVM-RF Model[J]. 数据分析与知识发现, 2020, 4(1): 111-120.
[12] Weimin Nie,Yongzhou Chen,Jing Ma. A Text Vector Representation Model Merging Multi-Granularity Information[J]. 数据分析与知识发现, 2019, 3(9): 45-52.
[13] Yunfei Shao,Dongsu Liu. Classifying Short-texts with Class Feature Extension[J]. 数据分析与知识发现, 2019, 3(9): 60-67.
[14] Heran Qin,Liu Liu,Bin Li,Dongbo Wang. Automatic Classification of Ancient Classics with Entity Features[J]. 数据分析与知识发现, 2019, 3(9): 68-76.
[15] Guo Chen,Tianxiang Xu. Sentence Function Recognition Based on Active Learning[J]. 数据分析与知识发现, 2019, 3(8): 53-61.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn