New Technology of Library and Information Service  2013, Vol. 29 Issue (2): 30-35    DOI: 10.11925/infotech.1003-3513.2013.02.05
An Algorithm of Short Text Classification Based on Semi-supervised Learning
Zhang Qian, Liu Huailiang
School of Economics and Management, Xidian University, Xi'an 710071, China
Abstract  Traditional text classification algorithms cannot be applied directly to short texts, both because of the characteristics of short texts and because annotation becomes a bottleneck when large numbers of samples are unlabeled. This paper introduces a short text classification method based on semi-supervised learning and builds a semi-supervised classification model. Starting from an initial classifier, the method self-trains on the training samples and takes full advantage of the unlabeled portion of the training texts. This resolves the annotation bottleneck and yields a classifier with good performance. A comparative experiment shows that the semi-supervised short text classification algorithm achieves better classification results.
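The self-training idea summarized in the abstract can be illustrated with a minimal sketch. This is a generic self-training loop, not the authors' algorithm: the multinomial naive Bayes base classifier, the confidence threshold of 0.6, the function names (train_nb, predict_proba, classify, self_train), and the toy English data are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return {label: posterior} using Laplace smoothing; unknown words are skipped."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok not in vocab:
                continue
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: v / norm for label, v in exp_scores.items()}


def classify(model, tokens):
    """Return the most probable label for a tokenized document."""
    probs = predict_proba(model, tokens)
    return max(probs, key=probs.get)


def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    """Generic self-training: repeatedly pseudo-label confident samples and retrain.

    The threshold and round limit are illustrative assumptions, not values from the paper.
    """
    docs = [tokens for tokens, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    model = train_nb(docs, labels)  # initial classifier from labeled data only
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                docs.append(tokens)  # accept the pseudo-label
                labels.append(label)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:
            break  # nothing confident enough; stop early
        model = train_nb(docs, labels)
    return model


# Toy data (hypothetical): two labeled short texts and three unlabeled ones.
labeled = [("cheap flight ticket".split(), "travel"),
           ("football match score".split(), "sport")]
unlabeled = ["flight hotel booking".split(),
             "match goal score".split(),
             "hotel booking deal".split()]

model = self_train(labeled, unlabeled)
print(classify(model, "hotel deal".split()))
```

In this toy run, the third unlabeled text shares no words with the labeled data, so it can only be pseudo-labeled after the first round has added "hotel" and "booking" to the training set. This is the sense in which unlabeled texts extend what the initial classifier learned from a small labeled set.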
Key words  Semi-supervised learning      Text classification      Short text      Self-training
Received: 27 January 2013      Published: 24 April 2013
:  TP391.1  

Cite this article:

Zhang Qian, Liu Huailiang. An Algorithm of Short Text Classification Based on Semi-supervised Learning. New Technology of Library and Information Service, 2013, 29(2): 30-35.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.02.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I2/30

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn