This paper implements a text categorization system based on the Vector Space Model (VSM) and Naive Bayes (NB). When estimating the category, the authors improve the accuracy of the parent-category decision by correcting it with the subcategory result, and they determine whether a document belongs to multiple categories and carries multiple labels by comparing how close the classifier’s final values are to one another. The experiments show that VSM outperforms NB for text representation: MicroF1 is 25.2 percent higher for parent categories and 26.3 percent higher for subcategories.
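The multi-label decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity scoring against category centroids, the relative `margin` threshold, the function names `cosine` and `assign_labels`, and the toy `CENTROIDS` data are all assumptions made for illustration. The idea shown is only that a document receives every label whose final score lies close enough to the best score.

```python
import math
from collections import Counter

# Hypothetical toy category centroids (term -> weight); not data from the paper.
CENTROIDS = {
    "sports": Counter({"match": 3, "team": 2}),
    "finance": Counter({"stock": 3, "market": 2}),
}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_labels(doc_tokens, centroids, margin=0.1):
    """Return every category whose score is within `margin` (relative) of the best.

    `margin` is an assumed tuning parameter; the paper does not state its threshold.
    """
    doc_vec = Counter(doc_tokens)  # raw term frequencies as the VSM weights
    scores = {cat: cosine(doc_vec, vec) for cat, vec in centroids.items()}
    best = max(scores.values())
    if best == 0.0:
        return []  # no overlap with any category
    return sorted(cat for cat, s in scores.items() if best - s <= margin * best)
```

Calling `assign_labels(["match", "team", "stock", "market"], CENTROIDS)` scores both toy categories equally, so the document receives both labels, whereas a document that matches only one centroid receives a single label. A probabilistic (NB) variant would compare posterior values in the same way, with a threshold chosen for that score scale.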
刘华. 文本分类相似度模型和概率模型的实现与比较[J]. 现代图书情报技术, 2006, 1(4): 53-55.
Liu Hua. Implementation and Comparison of Similarity and Probabilistic Mode in Text Categorization. New Technology of Library and Information Service, 2006, 1(4): 53-55.