New Technology of Library and Information Service  2006, Vol. 1 Issue (4): 53-55    DOI: 10.11925/infotech.1003-3513.2006.04.13
Implementation and Comparison of Similarity and Probabilistic Mode in Text Categorization
Liu Hua
(College of Chinese Language and Culture of Jinan University, Guangzhou 510610, China)

This paper implements a text categorization system based on the Vector Space Model (VSM) and Naive Bayes (NB). When assigning categories, the authors improve parent-category accuracy by correcting it with the subcategory result, and decide whether a document belongs to multiple categories (multi-label) by examining how close the classifiers' final scores are. The experiments show that VSM outperforms NB in text representation: MicroF1 increases by 25.2 percent for parent categories and by 26.3 percent for subcategories.
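The multi-label decision described above can be illustrated with a minimal sketch. It assumes a centroid-style VSM classifier with cosine similarity and raw term-frequency weights; the function names (`cosine`, `assign_labels`), the `gap` threshold, and the toy centroids are hypothetical and are not taken from the paper, whose exact weighting and decision rule are not given in the abstract.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_labels(doc_tokens, centroids, gap=0.05):
    """Score every category, then keep each category whose score lies
    within `gap` of the best score (assumed multi-label rule)."""
    # Raw term frequencies stand in for the unspecified term weighting.
    doc_vec = Counter(doc_tokens)
    scores = {c: cosine(doc_vec, v) for c, v in centroids.items()}
    best = max(scores.values())
    if best == 0.0:
        # No term overlap with any category: assign nothing.
        return [], scores
    labels = [c for c, s in scores.items() if best - s <= gap]
    return labels, scores

# Toy category centroids (illustrative values only).
centroids = {
    "sports": {"match": 3.0, "team": 2.0, "goal": 2.0},
    "finance": {"stock": 3.0, "market": 2.0, "bank": 2.0},
}

single, _ = assign_labels(["match", "team", "goal"], centroids)
multi, _ = assign_labels(["team", "market"], centroids)
```

With these toy centroids, the first document is scored clearly higher for one category, while the second scores the same for both and so receives both labels. A Naive Bayes variant could apply the same gap rule to its log-posterior scores, although a threshold on log-probabilities would need a different scale.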

Key words: Text categorization; Vector space model; Naive Bayes
Received: 12 January 2006      Published: 25 April 2006


Corresponding author: Liu Hua

Cite this article:

Liu Hua. Implementation and Comparison of Similarity and Probabilistic Mode in Text Categorization. New Technology of Library and Information Service, 2006, 1(4): 53-55.


