A Text Categorization System with C#
Liu Hua
(College of Chinese Language and Culture / Center for Overseas Huayu Research, Jinan University, Guangzhou 510610, China)
Abstract: A multilayer, multi-class text categorization system was built on the Vector Space Model (VSM) and Naive Bayes (NB). Four modules are described in detail: word segmentation and frequency statistics; computing the similarity between each category and the document; improving the accuracy of the parent-category decision by correcting it with the sub-category result; and judging whether a document belongs to multiple categories and carries multiple labels. With text representation based on the Vector Space Model, MicroF1 is 89.7% for parent categories and 77.8% for sub-categories; with text representation based on Naive Bayes, MicroF1 is 67.6% for parent categories and 66.5% for sub-categories.
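
The category-document comparison step can be illustrated with a minimal sketch. It is written in Python rather than the system's C#, and it assumes raw term-frequency weights, cosine similarity, and one profile vector per category; the abstract does not state which weighting or similarity measure the system uses. The names `term_vector`, `cosine`, `classify`, and the toy category profiles are all hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Map a list of already-segmented words to a term-frequency vector."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(doc_tokens, class_vectors):
    """Score a document against every category profile; return (best, scores)."""
    doc = term_vector(doc_tokens)
    scores = {label: cosine(doc, vec) for label, vec in class_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy category profiles (assumption, not data from the paper).
class_vectors = {
    "sports": term_vector(["match", "team", "goal", "team"]),
    "finance": term_vector(["stock", "market", "bank", "market"]),
}

label, scores = classify(["team", "goal", "market"], class_vectors)
```

In a hierarchical setting, the same scoring could be run once over parent categories and once over the sub-categories, and a multi-label decision could keep every category whose score is close to the best one; both are possible readings of the modules listed above, not details confirmed by the abstract.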
Received: 27 January 2007
Published: 25 March 2007
Corresponding Authors:
Liu Hua
E-mail: liuhua0461@sina.com
About author: Liu Hua