Automatic Classification Based on News Titles for Chinese News Web Pages
Qian Aibing 1, Jiang Lan 2
1 (School of Economy and Commercial Management, Nanjing University of Chinese Medicine, Nanjing 210046, China)
2 (Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract This paper describes the automatic classification of Chinese news Web pages using their news titles, based on a tf-idf weighting scheme, and constructs a correlation degree for each news title that determines the appropriate category for the news Web page. The performance of the proposed method is evaluated in terms of top-one, top-two, and top-three scores. The experimental evaluation demonstrates that an improved tf-idf weighting scheme with categories achieves high accuracy in classifying Chinese news Web pages.
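The abstract does not give the paper's formulas, so the following is only a minimal Python sketch of the general idea, under stated assumptions: titles are already segmented into words, each category is treated as one "document" when computing tf-idf, and a title's correlation degree with a category is taken to be the sum of the category's term weights over the title's words. The function names, the smoothing in the idf term, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_category_weights(labeled_titles):
    """Build per-category term weights from (tokens, category) pairs.

    Assumption: each category is treated as one document, so the "idf"
    part counts in how many categories a term occurs.
    """
    term_counts = defaultdict(Counter)
    for tokens, category in labeled_titles:
        term_counts[category].update(tokens)

    n_categories = len(term_counts)
    category_freq = Counter()
    for counts in term_counts.values():
        category_freq.update(counts.keys())

    weights = {}
    for category, counts in term_counts.items():
        total = sum(counts.values())
        weights[category] = {
            # tf within the category times a smoothed inverse category frequency
            term: (freq / total) * math.log((n_categories + 1) / category_freq[term])
            for term, freq in counts.items()
        }
    return weights


def rank_categories(tokens, weights):
    """Rank categories by the (assumed) correlation degree of a title."""
    scores = {
        category: sum(term_weights.get(t, 0.0) for t in tokens)
        for category, term_weights in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def top_k_score(test_set, weights, k):
    """Fraction of titles whose true category is among the top k ranked ones."""
    hits = sum(
        1 for tokens, category in test_set
        if category in rank_categories(tokens, weights)[:k]
    )
    return hits / len(test_set)


# Toy, pre-segmented titles (hypothetical data for illustration only).
training = [
    (["football", "match", "win"], "sports"),
    (["basketball", "match", "final"], "sports"),
    (["stock", "market", "fall"], "finance"),
    (["bank", "rate", "market"], "finance"),
    (["phone", "chip", "launch"], "tech"),
    (["software", "market", "launch"], "tech"),
]
test_set = [
    (["football", "final"], "sports"),
    (["stock", "market"], "finance"),
    (["market", "launch"], "tech"),
    (["market"], "tech"),
]

weights = train_category_weights(training)
for k in (1, 2, 3):
    print(f"top-{k} score: {top_k_score(test_set, weights, k):.2f}")
```

In this toy setup the ambiguous title containing only "market" is ranked under finance first and tech second, which is the kind of case where a top-two or top-three score credits a classification that a top-one score counts as a miss.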
Received: 02 July 2008
Published: 25 October 2008
Corresponding Author:
Qian Aibing
E-mail: happyfate2001@yahoo.com.cn
About authors: Qian Aibing, Jiang Lan