New Technology of Library and Information Service  2008, Vol. 24 Issue (10): 59-68    DOI: 10.11925/infotech.1003-3513.2008.10.12
Automatic Classification Based on News Titles for Chinese News Web Pages
Qian Aibing1, Jiang Lan2
1(School of Economy and Commercial Management, Nanjing University of Chinese Medicine, Nanjing 210046, China)
2(Department of Information Management, Nanjing University, Nanjing 210093, China)

This paper describes a method for automatically classifying Chinese news Web pages by their titles. Title terms are weighted with a tf-idf scheme, and a correlation degree computed between each news title and each category determines the most appropriate category for the page. Performance is evaluated in terms of top-one, top-two, and top-three scores. The experiments show that the improved tf-idf weighting scheme, which takes categories into account, classifies Chinese news Web pages with high accuracy.
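The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that titles are already segmented into words, that tf is computed per category, that idf is a smoothed inverse category frequency, that the correlation degree is the sum of the weights of the title's terms, and that the top-k score is the fraction of titles whose true category appears among the k highest-ranked categories. The function names and the toy English tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_category_weights(labeled_titles):
    """Build {category: {term: weight}} from (tokens, category) pairs.

    Assumed weighting: term frequency within the category multiplied by
    a smoothed inverse category frequency. This is an illustrative
    choice, not the scheme defined in the paper.
    """
    term_counts = defaultdict(Counter)
    for tokens, category in labeled_titles:
        term_counts[category].update(tokens)

    n_categories = len(term_counts)
    # Number of categories in which each term occurs at least once.
    category_freq = Counter(
        term for counts in term_counts.values() for term in counts
    )

    weights = {}
    for category, counts in term_counts.items():
        total = sum(counts.values())
        weights[category] = {
            term: (count / total)
            * math.log((1 + n_categories) / category_freq[term])
            for term, count in counts.items()
        }
    return weights


def rank_categories(tokens, weights):
    """Rank categories by the assumed correlation degree of a title."""
    scores = {
        category: sum(term_weights.get(t, 0.0) for t in tokens)
        for category, term_weights in weights.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda c: (-scores[c], c))


def top_k_score(test_titles, weights, k):
    """Fraction of titles whose true category is in the top k ranks."""
    hits = sum(
        1
        for tokens, category in test_titles
        if category in rank_categories(tokens, weights)[:k]
    )
    return hits / len(test_titles)


# Hypothetical, pre-segmented toy titles (not data from the paper).
train_titles = [
    (["stock", "market", "rises"], "finance"),
    (["bank", "cuts", "rates"], "finance"),
    (["team", "wins", "final"], "sports"),
    (["coach", "praises", "team"], "sports"),
    (["new", "phone", "released"], "technology"),
]
test_titles = [
    (["team", "wins", "market"], "sports"),
    (["phone", "stock", "market", "rises"], "technology"),
]

weights = train_category_weights(train_titles)
for k in (1, 2, 3):
    print(f"top-{k} score:", top_k_score(test_titles, weights, k))
```

Real Chinese titles would first need word segmentation, which this sketch leaves out.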

Key words: tf-idf; News title; Chinese news Web pages; Automatic classification
Received: 02 July 2008      Published: 25 October 2008



Corresponding author: Qian Aibing
About the authors: Qian Aibing, Jiang Lan

Cite this article:

Qian Aibing, Jiang Lan. Automatic Classification Based on News Titles for Chinese News Web Pages. New Technology of Library and Information Service, 2008, 24(10): 59-68.

