Abstract: This paper first analyzes the distribution characteristics of computer education resources on the Web. It then designs a multi-layer classifier that combines topic words with resource forms to solve the topic classification problem in topic crawling, and describes how a Naive Bayes classifier model is used to fuse the classification results precisely and how the resources are stored correctly on the hard disk. Finally, experimental results show that the key design idea is feasible and that performance is acceptable: the topic classification algorithm reaches an average accuracy of 78% and an average recall of 61%, and resource parsing reaches an average accuracy of 81.5%.
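As a rough illustration of the Naive Bayes step mentioned in the abstract, the following is a minimal sketch of a multinomial Naive Bayes topic classifier with add-one (Laplace) smoothing. It is not the authors' implementation: the function names `train_naive_bayes` and `classify`, the token lists, and the labels are all hypothetical, and the multi-layer filtering by topic words and resource forms is not modeled here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Train on a list of (tokens, label) pairs.

    Returns (log_priors, word_counts, vocab). All names are illustrative
    assumptions, not taken from the paper.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> token frequencies
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(label_counts.values())
    log_priors = {
        label: math.log(count / total_docs)
        for label, count in label_counts.items()
    }
    return log_priors, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest posterior log-probability."""
    log_priors, word_counts, vocab = model
    vocab_size = len(vocab)
    best_label, best_score = None, float("-inf")
    for label, log_prior in log_priors.items():
        total_tokens = sum(word_counts[label].values())
        score = log_prior
        for token in tokens:
            if token not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one smoothing keeps unseen (token, label) pairs non-zero.
            likelihood = (word_counts[label][token] + 1) / (total_tokens + vocab_size)
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In a topic crawler, a classifier of this kind would typically be applied to the tokenized text of each fetched resource, with only resources labeled as on-topic being kept for storage and indexing.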
张红斌, 曹义亲. 混合多层分类和朴素贝叶斯模型的垂直搜索引擎分类器设计[J]. 现代图书情报技术, 2011, 27(3): 73-79.
Zhang Hongbin, Cao Yiqin. A New Classifier Design in a Topic Search Engine by Combining Multi-layer Classifier with Naive Bayes Classification Model. New Technology of Library and Information Service, 2011, 27(3): 73-79.