[1] Chakrabarti S, Berg M V D, Dom B. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery[C]. In: Proceedings of the 8th International World Wide Web Conference, Toronto, Canada. 1999.
[2] Aggarwal C C, Al-Garawi F, Yu P S. Intelligent Crawling on the World Wide Web with Arbitrary Predicates[C]. In: Proceedings of the 10th International World Wide Web Conference, Hong Kong. 2001.
[3] Menczer F, Pant G, Srinivasan P, et al. Evaluating Topic-Driven Web Crawlers[C]. In: Proceedings of the 24th Annual International ACM/SIGIR Conference, New Orleans, Louisiana, USA. 2001.
[4] Nie Z, Zhang Y, Wen J R, et al. Object-Level Ranking: Bringing Order to Web Objects[C]. In: Proceedings of the 14th International Conference on World Wide Web. 2005: 567-574.
[5] Microsoft Academic Search[EB/OL]. [2010-03-20]. http://academic.research.microsoft.com.
[6] 吴清江, 吴政, 刘琳琅. 面向侨务信息主题的搜索引擎系统[J]. 华侨大学学报:自然科学版, 2006, 27(4): 429-432.
[7] Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117.
[8] Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[9] 王永成. 中文信息处理技术及其基础[M]. 上海:上海交通大学出版社, 1990.
[10] Koehler W. An Analysis of Web Page and Web Site Constancy and Permanence[J]. Journal of the American Society for Information Science, 1999, 50(2): 162-180.
[11] Liu L, Pu C, Han W. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources[C]. In: Proceedings of the 16th IEEE International Conference on Data Engineering, San Diego. 2000: 611-620.
[12] Lerman K, Knoblock C, Minton S. Automatic Data Extraction from Lists and Tables in Web Sources[C]. In: Proceedings of the Workshop on Adaptive Text Extraction and Mining, Menlo Park. 2001.
[13] 王琦, 唐世渭, 杨冬清, 等. 基于DOM的网页主题信息自动提取[J]. 计算机研究与发展, 2004, 41(10): 1786-1792.
[14] 崔继馨, 张鹏, 杨文柱. 基于DOM的Web信息抽取[J]. 河北农业大学学报, 2005, 28(3): 90-93.
[15] 孙承杰, 关毅. 基于统计的网页正文信息抽取方法的研究[J]. 中文信息学报, 2004, 18(5): 17-22.
[16] Cai D, Yu S, Wen J, et al. VIPS: A Vision-based Page Segmentation Algorithm[R]. Microsoft Technical Report, MSR-TR-2003-79. 2003.
[17] The Easy Way to Extract Useful Text from Arbitrary HTML[EB/OL]. 2007. [2010-03-20]. http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html.
[18] 宁力. 搜索引擎中网页查重方法的研究[D]. 北京: 北京化工大学, 2007.
[19] 钱爱兵, 江岚. 基于后缀树的中文新闻重复网页识别算法[J]. 现代图书情报技术, 2008(3): 55-61.
[20] Bun K K, Ishizuka M. Topic Extraction from News Archive Using TF*PDF Algorithm[C]. In: Proceedings of the 3rd International Conference on Web Information Systems Engineering. Singapore: IEEE CS Press, 2002: 73-82.