Design and Implementation of Internet Information Acquisition System on Overseas Chinese
Xu Xin¹, Huang Zhongqing¹, Deng Sanhong²
¹ Department of Informatics, East China Normal University, Shanghai 200241, China; ² Department of Information Management, Nanjing University, Nanjing 210093, China
This paper proposes an anti-blocking (anti-shielding) solution that integrates several technologies to keep information acquisition from being blocked, improves Web content extraction based on text density, adopts duplicate elimination based on the vector space model (VSM) and the cosine similarity formula, and develops an Internet information acquisition system focused on the subject of overseas Chinese.
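The duplicate-elimination idea named above can be illustrated with a minimal sketch. It is not the system's implementation: the function names, the whitespace-style tokenizer, the raw term-frequency weighting, and the 0.9 threshold are all assumptions chosen for illustration. Chinese text in particular would need a word segmenter in place of the regular expression used here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (a simplified VSM representation).

    Assumption: tokens are runs of word characters; real Chinese text
    would need proper word segmentation instead.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate(pages, threshold=0.9):
    """Keep a page only if its similarity to every kept page is below threshold.

    The threshold value is hypothetical and would be tuned on real data.
    """
    kept, vectors = [], []
    for page in pages:
        vec = term_vector(page)
        if all(cosine_similarity(vec, seen) < threshold for seen in vectors):
            kept.append(page)
            vectors.append(vec)
    return kept
```

Calling `deduplicate` on a list of page texts returns the pages in their original order with near-duplicates of earlier pages dropped. The pairwise comparison is quadratic in the number of kept pages, which is acceptable for a sketch but would need indexing or hashing at crawl scale.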
许鑫, 黄仲清, 邓三鸿. 互联网侨情信息采集系统设计与实现[J]. 现代图书情报技术, 2010, 26(7/8): 95-101.
Xu Xin, Huang Zhongqing, Deng Sanhong. Design and Implementation of Internet Information Acquisition System on Overseas Chinese[J]. New Technology of Library and Information Service, 2010, 26(7/8): 95-101.