New Technology of Library and Information Service  2010, Vol. 26 Issue (7/8): 95-101    DOI: 10.11925/infotech.1003-3513.2010.07-08.17
Design and Implementation of Internet Information Acquisition System on Overseas Chinese
Xu Xin, Huang Zhongqing, Deng Sanhong2
1(Department of Informatics, East China Normal University, Shanghai 200241, China)
2(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract  

This paper proposes an anti-shielding solution that integrates several technologies, improves Web content extraction based on text density, eliminates duplicate pages using the vector space model (VSM) and the cosine angle formula, and develops a subject-oriented Internet information acquisition system on overseas Chinese.
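
The duplicate elimination step named in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes plain term-frequency vectors, whitespace tokenization (Chinese text would first need word segmentation), and a similarity threshold of 0.9. The function names `term_vector`, `cosine_similarity` and `deduplicate` are hypothetical, as is the threshold value.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate(pages, threshold=0.9):
    """Keep a page only if it is below the threshold against every kept page."""
    kept, kept_vectors = [], []
    for page in pages:
        vec = term_vector(page)
        if all(cosine_similarity(vec, kv) < threshold for kv in kept_vectors):
            kept.append(page)
            kept_vectors.append(vec)
    return kept
```

Calling `deduplicate` on a list of extracted page texts returns the texts in their original order with near-duplicates dropped. The pairwise comparison is quadratic in the number of pages, so a larger collection would likely need an index or a hashing step in front of it.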

Key words: Internet information; Information acquisition; Text extraction; Overseas Chinese information
Received: 03 June 2010      Published: 19 September 2010
CLC Number: G354

Corresponding Authors: Xu Xin     E-mail: xxu@infor.ecnu.edu.cn
About authors: Xu Xin, Huang Zhongqing, Deng Sanhong

Cite this article:

Xu Xin, Huang Zhongqing, Deng Sanhong. Design and Implementation of Internet Information Acquisition System on Overseas Chinese. New Technology of Library and Information Service, 2010, 26(7/8): 95-101.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.07-08.17     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I7/8/95

