Design and Implementation of Internet Information Acquisition System on Overseas Chinese*

doi:10.11925/infotech.1003-3513.2010.07-08.17

New Technology of Library and Information Service

2010, Vol. 26, Issue (7/8): 95-101

Information Analysis and Research

Xu Xin¹, Huang Zhongqing¹, Deng Sanhong²

¹ (Department of Informatics, East China Normal University, Shanghai 200241, China)
² (Department of Information Management, Nanjing University, Nanjing 210093, China)


Abstract

This paper adopts an Internet topic-specific information acquisition strategy that combines general-purpose and vertical search engines, proposes an anti-blocking solution for Web crawling that combines several anti-blocking techniques, improves a text-density-based method for extracting the main text of Web pages, implements content-based title deduplication using a word-segmentation-based vector space model (VSM) and the cosine similarity formula, and designs an Internet topic-specific information acquisition system for overseas Chinese affairs information.

Keywords: Internet information, information acquisition, text extraction, overseas Chinese information
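
The title deduplication step named in the abstract (a vector space model over segmented words, compared with the cosine formula) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `tokenize` helper, the raw term-frequency weighting, and the `threshold` value of 0.8 are all hypothetical choices, and a real system for Chinese titles would replace `tokenize` with a proper word segmenter.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Hypothetical stand-in for a word segmenter: lowercase word tokens.
    # Unsegmented Chinese text would need a real segmenter here.
    return re.findall(r"\w+", title.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of two titles."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the terms the two titles share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # An empty title has a zero vector; treat it as dissimilar to everything.
    return dot / norm if norm else 0.0


def deduplicate(titles, threshold=0.8):
    """Keep a title only if its similarity to every kept title is below threshold."""
    kept = []
    for title in titles:
        if all(cosine_similarity(title, k) < threshold for k in kept):
            kept.append(title)
    return kept
```

Calling `deduplicate` on a list of crawled titles returns the titles in their original order with near-duplicates of earlier entries dropped. The threshold controls how aggressive the filtering is and would need tuning on real data.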

Received: 2010-06-03    Published: 2010-09-19

G354

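Funding:

This work is one of the research outcomes of the Overseas Chinese Affairs Office of the State Council project "Intelligent Service Platform for Online Overseas Chinese Affairs Information" (No. GQBQ2009052), the Ministry of Education Humanities and Social Sciences Research project "Research on Analysis and Management Mechanisms for Internet Public Opinion Information" (No. 08JC870003), and the Shanghai Social Science Planning project "Research on Online Public Opinion Feedback on Open Government Information" (No. 2009ETQ001).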

Corresponding author: Xu Xin    E-mail: xxu@infor.ecnu.edu.cn

About the authors: Xu Xin, Huang Zhongqing, Deng Sanhong

Cite this article:

Xu Xin, Huang Zhongqing, Deng Sanhong. Design and Implementation of Internet Information Acquisition System on Overseas Chinese[J]. New Technology of Library and Information Service, 2010, 26(7/8): 95-101.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2010.07-08.17 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2010/V26/I7/8/95

