New Technology of Library and Information Service, 2008, Vol. 24, Issue 3: 55-61     https://doi.org/10.11925/infotech.1003-3513.2008.03.10
Section: Knowledge Organization and Knowledge Management
An Algorithm for Detecting Duplicated Chinese Web News Based on Suffix Tree
Qian Aibing, Jiang Lan
(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract

To address the shortcomings of traditional methods for identifying duplicated Chinese news Web pages, this paper proposes a detection algorithm that uses the suffix tree as its basic data structure and exploits two characteristics of news pages: their titles and their publication times. The algorithm builds on the Ukkonen algorithm and the Matching Statistics algorithm and optimizes their concrete implementation. Experimental results show that the algorithm is effective and that it also offers insight into computing string similarity.

Key words: Suffix Tree; Duplicated Web Page; Ukkonen Algorithm; Matching Statistics Algorithm
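
The abstract names matching statistics over a suffix tree as the basis of the method but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a naive quadratic-time suffix trie in place of Ukkonen's linear-time construction, and the function names, the `min_len` threshold, and the coverage-based similarity score are all hypothetical choices made for illustration.

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie as nested dicts.

    Assumption: this stands in for a real suffix tree; it is NOT
    Ukkonen's linear-time construction.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def matching_statistics(trie, query):
    """ms[i] = length of the longest prefix of query[i:] that occurs
    somewhere in the indexed text (computed by restarting at the root
    for every i, without the suffix-link speedup)."""
    ms = []
    for i in range(len(query)):
        node, length = trie, 0
        for ch in query[i:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        ms.append(length)
    return ms


def coverage_similarity(reference, candidate, min_len=3):
    """Hypothetical score: fraction of candidate characters covered by
    matches of length >= min_len against the reference text."""
    if not candidate:
        return 0.0
    ms = matching_statistics(build_suffix_trie(reference), candidate)
    covered = [False] * len(candidate)
    for i, length in enumerate(ms):
        if length >= min_len:
            for j in range(i, i + length):
                covered[j] = True
    return sum(covered) / len(candidate)
```

Under these assumptions, two news pages could be flagged as likely duplicates when `coverage_similarity` of their body texts exceeds some threshold; how the paper actually combines title, time, and string similarity is not stated in the abstract.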
Received: 2007-12-03      Published: 2008-03-25
Classification: TP391; G202
Corresponding author: Qian Aibing     E-mail: happyfate2001@yahoo.com.cn
About the authors: Qian Aibing, Jiang Lan
Cite this article:
Qian Aibing, Jiang Lan. An Algorithm for Detecting Duplicated Chinese Web News Based on Suffix Tree. New Technology of Library and Information Service, 2008, 24(3): 55-61.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.03.10      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I3/55


Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn