New Technology of Library and Information Service  2008, Vol. 24 Issue (3): 55-61    DOI: 10.11925/infotech.1003-3513.2008.03.10
An Algorithm for Detecting Duplicated Chinese Web News Based on Suffix Tree
Qian Aibing   Jiang Lan
(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract  

To address the shortcomings of traditional methods for analyzing public opinion, this paper proposes a new approach to public opinion analysis on the Web and designs a model for it. Experiments show that the proposed model is an effective solution for analyzing public opinion on the Web.

Key words: Suffix Tree; Duplicated Web Page; Ukkonen Algorithm; Matching Statistics Algorithm
Received: 03 December 2007      Published: 25 March 2008
CLC Number: TP391; G202
Corresponding Authors: Qian Aibing     E-mail: happyfate2001@yahoo.com.cn
About authors: Qian Aibing, Jiang Lan

Cite this article:

Qian Aibing, Jiang Lan. An Algorithm for Detecting Duplicated Chinese Web News Based on Suffix Tree. New Technology of Library and Information Service, 2008, 24(3): 55-61.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.03.10     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I3/55

