New Technology of Library and Information Service  2015, Vol. 31 Issue (1): 89-95    DOI: 10.11925/infotech.1003-3513.2015.01.13
A Duplicate Removal Algorithm of Cross-database Search Based on Sci-tech Novelty Retrieval
Hao Hui
Beijing University of Technology Library, Beijing 100124, China

[Objective] Remove redundant records from cross-database searches in sci-tech novelty retrieval and improve retrieval efficiency. [Methods] Extract the paper title, serial title, publication date and first author from search records returned by different databases, and build a character string for each record using a modified comparison algorithm based on I-Match; these strings serve as the evidence for duplicate removal. [Results] The duplicate removal algorithm improves retrieval efficiency by analyzing and deduplicating the results retrieved from different databases. The experiment suggests that the precision of the algorithm is high, while its recall could be improved by modifying database records. [Limitations] The effectiveness depends on the four features extracted from database search records; because search results differ across databases, a different feature extraction model needs to be customized for each paper database. [Conclusions] The experiment suggests that the algorithm achieves decent duplicate-removal precision and processing efficiency, which meets the requirements of sci-tech novelty retrieval.
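
The approach described under [Methods] can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not specify how the I-Match comparison was modified, so the normalization rule, the field names (`title`, `journal`, `year`, `first_author`), the SHA-1 hash and the sample records below are all assumptions made for illustration. The sketch normalizes the four fields, joins them into one string per record, hashes that string into a signature, and keeps only the first record seen for each signature.

```python
import hashlib
import re
import unicodedata

# Hypothetical field names; real databases label these fields differently.
FIELDS = ("title", "journal", "year", "first_author")


def normalize(text):
    """Apply NFKC, lowercase, and drop everything except letters and digits."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\W_]+", "", text)


def fingerprint(record):
    """Hash the four normalized fields into one signature per record."""
    key = "|".join(normalize(record.get(field, "")) for field in FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def remove_duplicates(records):
    """Keep the first record seen for each signature, preserving order."""
    seen = set()
    unique = []
    for record in records:
        signature = fingerprint(record)
        if signature not in seen:
            seen.add(signature)
            unique.append(record)
    return unique


# Made-up records standing in for results exported from two databases.
records = [
    {"title": "A Sample Paper on Duplicate Detection", "journal": "Example Journal",
     "year": "2014", "first_author": "Zhang San", "source": "database_a"},
    {"title": "A sample paper on duplicate detection.", "journal": "EXAMPLE JOURNAL",
     "year": "2014", "first_author": "Zhang, San", "source": "database_b"},
    {"title": "Another Sample Paper", "journal": "Example Journal",
     "year": "2013", "first_author": "Li Si", "source": "database_a"},
]
unique = remove_duplicates(records)
```

Because an exact hash match is required, a sketch like this favors precision over recall: records whose fields differ by more than case, spacing or punctuation (for example, an abbreviated serial title) are not merged, which is consistent with the abstract's remark that recall could be improved by modifying database records.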

Key words: Cross-database search; Sci-tech novelty retrieval; Duplicate removal algorithm; I-Match
Received: 21 July 2014      Published: 12 February 2015
CLC number: G250

Cite this article:

Hao Hui. A Duplicate Removal Algorithm of Cross-database Search Based on Sci-tech Novelty Retrieval. New Technology of Library and Information Service, 2015, 31(1): 89-95.


