[Objective] Remove the redundant records produced by cross-database searching in sci-tech novelty retrieval and improve retrieval efficiency. [Methods] Extract the paper title, serial title, publication date, and first author from the search records of different databases, build a character string for each record from these four fields, and compare the strings with a modified I-Match-style comparison algorithm as the basis for duplicate removal. [Results] The duplicate removal algorithm improves retrieval efficiency by analyzing and deduplicating the retrieval results from different databases. The experiment suggests that the precision of the algorithm is high, while its recall could be improved by modifying database records. [Limitations] The effectiveness depends on the four features extracted from the database search records; because search results differ across databases, a different feature extraction model must be customized for each paper database. [Conclusions] The experiment suggests that the algorithm achieves good duplicate-removal precision and processing efficiency, which meets the requirements of sci-tech novelty retrieval.
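The approach summarized under [Methods] can be illustrated with a minimal sketch. It is not the paper's implementation: the field names, the normalization rules, and the use of a SHA-1 digest as the record fingerprint are assumptions made for illustration, and the paper's specific modifications to the I-Match comparison are not reproduced here.

```python
import hashlib
import re
import unicodedata

# Hypothetical field names; real database exports will use their own.
FIELDS = ("title", "serial_title", "date", "first_author")


def normalize(text):
    """Apply NFKC normalization, lowercase, and drop punctuation and spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    # \W is Unicode-aware for str patterns, so CJK characters are kept.
    return re.sub(r"[\W_]+", "", text)


def fingerprint(record):
    """Build one string from the four fields and hash it (assumed SHA-1)."""
    key = "|".join(normalize(record.get(field, "")) for field in FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique


rec_a = {"title": "A Duplicate Removal Algorithm",
         "serial_title": "New Technology of Library and Information Service",
         "date": "2015", "first_author": "Hao Hui"}
rec_b = {"title": "a duplicate-removal  algorithm.",
         "serial_title": "NEW TECHNOLOGY OF LIBRARY AND INFORMATION SERVICE",
         "date": "2015", "first_author": "Hao, Hui"}
rec_c = {"title": "Syntactic Clustering of the Web",
         "serial_title": "WWW Conference", "date": "1997",
         "first_author": "Broder A Z"}

merged = deduplicate([rec_a, rec_b, rec_c])
```

Exact matching on a normalized key favors precision over recall: records whose fields differ in substance between databases (for example, a date given as a year in one source and as a full date in another) hash to different fingerprints and are not merged. This is one reason per-database feature extraction, as noted under [Limitations], matters in practice.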
Hao Hui. A Duplicate Removal Algorithm of Cross-database Search Based on Sci-tech Novelty Retrieval [J]. New Technology of Library and Information Service, 2015, 31(1): 89-95.