Taking into account the characteristics of patent documents and the requirements of patent analysis, this paper proposes the IRPU algorithm, an improved method for detecting approximately duplicate attributes and records in patent data that builds on the RFMA and PCM algorithms. The IRPU algorithm is then applied to patent data to detect approximate duplicates in the inventor attribute and in whole records. Experimental comparison with previous work indicates that the proposed method is well suited to patent data and achieves higher identification accuracy.
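The abstract does not spell out RFMA or describe how IRPU works. As an illustration only, the sketch below assumes RFMA refers to a recursive field matching approach of the kind described by Monge and Elkan, and shows a one-level, token-based simplification of that idea applied to inventor-name strings. It is not the IRPU algorithm; the function names and sample names are hypothetical.

```python
import re


def tokens(field):
    """Split a field into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", field.lower())


def atom_match(a, b):
    """Score 1.0 if the tokens are equal or one is a prefix of the other.

    Treating a prefix as an abbreviation is an assumption of this sketch.
    """
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0


def field_similarity(field_a, field_b):
    """Average, over the tokens of field_a, of the best match found in field_b.

    The score is directional: swapping the arguments can change the result.
    """
    ta, tb = tokens(field_a), tokens(field_b)
    if not ta or not tb:
        return 0.0
    return sum(max(atom_match(a, b) for b in tb) for a in ta) / len(ta)


if __name__ == "__main__":
    # Hypothetical inventor-name variants, not data from the paper.
    print(field_similarity("Smith, J.", "John Smith"))
    print(field_similarity("John Smith", "Mary Jones"))
```

Because token order and punctuation are ignored, "Smith, J." and "John Smith" are scored as a full match. An initial also matches any given name starting with that letter, so a real detector would likely combine this score with other attributes before declaring two records duplicates.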
Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia. The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm [J]. New Technology of Library and Information Service (现代图书情报技术), 2010, 26(12): 46-51.