The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm
Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract: Targeting the patent data domain, and taking into account the characteristics of patent documents and the requirements of patent analysis, this paper proposes the IRPU algorithm, an improved method for detecting approximately duplicate attributes and records in patent data that builds on the RFMA and PCM algorithms. The IRPU algorithm is then applied to patent data to detect approximate duplicates in the inventor attribute and in whole records. Experimental comparison with previous work indicates that the proposed method is well suited to the patent data domain and achieves higher identification accuracy.
Received: 13 September 2010
Published: 07 January 2011