The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm

doi:10.11925/infotech.1003-3513.2010.12.08

New Technology of Library and Information Service

2010, Vol. 26, Issue (12): 46-51


Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia

Institute of Scientific & Technical Information of China, Beijing 100038, China


Abstract

Targeting the patent data domain, and taking into account the characteristics of patent documents and the requirements of patent analysis, this paper proposes the IRPU algorithm, an improved method for detecting approximately duplicate attributes and records in patent data that builds on the RFMA and PCM algorithms. The IRPU algorithm is then applied to patent data to detect approximate duplicates in the inventor attribute and in whole records. Experimental comparison with previous work indicates that the proposed method is well suited to the patent data domain and achieves higher identification accuracy.
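The abstract does not spell out the steps of IRPU, so the following Python sketch is only a rough illustration of what approximate duplicate detection on an inventor attribute can look like. It uses a generic token-level matching scheme, not the authors' method. The function names, the prefix rule for abbreviated name parts, and the 0.85 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Similarity of two name tokens in [0, 1].

    Assumption: a token that is a prefix of the other (e.g. an initial
    such as "X." against "Xiaoping") is treated as a full match.
    """
    a, b = a.lower().strip("."), b.lower().strip(".")
    if not a or not b:
        return 0.0
    if a.startswith(b) or b.startswith(a):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def field_similarity(field_a: str, field_b: str) -> float:
    """Average, over tokens of field_a, of the best match found in field_b."""
    tokens_a = field_a.replace(",", " ").split()
    tokens_b = field_b.replace(",", " ").split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(token_similarity(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best) / len(best)


def is_approx_duplicate(field_a: str, field_b: str, threshold: float = 0.85) -> bool:
    """Flag two attribute values as approximate duplicates.

    The score is asymmetric, so the smaller of the two directions is used.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    score = min(field_similarity(field_a, field_b),
                field_similarity(field_b, field_a))
    return score >= threshold


if __name__ == "__main__":
    pairs = [
        ("Lei Xiaoping", "Xiaoping Lei"),  # same tokens, different order
        ("Lei Xiaoping", "Lei X."),        # abbreviated given name
        ("Lei Xiaoping", "Zhang Xu"),      # different inventors
    ]
    for left, right in pairs:
        print(left, "|", right, "->", is_approx_duplicate(left, right))
```

Because tokens are matched regardless of position, reordered name parts still match. The prefix rule makes initials ambiguous (a lone "X." matches any token starting with "x"), which is one reason a real system would need additional evidence before merging records.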

Key words: Data cleaning; Approximately duplicate records; Approximately duplicate attributes; Position coding

Received: 13 September 2010 Published: 07 January 2011

Classification codes: N99; TP311


Cite this article:

Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia. The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm. New Technology of Library and Information Service, 2010, 26(12): 46-51.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.12.08 OR https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I12/46

