The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm

doi:10.11925/infotech.1003-3513.2010.12.08

New Technology of Library and Information Service (现代图书情报技术)

2010, Vol. 26, Issue 12: 46-51. https://doi.org/10.11925/infotech.1003-3513.2010.12.08

Information Analysis and Research


Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia

Institute of Scientific & Technical Information of China, Beijing 100038, China


Abstract

Targeting patent data, and taking into account the characteristics of patent documents and the requirements of patent analysis, this paper proposes IRPU, an improved algorithm built on the RFMA and PCM algorithms for detecting approximately duplicate attributes and records in patent data. The algorithm is then applied to patent data to detect approximate duplicates in the inventor attribute and in whole records. Experimental comparison with previous work indicates that the proposed method is well suited to patent data and achieves higher identification accuracy.

Keywords: Data cleaning, Approximately duplicate records, Approximately duplicate attributes, Position coding, Patent

Received: 2010-09-13; Published: 2011-01-07

Classification number: N99; TP311

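Funding:

This work is one of the research outcomes of the China Postdoctoral Science Foundation project "Research on a Patent Analysis System for Strategic Technology Management" (Grant No. 20100470389).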

Cite this article:

Lei Xiaoping, Zhang Xu, Zhao Yunhua, Zheng Jia. The Method of Patent Data Approximately Duplicate Attributes and Records Detecting Based on IRPU Algorithm[J]. New Technology of Library and Information Service, 2010, 26(12): 46-51.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2010.12.08 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2010/V26/I12/46

