Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue 2: 33-43     https://doi.org/10.11925/infotech.2096-3467.2022.1340
Research Paper
Computing Patent Similarity Based on Hierarchical Feature of Claims*
Xiang Shuxuan1, Cao Yujie2, Mao Jin3,4
1Laboratory of Data Intelligence and Interdisciplinary Innovation, Nanjing University, Nanjing 210023, China
2School of Information Management, Central China Normal University, Wuhan 430074, China
3School of Information Management, Wuhan University, Wuhan 430072, China
4Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract

[Objective] Existing patent similarity methods underuse the features unique to patent text and partly overlook the characteristics of patent content and structure. This paper proposes a new patent similarity computation method to address these problems. [Methods] Technical compound sentences are generated from the hierarchical features of claims and weighted by information core degree and information richness, so that the patent representation accounts for both the scope of the technical content and the emphasis of the technical information. Patent similarity is then computed on this representation. Comparative experiments on correlation scores and patent classification are used to validate the method. [Results] Compared with benchmark methods of the same kind, the proposed method expresses patent information more fully and is better suited to patent similarity computation. Restructuring claims into technical compound sentences clearly improves model performance, and weighting by information core degree and information richness on top of that improves it further. [Limitations] Experiments were conducted only in the quantum computing field; whether the technical field affects the method's performance remains to be explored. [Conclusions] Organizing information as a claim tree and technical compound sentences makes more efficient use of patent text. Technical compound sentences built from the hierarchical features of claims, together with the corresponding feature weighting, improve patent representation and its performance on similarity tasks.

Keywords: Patent Claims; Patent Similarity; Hierarchy of Claims
Received: 2022-12-19      Published: 2023-05-16
CLC number (ZTFLH): TP393; G255
Funding: *Innovative Research Group Project of the National Natural Science Foundation of China (71921002); Huxiang High-Level Talent Gathering Program (2021RC5029)
Corresponding author: Xiang Shuxuan, ORCID: 0000-0002-3259-7169, E-mail: xsx@smail.nju.edu.cn
Cite this article:
Xiang Shuxuan, Cao Yujie, Mao Jin. Computing Patent Similarity Based on Hierarchical Feature of Claims. Data Analysis and Knowledge Discovery, 2024, 8(2): 33-43.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1340      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I2/33
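Fig.1  Illustration of patent claims
Fig.2  Patent claim tree
Fig.3  Construction of technical compound sentences
Fig.4  Patent representation based on technical compound sentences weighted by information core degree and richness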
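The claim tree and technical compound sentences named in Figs. 2 and 3 can be illustrated with a minimal sketch. The page does not reproduce the construction rules, so everything here is an assumption for illustration: the toy claims, the regular expression that finds a "claim N" reference, and the choice to form one compound sentence per root-to-leaf path of the tree. It is not the authors' implementation.

```python
import re
from collections import defaultdict

# Hypothetical toy claims; real claim wording and reference phrasing vary.
CLAIMS = {
    1: "A quantum processor comprising a qubit array.",
    2: "The processor of claim 1, wherein the qubits are superconducting.",
    3: "The processor of claim 2, further comprising a cryostat.",
    4: "The processor of claim 1, further comprising a control circuit.",
}


def parent_of(text):
    """Return the first claim number a dependent claim refers to, or None."""
    match = re.search(r"\bclaim (\d+)", text)
    return int(match.group(1)) if match else None


def build_tree(claims):
    """Return (roots, children): independent claims and a parent -> dependents map."""
    children, roots = defaultdict(list), []
    for number, text in sorted(claims.items()):
        parent = parent_of(text)
        if parent is None:
            roots.append(number)
        else:
            children[parent].append(number)
    return roots, children


def compound_sentences(claims):
    """Join the claims on every root-to-leaf path (assumed construction rule)."""
    roots, children = build_tree(claims)
    paths = []

    def walk(node, path):
        path = path + [node]
        kids = children.get(node, [])
        if not kids:  # a leaf closes one complete technical scheme
            paths.append(path)
        for kid in kids:
            walk(kid, path)

    for root in roots:
        walk(root, [])
    return [(path, " ".join(claims[n] for n in path)) for path in paths]


for path, sentence in compound_sentences(CLAIMS):
    print(path, "->", sentence)
```

The sketch follows only the first claim reference in each dependent claim, so claims that depend on several parents would need extra handling.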
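Method | Description
Base | Generates the patent representation from the full claims text with SimCSE-Unsup
OS | Generates a sentence vector for each claim with SimCSE-Unsup and sums the vectors as the patent representation
TCS | Reorganizes the claims as technical compound sentences, generates sentence vectors with SimCSE-Unsup, and sums them as the patent representation
TCS-R | Reorganizes the claims as technical compound sentences, weights them by information richness, generates sentence vectors with SimCSE-Unsup, and sums them as the patent representation
TCS-C | Reorganizes the claims as technical compound sentences, weights them by information core degree, generates sentence vectors with SimCSE-Unsup, and sums them as the patent representation
PatentSBERTa | Outperforms comparable methods on patent similarity computation; generates patent text vectors using the AugSBERT model and a citation-tree form
Technological Signature | A vector representation that combines semantic and statistical features; performs well when applied to patent similarity computation
SimCSE | Achieved a breakthrough on semantic similarity tasks; its core is self-supervised learning of sentence vectors through contrastive learning
Doc2Vec | One of the most widely used sentence embedding methods; generates sentence vectors through a task of predicting neighboring words within a document
TFIDF-Mittens | A sentence representation that expresses domain features; extends the GloVe model to produce domain-oriented word vectors, then combines them with weights into sentence vectors
Mittens+WR | A weighted sentence representation that expresses domain features; builds on TFIDF-Mittens and composes sentence vectors with random-walk model weighting
VSM | One of the most interpretable sentence embedding methods; locates high-weight words and combines them with occurrence counts to form sentence vectors
Table 1  Baseline methods for comparison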
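The OS, TCS, TCS-R and TCS-C rows differ only in which sentences are encoded and whether their vectors are weighted before summing. The sketch below shows that shape under stated assumptions: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a SimCSE-Unsup encoder, the weights are passed in rather than derived (the page does not give the formulas for information richness or core degree), and cosine similarity is assumed as the comparison measure.

```python
import hashlib

import numpy as np

DIM = 64  # arbitrary toy dimension


def toy_encode(sentence):
    """Hypothetical stand-in for a sentence encoder: hashed bag-of-words vector."""
    vec = np.zeros(DIM)
    for token in sentence.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec


def represent(sentences, weights=None):
    """Sum sentence vectors; optional weights mimic TCS-R / TCS-C style weighting."""
    if weights is None:
        weights = [1.0] * len(sentences)
    return sum((w * toy_encode(s) for w, s in zip(weights, sentences)), np.zeros(DIM))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


patent_a = represent(["a qubit array", "a control circuit"])
patent_b = represent(["a qubit array", "a cryostat"], weights=[0.7, 0.3])
print(cosine(patent_a, patent_b))
```

Swapping `toy_encode` for a real sentence encoder leaves the rest of the sketch unchanged, which is why the variants in the table can be compared on equal footing.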
Method | Correlation (%) | p-value
Base | 24.55 | 0.0043
OS | 22.43 | 0.0032
TCS | 26.37 | 0.0033
TCS-R | 27.67 | 0.0032
TCS-C | 26.41 | 0.0031
TCS-RC | 27.72 | 0.0031
Table 2  Evaluation results of technical compound sentences and feature weighting
Metric | TCS-RC | TCS
Mean | 0.5409 | 0.5797
Standard deviation | 0.9152 | 1.1598
z-value | -10.6267
p-value (one-sided) | 0.0000
Table 3  Significance test results (I)
Metric | TCS | OS
Mean | 0.5797 | 0.6678
Standard deviation | 1.1598 | 1.5396
z-value | -18.4620
p-value (one-sided) | 0.0000
Table 4  Significance test results (II)
Metric | TCS | Base
Mean | 0.5797 | 0.5929
Standard deviation | 1.1598 | 1.3834
z-value | -2.9395
p-value (one-sided) | 0.0016
Table 5  Significance test results (III)
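The one-sided p-values in Tables 3 to 5 follow from the reported z-values through the standard normal lower tail. The sketch below shows that step, plus an independent-samples z statistic for two equal-sized samples. The second function is an assumption: the page gives neither the sample size nor the exact test form (for example, paired versus independent), so the `n` used in the example call is hypothetical and is not meant to reproduce the reported z-values.

```python
import math


def one_sided_p(z):
    """Lower-tail p-value P(Z <= z) for a standard normal statistic z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_sample_z(mean_a, std_a, mean_b, std_b, n):
    """z statistic for two independent samples of equal size n (assumed test form)."""
    return (mean_a - mean_b) / math.sqrt((std_a ** 2 + std_b ** 2) / n)


# z-value reported for TCS vs. Base in Table 5
print(round(one_sided_p(-2.9395), 4))

# Hypothetical n; the actual sample size is not stated on this page.
print(two_sample_z(0.5797, 1.1598, 0.5929, 1.3834, n=1000))
```

With z = -2.9395 the lower-tail probability rounds to 0.0016, which matches the value listed in Table 5.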
Method | Correlation (%) | p-value
TCS-RC | 27.72 | 0.0031
PatentSBERTa | 13.63 | 0.0035
Technological Signature | 17.90 | 0.0021
SimCSE | 24.55 | 0.0043
Doc2Vec | 21.72 | 0.0036
TFIDF-Mittens | 19.25 | 0.0035
Mittens+WR | 22.16 | 0.0032
VSM | 25.48 | 0.0037
Table 6  Comparison of patent similarity computation methods
Metric | TCS-RC | Mittens+WR
Mean | 0.5409 | 0.9622
Standard deviation | 0.9152 | 0.1446
z-value | -183.9834
p-value (one-sided) | 0.0000
Table 7  Significance test results (IV)
Metric | TCS-RC | VSM
Mean | 0.5409 | 0.5731
Standard deviation | 0.9152 | 0.6738
z-value | -11.4259
p-value (one-sided) | 0.0000
Table 8  Significance test results (V)
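IPC section | Count | Share (%)
G - Physics | 1141 | 53.97
H - Electricity | 690 | 32.64
A - Human Necessities | 120 | 5.68
B - Performing Operations; Transporting | 61 | 2.89
C - Chemistry; Metallurgy | 78 | 3.69
F - Mechanical Engineering | 15 | 0.71
E - Fixed Constructions | 9 | 0.43
Table 9  Distribution of patents in the quantum computing field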
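As a quick consistency check on Table 9, the shares can be recomputed from the counts. The sketch assumes each share is the section count divided by the sum of all listed counts; the dictionary keys are just the section letters from the table.

```python
# Counts per IPC section, copied from Table 9
counts = {"G": 1141, "H": 690, "A": 120, "B": 61, "C": 78, "F": 15, "E": 9}

total = sum(counts.values())
shares = {section: round(100 * n / total, 2) for section, n in counts.items()}

print(total)
for section, share in shares.items():
    print(section, share)
```

Under that assumption the counts sum to 2114 and the recomputed shares agree with the table to two decimal places.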
Method | Loss | Accuracy (%) | Precision (%)
TCS-RC | 0.489 | 74.65 | 66.67
PatentSBERTa | 0.611 | 65.45 | 64.60
Technological Signature | 0.605 | 69.18 | 62.50
SimCSE | 0.659 | 64.32 | 53.33
Doc2Vec | 0.598 | 69.44 | 53.80
TFIDF-Mittens | 0.587 | 72.05 | 66.29
Mittens+WR | 0.638 | 64.93 | 52.20
VSM | 0.655 | 69.01 | 52.71
Table 10  Comparison of model classification performance
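For readers comparing the accuracy and precision columns, the sketch below shows how the two metrics are commonly computed. It assumes a binary task with one positive label; the page does not say whether the paper's classification is binary or multi-class, how precision is averaged, or which loss function is reported, and the labels here are made up.

```python
def accuracy_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for a binary labelling; precision is 0.0 if nothing is predicted positive."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    accuracy = correct / len(pairs)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Hypothetical labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_precision(y_true, y_pred))
```

Accuracy counts every correct prediction, while precision looks only at the items predicted positive, which is why a method can rank differently on the two columns.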
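Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn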