Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 53-59    DOI: 10.11925/infotech.2096-3467.2018.1317
Measuring Patent Similarity with Word Embedding and Statistical Features
Yan Yu1,2, Lei Chen1, Jinde Jiang3, Naixuan Zhao1
1 Information Service Department, Nanjing Tech University, Nanjing 210009, China
2 Department of Computer Engineering, Southeast University Chengxian College, Nanjing 211816, China
3 School of Economics and Management, Nanjing Xiaozhuang University, Nanjing 210028, China
[Objective] This paper proposes a new method for measuring patent similarity that exploits the semantic relationships between words to improve measurement performance. [Methods] First, we introduced a neural network-based word vector model to obtain semantic information from patent words. Then, we computed statistical features of the words to gauge their significance. Finally, we combined the word embeddings and statistical features to represent patent texts and measure their similarity. [Results] The accuracy of the proposed method was 13.92% higher than that of the traditional methods. [Limitations] More research is needed on the strategy for selecting auxiliary patent texts. [Conclusions] Combining word embeddings and statistical features can effectively improve patent similarity measurement.
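
The three steps in [Methods] can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: TF-IDF stands in for the "statistical features", the combination is a weighted average of word vectors, and similarity is the cosine of the resulting text vectors. The embedding values, the toy corpus, and the names `EMBEDDINGS`, `idf_weights`, `doc_vector`, and `cosine` are hypothetical and chosen only for illustration; in practice the vectors would come from a trained word vector model.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
# A real system would load vectors from a trained word embedding model.
EMBEDDINGS = {
    "image": [0.9, 0.1, 0.0],
    "pixel": [0.8, 0.2, 0.1],
    "payment": [0.0, 0.9, 0.2],
    "account": [0.1, 0.8, 0.3],
    "method": [0.3, 0.3, 0.3],
}


def idf_weights(corpus):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}


def doc_vector(tokens, idf):
    """TF-IDF-weighted average of the word vectors of the known tokens."""
    term_freq = Counter(t for t in tokens if t in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    total_weight = 0.0
    for word, count in term_freq.items():
        weight = count * idf.get(word, 1.0)  # assumed statistical feature: TF-IDF
        total_weight += weight
        for i, x in enumerate(EMBEDDINGS[word]):
            vec[i] += weight * x
    return [v / total_weight for v in vec] if total_weight else vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy tokenized "patent texts" (hypothetical).
corpus = [
    ["image", "pixel", "method"],
    ["pixel", "image", "image", "method"],
    ["payment", "account", "method"],
]
idf = idf_weights(corpus)
vectors = [doc_vector(doc, idf) for doc in corpus]

sim_image = cosine(vectors[0], vectors[1])  # two image-processing texts
sim_cross = cosine(vectors[0], vectors[2])  # image-processing vs. payment text
print(f"same topic: {sim_image:.3f}, different topic: {sim_cross:.3f}")
```

In this sketch, a word such as "method" that appears in every text receives the lowest weight, so topic-specific words dominate the text vector. That is the intuition behind pairing a significance measure with embeddings, though the paper's actual weighting scheme may differ.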

Key words: Patent Similarity; Word Embedding; Statistical Feature
Received: 25 November 2018      Published: 23 October 2019
ZTFLH:  G202 G35  

Yan Yu, Lei Chen, Jinde Jiang, Naixuan Zhao. Measuring Patent Similarity with Word Embedding and Statistical Features. Data Analysis and Knowledge Discovery, 2019, 3(9): 53-59.

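IPC subclass | Meaning | Patent texts to be analyzed | Auxiliary patent texts
G06F | Electric digital data processing | 500 | 10,000
G06K | Recognition of data; presentation of data; record carriers; handling record carriers | 500 | 10,000
G06M | Counting mechanisms; counting of objects not otherwise provided for | 500 | 10,000
G06Q | Data processing systems or methods specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes | 500 | 10,000
G06T | Image data processing or generation, in general | 500 | 10,000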
