Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 53-59    DOI: 10.11925/infotech.2096-3467.2018.1317
Measuring Patent Similarity with Word Embedding and Statistical Features
Yan Yu1,2, Lei Chen1, Jinde Jiang3, Naixuan Zhao1
1 Information Service Department, Nanjing Tech University, Nanjing 210009, China
2 Department of Computer Engineering, Southeast University Chengxian College, Nanjing 211816, China
3 School of Economics and Management, Nanjing Xiaozhuang University, Nanjing 210028, China
Abstract  

[Objective] This paper proposes a new method for measuring patent similarity that exploits the semantic relationships between words to improve measurement performance. [Methods] First, we used a neural network-based word vector model to capture the semantic information of the words in patents. Then, we computed statistical features of the words to gauge their importance. Finally, we combined the word embeddings and the statistical features to represent patent texts and measure their similarity. [Results] The accuracy of the proposed method was 13.92% higher than that of the traditional methods. [Limitations] More research is needed on the strategy for selecting auxiliary patent texts. [Conclusions] Combining word embeddings and statistical features can effectively improve patent similarity measurement.
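
The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical feature is used or how it is combined with the embeddings, so the code below assumes TF-IDF as the statistical feature, a TF-IDF-weighted average of word vectors as the text representation, and cosine similarity as the measure. The three-dimensional vectors in TOY_EMBEDDINGS are hand-made placeholders standing in for a trained word vector model, and all function names are hypothetical; none of this is taken from the paper itself.

```python
import math
from collections import Counter

# Hand-made 3-dimensional vectors (an assumption for illustration only);
# a real system would load vectors from a trained word embedding model.
TOY_EMBEDDINGS = {
    "image":        [0.90, 0.10, 0.00],
    "picture":      [0.85, 0.15, 0.00],
    "segmentation": [0.80, 0.20, 0.10],
    "partition":    [0.75, 0.25, 0.10],
    "payment":      [0.00, 0.10, 0.90],
    "transaction":  [0.10, 0.00, 0.95],
    "method":       [0.30, 0.30, 0.30],
}


def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Smoothed idf, so a word found in every document keeps weight > 0.
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return weights


def doc_vector(word_weights, embeddings):
    """Weighted average of word vectors; words without a vector are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in word_weights.items():
        vec = embeddings.get(word)
        if vec is None:
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Three tiny "patent texts", already tokenized.
docs = [
    ["image", "segmentation", "method"],
    ["picture", "partition", "method"],
    ["payment", "transaction", "method"],
]
vectors = [doc_vector(w, TOY_EMBEDDINGS) for w in tfidf_weights(docs)]
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

In this toy setting the first two texts share only the word "method", so a purely word-matching comparison would treat them much like the third text, while the embedding-based representation places them close together because "image"/"picture" and "segmentation"/"partition" have similar vectors. The TF-IDF weights reduce the influence of "method", which occurs in every text.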

Key words: Patent Similarity; Word Embedding; Statistical Feature
Received: 25 November 2018      Published: 23 October 2019
ZTFLH:  G202 G35  

Cite this article:

Yan Yu, Lei Chen, Jinde Jiang, Naixuan Zhao. Measuring Patent Similarity with Word Embedding and Statistical Features. Data Analysis and Knowledge Discovery, 2019, 3(9): 53-59.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1317     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I9/53

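IPC subclass | Meaning of IPC subclass | Number of patent texts to be analyzed | Number of auxiliary patent texts
G06F | Electric digital data processing | 500 | 10,000
G06K | Recognition of data; presentation of data; record carriers; handling record carriers | 500 | 10,000
G06M | Counting mechanisms; counting of objects not otherwise provided for | 500 | 10,000
G06Q | Data processing systems or methods specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes | 500 | 10,000
G06T | Image data processing or generation, in general | 500 | 10,000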