Measuring Patent Similarity with Word Embedding and Statistical Features
Yan Yu1,2, Lei Chen1, Jinde Jiang3, Naixuan Zhao1
1 Information Service Department, Nanjing Tech University, Nanjing 210009, China
2 Department of Computer Engineering, Southeast University Chengxian College, Nanjing 211816, China
3 School of Economics and Management, Nanjing Xiaozhuang University, Nanjing 210028, China
[Objective] This paper proposes a new method for measuring patent similarity that exploits the semantic relationships between words to improve measurement performance. [Methods] First, we introduced a neural network-based word vector model to capture the semantic information of the words in patents. Then, we computed statistical features of each word to gauge its significance. Finally, we combined the word embeddings with the statistical features to represent patent texts and measure their similarity. [Results] The accuracy of the proposed method was 13.92% higher than that of traditional methods. [Limitations] More research is needed on the strategy for selecting auxiliary patent texts. [Conclusions] Combining word embeddings with statistical features can effectively improve patent similarity measurement.
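The three steps in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which statistical features are used or how they are combined with the embeddings, so the sketch assumes TF-IDF weights, a weighted average of word vectors, and cosine similarity. The toy vectors and the names `TOY_VECTORS`, `idf_weights`, `doc_vector`, and `patent_similarity` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors. A real system would train them
# with a neural word vector model on a patent corpus.
TOY_VECTORS = {
    "battery": [0.9, 0.1, 0.0],
    "cell": [0.8, 0.2, 0.1],
    "electrode": [0.7, 0.3, 0.0],
    "antenna": [0.0, 0.2, 0.9],
    "signal": [0.1, 0.1, 0.8],
    "method": [0.3, 0.3, 0.3],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def doc_vector(doc, idf, vectors):
    """TF-IDF-weighted average of the word vectors of a tokenized document.

    TF-IDF is an assumed stand-in for the paper's statistical features.
    """
    tf = Counter(w for w in doc if w in vectors)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        weight = count * idf.get(word, 1.0)
        weight_sum += weight
        for i, x in enumerate(vectors[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no known words: zero vector
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patent_similarity(doc_a, doc_b, idf, vectors=TOY_VECTORS):
    """Similarity of two tokenized patent texts under the assumptions above."""
    return cosine(doc_vector(doc_a, idf, vectors),
                  doc_vector(doc_b, idf, vectors))


# Toy corpus: two battery-related texts and one antenna-related text.
docs = [
    ["battery", "cell", "electrode", "method"],
    ["battery", "electrode", "method"],
    ["antenna", "signal", "method"],
]
idf = idf_weights(docs)
sim_close = patent_similarity(docs[0], docs[1], idf)
sim_far = patent_similarity(docs[0], docs[2], idf)
```

Under these assumptions, `sim_close` (the two battery texts) comes out higher than `sim_far` (battery versus antenna), and a word that appears in every text, such as "method", receives the smallest weight.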
Yan Yu, Lei Chen, Jinde Jiang, Naixuan Zhao. Measuring Patent Similarity with Word Embedding and Statistical Features[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 53-59.