Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 53-59    DOI: 10.11925/infotech.2096-3467.2018.1317
Research Paper
Measuring Patent Similarity with Word Embedding and Statistical Features *
Yan Yu 1,2, Lei Chen 1, Jinde Jiang 3, Naixuan Zhao 1
1 Information Service Department, Nanjing Tech University, Nanjing 210009, China
2 Department of Computer Engineering, Southeast University Chengxian College, Nanjing 211816, China
3 School of Economics and Management, Nanjing Xiaozhuang University, Nanjing 210028, China
Abstract

[Objective] Traditional patent similarity measures ignore the semantic relationships between words. This paper proposes a new patent similarity measurement method that addresses this problem and improves measurement accuracy. [Methods] First, we introduce a neural network-based word vector model to obtain the semantic information of the words in patent texts. Then, we compute statistical features of the words to gauge how important each word is in a patent text. Finally, we combine the word embeddings and the statistical features to represent patent texts and measure their similarity. [Results] The accuracy of the proposed method was 13.92% higher than that of the traditional vector space model representation of patent texts. [Limitations] More research is needed on the strategy for selecting the auxiliary patent text set. [Conclusions] Representing patent texts in a vector space that combines word embeddings and statistical features can significantly improve the accuracy of patent similarity measurement.

Key words: Patent Similarity; Word Embedding; Statistical Feature
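The abstract describes the method only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that the statistical feature is a smoothed TF-IDF weight and that the combination is a weighted average of word vectors compared by cosine similarity; neither choice is confirmed by this page. The function names and the toy 3-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(corpus):
    """Smoothed inverse document frequency for every word in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def doc_vector(tokens, embeddings, idf):
    """TF-IDF-weighted average of the word vectors of one document (assumed scheme)."""
    tf = Counter(t for t in tokens if t in embeddings and t in idf)
    if not tf:
        # No known words: fall back to a zero vector of the embedding dimension.
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    weights = np.array([count * idf[w] for w, count in tf.items()])
    vectors = np.array([embeddings[w] for w in tf])
    return weights @ vectors / weights.sum()


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Hypothetical toy embeddings; real ones would come from a trained word vector model.
embeddings = {
    "image": np.array([0.9, 0.1, 0.0]),
    "picture": np.array([0.85, 0.15, 0.05]),
    "segmentation": np.array([0.7, 0.3, 0.1]),
    "payment": np.array([0.0, 0.2, 0.9]),
    "transaction": np.array([0.05, 0.25, 0.85]),
    "method": np.array([0.3, 0.9, 0.3]),
}

# Three tiny tokenized "patents": two about images, one about payments.
patents = [
    ["image", "segmentation", "method"],
    ["picture", "segmentation", "method"],
    ["payment", "transaction", "method"],
]

idf = idf_weights(patents)
vecs = [doc_vector(p, embeddings, idf) for p in patents]
sim_close = cosine(vecs[0], vecs[1])  # image patent vs. picture patent
sim_far = cosine(vecs[0], vecs[2])    # image patent vs. payment patent
```

In this toy setup the first two patents share no exact match for "image" and "picture", yet their vectors end up close because the embeddings of the two words are close. A purely term-based vector space model would miss that relationship, which is the gap the abstract says the method is meant to fill.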
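Received: 2018-11-25
CLC number: G202 G35
Funding: *This work was supported by the National Social Science Fund of China general project "Research on Multi-dimensional and Multi-level Patent Text Mining to Support Innovative Design in the Big Data Era" (Grant No. 17BTQ059) and the Ministry of Education Humanities and Social Sciences planning project "Research on Skill Knowledge Graph Construction in the Big Data Era" (Grant No. 16YJAZH073).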
Cite this article:
Yan Yu, Lei Chen, Jinde Jiang, Naixuan Zhao. Measuring Patent Similarity with Word Embedding and Statistical Features *[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 53-59. DOI: 10.11925/infotech.2096-3467.2018.1317.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1317
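IPC subclass | Meaning of IPC subclass | Patent texts to be analyzed | Auxiliary patent texts
G06F | Electric digital data processing | 500 | 10,000
G06K | Recognition of data; presentation of data; record carriers; handling record carriers | 500 | 10,000
G06M | Counting mechanisms; counting of objects not otherwise provided for | 500 | 10,000
G06Q | Data processing systems or methods specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes | 500 | 10,000
G06T | Image data processing or generation, in general | 500 | 10,000
Table 1  Basic information of the dataset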
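Fig. 1  Effectiveness of the word-relevance weighted representation
Fig. 2  Effectiveness of the semantic VSM representation
Fig. 3  Comparison of the effectiveness of the word-vector weighted representation and the semantic VSM representation
Fig. 4  Comparison of methods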