Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (12): 63-73    DOI: 10.11925/infotech.2096-3467.2017.0820
Original Article
Hierarchical Classification Model for Invention Patents
Zhai Dongsheng, Hu Dengjin, Zhang Jie, He Xijun, Liu He
School of Economics and Management, Beijing University of Technology, Beijing 100124, China
Abstract  

[Objective] This paper proposes a model that applies machine learning classification algorithms to patent information in order to determine a patent's level of invention. [Methods] First, we extracted technical feature words from the patent texts. Second, we built a technical feature vector for each patent from word vectors trained with Word2Vec. Third, we calculated patent text indicators and backward-citation indicators to build the training set. Finally, we trained the classification model with machine learning algorithms. [Results] We examined patents in the field of speech recognition technology with the proposed model. The ratio of advanced-level to entry-level patents was about 1:4, which is in line with the actual situation. [Limitations] Reliance on the WordNet dictionary limits the extraction results. [Conclusions] The proposed model can effectively identify advanced patents and recommend them to business owners.

Key words: Patent Invention Level; Technical Feature Vector; Word Vector; Machine Learning
Received: 15 August 2017      Published: 29 December 2017
ZTFLH:  G350 TP311  

Cite this article:

Zhai Dongsheng, Hu Dengjin, Zhang Jie, He Xijun, Liu He. Hierarchical Classification Model for Invention Patents. Data Analysis and Knowledge Discovery, 2017, 1(12): 63-73.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0820     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I12/63

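Level | Description | Number of trials | Share of patents
1 | A Level 1 invention does not eliminate a conflict and is the smallest kind of invention. Level 1 means the method stays within the boundaries of a single industry and is handled with knowledge held within one relevant engineering discipline. | 1-10 | 32.0%
2 | The problem addressed is technical and can be readily solved with methods already known in the engineering discipline of the relevant system. | 10-100 | 45.0%
3 | The conflict lies within the boundaries of a single discipline (that is, it can be resolved with knowledge from the same science). | 100-1000 | 19.0%
4 | A new technical system is synthesized. Because the new system does not mention resolving a technical conflict, the invention may appear not to have overcome one. In fact the conflicts exist, but they relate to the old technical system. In Level 4 inventions, conflicts are eliminated at the boundaries of the science to which the original problem belongs. | 1000-10000 | ≤4.0%
5 | The invention situation is a complex web of difficult problems. The unbounded growth in the number of trials leads to an entirely new system. Such an invention introduces a new system that, over time, is accompanied by inventions of every level. A new technology is created. | 10000+ | ≤0.3%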
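
Parameter | Meaning | Value
-train | Training data | Patent.txt
-output | Output file for the word vectors | Word2vec_model.bin
-cbow | Whether to use the CBOW model (1: yes, 0: no) | 1
-size | Word vector dimensionality | 400
-window | Context window | 5-10
-threads | Number of threads | 8
-alpha | Learning rate | Default
-min_count | Minimum word frequency | 5
-Algo | Uses negative sampling |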
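
The parameters in the table above can be read as a command line for a word2vec-style trainer. The sketch below only assembles the argument list and does not run anything. The executable name `word2vec` and the choice of 5 from the 5-10 window range are assumptions, and the flag spellings are copied from the table rather than from any particular release (the reference C tool, for instance, spells the frequency cutoff `-min-count` and selects negative sampling with `-negative`).

```python
from typing import Dict, List, Optional

# Values copied from the parameter table; None means "leave at the tool default".
TABLE_PARAMS: Dict[str, Optional[str]] = {
    "-train": "Patent.txt",
    "-output": "Word2vec_model.bin",
    "-cbow": "1",
    "-size": "400",
    "-window": "5",       # assumption: lower end of the 5-10 range in the table
    "-threads": "8",
    "-alpha": None,       # the table says "default", so the flag is omitted
    "-min_count": "5",    # spelling as in the table; check your tool's help text
}


def build_command(params: Dict[str, Optional[str]],
                  executable: str = "word2vec") -> List[str]:
    """Turn the parameter table into an argv-style list.

    `executable` is a hypothetical binary name; replace it with the
    trainer that is actually installed.
    """
    argv = [executable]
    for flag, value in params.items():
        if value is None:
            continue  # keep the tool's own default
        argv.extend([flag, value])
    return argv


if __name__ == "__main__":
    print(" ".join(build_command(TABLE_PARAMS)))
```

Keeping the table as data and generating the command from it makes it easy to rerun training with a different window size from the reported range.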

Actual class | Predicted positive | Predicted negative
Positive | TP | FN
Negative | FP | TN
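
The confusion matrix above is the basis for the precision, recall and F1 values used to compare classifiers. As a minimal sketch, the standard definitions can be written as follows; the counts in the usage example are made up for illustration and are not taken from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion-matrix counts.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = harmonic mean of precision and recall
    A ratio with an empty denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only (not from the paper).
    p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

TN does not enter any of the three measures, which is why they remain informative when the negative class is much larger than the positive one.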

Technical feature word | Importance
‘lattice’ | 1.012178089
‘module’ | 0.40855953
‘concatenate’ | 0.253707282
‘multiple’ | 0.209988341
‘applies’ | 0.165597509
‘field’ | 0.148309666
…… | ……
‘score’ | 0.095205173
‘data’ | 0.039488217
‘speech’ | 0.02694872
‘recognition’ | 0.018010984
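
One plausible way to combine importance scores like those above with Word2Vec word vectors is an importance-weighted average. This page does not state how the paper builds its technical feature vector, so the sketch below is an assumption for illustration only. The three-dimensional vectors are toy values; real vectors would have 400 dimensions and come from the trained model.

```python
from typing import Dict, List


def weighted_feature_vector(weights: Dict[str, float],
                            vectors: Dict[str, List[float]]) -> List[float]:
    """Importance-weighted average of word vectors (an assumed scheme).

    Words without a vector are skipped; weights are renormalised over
    the words that remain.
    """
    shared = [w for w in weights if w in vectors]
    total = sum(weights[w] for w in shared)
    if not shared or total <= 0:
        raise ValueError("no weighted words with vectors")
    dim = len(vectors[shared[0]])
    result = [0.0] * dim
    for word in shared:
        share = weights[word] / total
        for i, x in enumerate(vectors[word]):
            result[i] += share * x
    return result


if __name__ == "__main__":
    # Importance values from the table; the vectors are hypothetical.
    importance = {"lattice": 1.012178089, "module": 0.40855953, "speech": 0.02694872}
    toy_vectors = {
        "lattice": [1.0, 0.0, 0.0],
        "module": [0.0, 1.0, 0.0],
        "speech": [0.0, 0.0, 1.0],
    }
    print(weighted_feature_vector(importance, toy_vectors))
```

With this weighting, a distinctive word such as ‘lattice’ dominates the patent's vector, while generic domain words such as ‘speech’ contribute very little.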

Patent number | Claims: existing-technology keywords | Claims: same-subclass keywords | Claims: same-class keywords | Claims: other keywords
US20020184373A1 | 0.202702703 | 0.027027 | 0 | 0
US20020161579A1 | 0.239130435 | 0 | 0 | 0.021739
US20010041980A1 | 0.246153846 | 0.015385 | 0 | 0
US20040049388A1 | 0.042857143 | 0.007143 | 0 | 0
US20050143989A1 | 0.141176471 | 0 | 0 | 0
US20060200348A1 | 0.193548387 | 0 | 0 | 0
US20060265225A1 | 0.41025641 | 0 | 0 | 0
US20060293899A1 | 0.212121212 | 0.015152 | 0 | 0
US6205425B1 | 0.257142857 | 0.028571 | 0 | 0
US20030055642A1 | 0.347826087 | 0 | 0 | 0

Patent number | Claims: existing-technology non-keywords | Claims: same-subclass non-keywords | Claims: same-class non-keywords | Claims: other non-keywords | Claims: new words
US20020184373A1 | 0.581081 | 0.054054 | 0 | 0 | 0.135135
US20020161579A1 | 0.695652 | 0.021739 | 0 | 0 | 0.021739
US20010041980A1 | 0.723077 | 0 | 0 | 0 | 0.015385
US20040049388A1 | 0.935714 | 0.014286 | 0 | 0 | 0
US20050143989A1 | 0.811765 | 0.011765 | 0 | 0 | 0.035294
US20060200348A1 | 0.806452 | 0 | 0 | 0 | 0
US20060265225A1 | 0.589744 | 0 | 0 | 0 | 0
US20060293899A1 | 0.772727 | 0 | 0 | 0 | 0
US6205425B1 | 0.685714 | 0 | 0 | 0 | 0.028571
US20030055642A1 | 0.652174 | 0 | 0 | 0 | 0
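
Read together, the two claims tables above appear to partition the words of each patent's claims: the four keyword shares, the four non-keyword shares and the new-word share add up to 1 for every patent. That reading is inferred from the published numbers rather than stated on this page. The sketch below checks it for three rows; the dictionary layout and function names are illustrative.

```python
from typing import Dict, List

# Shares copied from the two claims tables, ordered as
# existing technology, same subclass, same class, other.
CLAIM_SHARES: Dict[str, Dict[str, List[float]]] = {
    "US20020184373A1": {
        "keyword": [0.202702703, 0.027027, 0.0, 0.0],
        "non_keyword": [0.581081, 0.054054, 0.0, 0.0],
        "new_word": [0.135135],
    },
    "US20040049388A1": {
        "keyword": [0.042857143, 0.007143, 0.0, 0.0],
        "non_keyword": [0.935714, 0.014286, 0.0, 0.0],
        "new_word": [0.0],
    },
    "US6205425B1": {
        "keyword": [0.257142857, 0.028571, 0.0, 0.0],
        "non_keyword": [0.685714, 0.0, 0.0, 0.0],
        "new_word": [0.028571],
    },
}


def total_share(row: Dict[str, List[float]]) -> float:
    """Sum all nine published proportions for one patent."""
    return sum(sum(values) for values in row.values())


def keyword_share(row: Dict[str, List[float]]) -> float:
    """Fraction of claim words that are technical keywords of any origin."""
    return sum(row["keyword"])


if __name__ == "__main__":
    for patent, row in CLAIM_SHARES.items():
        print(patent, round(total_share(row), 4), round(keyword_share(row), 4))
```

A total close to 1 for every row is a quick sanity check that no word category was dropped when the tables were transcribed.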

Patent number | Novelty section: existing-technology keywords | Novelty section: same-subclass keywords | Novelty section: same-class keywords | Novelty section: other keywords
US20020184373A1 | 0.304347826 | 0.043478 | 0 | 0
US20020161579A1 | 0.666666667 | 0 | 0 | 0
US20010041980A1 | 0.545454545 | 0 | 0 | 0
US20040049388A1 | 0.571428571 | 0 | 0 | 0
US20050143989A1 | 0.75 | 0 | 0 | 0
US20060200348A1 | 0.818181818 | 0 | 0 | 0
US20060265225A1 | 0.4375 | 0 | 0 | 0
US20060293899A1 | 0.444444444 | 0 | 0.055556 | 0
US6205425B1 | 0.307692308 | 0.076923 | 0 | 0
US20030055642A1 | 1 | 0 | 0 | 0
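
Patent number | Novelty section: existing-technology non-keywords | Novelty section: same-subclass non-keywords | Novelty section: same-class non-keywords | Novelty section: other non-keywords | Novelty section: new words
US20020184373A1 | 0.391304 | 0 | 0 | 0 | 0.26087
US20020161579A1 | 0.333333 | 0 | 0 | 0 | 0
US20010041980A1 | 0.363636 | 0 | 0 | 0 | 0.090909
US20040049388A1 | 0.428571 | 0 | 0 | 0 | 0
US20050143989A1 | 0.25 | 0 | 0 | 0 | 0
US20060200348A1 | 0.181818 | 0 | 0 | 0 | 0
US20060265225A1 | 0.5625 | 0 | 0 | 0 | 0
US20060293899A1 | 0.5 | 0 | 0 | 0 | 0
US6205425B1 | 0.615385 | 0 | 0 | 0 | 0
US20030055642A1 | 0 | 0 | 0 | 0 | 0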

Patent number | Same-IPC ratio | Same-subclass ratio | Same-class ratio | Other-IPC ratio | Originality index | Citation lag index
US20020184373A1 | 0.222222222 | 0.666667 | 0.111111 | 0 | 0.839111 | 1.256112
US20020161579A1 | 0.130434783 | 0.217391 | 0.086957 | 0.565217 | 0.93077 | 0.362692
US20010041980A1 | 0.888888889 | 0.111111 | 0 | 0 | 0.865133 | 0.964108
US20040049388A1 | 0.846153846 | 0.128205 | 0 | 0.025641 | 0.945875 | 0.76467
US20050143989A1 | 0.733333333 | 0.233333 | 0.033333 | 0 | 0.8492 | 0.618055
US20060200348A1 | 0.545454545 | 0.363636 | 0 | 0.090909 | 0.799255 | 0.280562
US20060265225A1 | 0.333333333 | 0.333333 | 0 | 0.333333 | 0.48 | 0.727931
US20060293899A1 | 0.285714286 | 0.285714 | 0.214286 | 0.214286 | 0.925187 | 0.15219
US6205425B1 | 0.333333333 | 0.666667 | 0 | 0 | 0.328125 | 0.646833
US20030055642A1 | 0.444444444 | 0.333333 | 0 | 0.222222 | 0.577402 | 0.361963
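
The four ratio columns above sum to 1 for each patent, which suggests that every backward citation is placed in exactly one bucket according to how closely its IPC code matches the citing patent's. The sketch below shows one way such shares could be computed. It assumes one IPC code per patent and interprets the buckets as identical code, same subclass (first four characters, such as G10L) and same class (first three characters, such as G10). The IPC codes in the example are toy values, and the originality and citation lag indices are not reproduced because their formulas are not given on this page.

```python
from collections import Counter
from typing import Dict, List

CATEGORIES = ("same_ipc", "same_subclass", "same_class", "other")


def normalize(code: str) -> str:
    """Strip spaces and upper-case an IPC code, e.g. 'G10L 15/22' -> 'G10L15/22'."""
    return code.replace(" ", "").upper()


def relation(focal: str, cited: str) -> str:
    """Place a cited patent's IPC code in the closest matching bucket."""
    f, c = normalize(focal), normalize(cited)
    if f == c:
        return "same_ipc"
    if f[:4] == c[:4]:
        return "same_subclass"
    if f[:3] == c[:3]:
        return "same_class"
    return "other"


def citation_shares(focal: str, cited_codes: List[str]) -> Dict[str, float]:
    """Share of backward citations falling in each bucket."""
    if not cited_codes:
        return {cat: 0.0 for cat in CATEGORIES}
    counts = Counter(relation(focal, code) for code in cited_codes)
    n = len(cited_codes)
    return {cat: counts[cat] / n for cat in CATEGORIES}


if __name__ == "__main__":
    # Toy IPC codes for illustration; not the citations of any patent in the table.
    cited = ["G10L 15/22", "G10L 15/26", "G10L 13/08", "G10K 11/16", "H04M 3/42"]
    print(citation_shares("G10L 15/22", cited))
```

Checking the buckets from most to least specific guarantees that each citation is counted once, so the shares always sum to 1.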
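
Training algorithm | Precision | Recall | F1
Bayes | 70.40% | 73.20% | 0.7177
Decision tree | 68.30% | 60.20% | 0.6399
Random forest | 81.20% | 75.10% | 0.7803
Support vector machine | 73.90% | 72.70% | 0.7329
Logistic regression | 69.40% | 70.50% | 0.6994
Artificial neural network | 83.50% | 80.10% | 0.8176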
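
The F1 column above can be cross-checked against the other two columns, since F1 is the harmonic mean of precision and recall. The sketch below recomputes it for each algorithm and ranks the algorithms by the published value. The numbers are copied from the table; the tolerance of 0.001 is an assumption that allows for rounding of the published percentages.

```python
from typing import Dict, List, Tuple

# (precision, recall, published F1) copied from the comparison table.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "Bayes": (0.704, 0.732, 0.7177),
    "Decision tree": (0.683, 0.602, 0.6399),
    "Random forest": (0.812, 0.751, 0.7803),
    "Support vector machine": (0.739, 0.727, 0.7329),
    "Logistic regression": (0.694, 0.705, 0.6994),
    "Artificial neural network": (0.835, 0.801, 0.8176),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_by_f1(results: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """Algorithm names sorted from highest to lowest published F1."""
    return sorted(results, key=lambda name: results[name][2], reverse=True)


if __name__ == "__main__":
    for name, (p, r, published) in RESULTS.items():
        print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, published = {published}")
    print("Ranking:", rank_by_f1(RESULTS))
```

On these figures the artificial neural network ranks first and the decision tree last, with the random forest as the closest alternative.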