Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (8): 76-84    DOI: 10.11925/infotech.2096-3467.2017.08.09
Original Article
Patent Classification Based on Multi-feature and Multi-classifier Integration
Jia Shanshan1, Liu Chang2, Sun Lianying3, Liu Xiaoan1, Peng Tao2
1College of Intellectualized City, Beijing Union University, Beijing 100101, China
2College of Robotics, Beijing Union University, Beijing 100101, China
3College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
Abstract  

[Objective] This paper aims to automatically assign the correct International Patent Classification (IPC) code to patent applications using a multi-feature and multi-classifier integration (MFMCI) method. [Methods] First, we extracted four kinds of features from patent applications: TF-IDF features over the full dictionary, TF-IDF features over a dictionary selected by information gain, document (paragraph) vectors, and topic model vectors. Then, we trained NB, SVM, and AdaBoost classifiers on these features. Finally, we built a feature-class matrix and predicted the final IPC code using an F1 weight matrix. [Results] We tested the method on 10 patent classes in the field of engines and pumps, using patents from 2014 to 2016. Accuracy under the Top Prediction, All Categories, and Two Guesses measures was 78.9%, 80.1%, and 91.2%, respectively. [Limitations] The training corpus is small, covering only three years of patent data. [Conclusions] The proposed method can effectively improve the accuracy of patent classification in the field of engines and pumps.

Key words: Patent Classification; Document Vector; Topic Model Vector; Classifier Integration
Received: 31 May 2017      Published: 28 September 2017
CLC Number: G250

Cite this article:

Jia Shanshan, Liu Chang, Sun Lianying, Liu Xiaoan, Peng Tao. Patent Classification Based on Multi-feature and Multi-classifier Integration. Data Analysis and Knowledge Discovery, 2017, 1(8): 76-84.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.08.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I8/76

Word (after stemming) | Information gain
Smoother | 6.64337682087
vesda | 6.64274815818
undamp | 6.64274815818
engin | 6.25488032208
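
The ranking of stemmed words by information gain can be illustrated with a small sketch. It uses the textbook definition of information gain for a binary "term present / term absent" split, with entropy in bits. The source does not state the exact formula or logarithm base the authors used, so this is an assumption and the values are not expected to match the scale of the table above. The toy corpus and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)]."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is a set of stemmed tokens.
docs = [
    {"engin", "valv"},
    {"engin", "pump"},
    {"pump", "rotor"},
    {"rotor", "blade"},
]
labels = ["F01L", "F01L", "F04B", "F04B"]

ig_engin = information_gain(docs, labels, "engin")  # separates the classes fully
ig_pump = information_gain(docs, labels, "pump")    # carries no class information
```

In this toy corpus "engin" appears only in F01L documents, so it receives the maximum gain, while "pump" appears equally in both classes and receives none. A dictionary selected by information gain would keep the highest-scoring terms before computing TF-IDF.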
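
Classifier | Feature | Evaluation method | Accuracy | F1 | Recall | Precision
NB | Full-dictionary TF-IDF | Top Prediction | 71.4% | 71.1% | 71.4% | 72.3%
NB | Information-gain TF-IDF | Top Prediction | 43.9% | 44.7% | 43.9% | 46.1%
SVM | Information-gain TF-IDF | Top Prediction | 64.6% | 64.4% | 64.6% | 68.0%
AdaBoost | Information-gain TF-IDF | Top Prediction | 71.7% | 71.9% | 71.7% | 72.9%
Gaussian-NB | Paragraph vector | Top Prediction | 23.3% | 21.4% | 23.3% | 24.3%
SVM | Paragraph vector | Top Prediction | 48.4% | 48.2% | 48.4% | 48.7%
AdaBoost | Paragraph vector | Top Prediction | 23.6% | 23.6% | 23.6% | 24.1%
Gaussian-NB | Topic vector | Top Prediction | 39.7% | 38.3% | 39.7% | 39.6%
SVM | Topic vector | Top Prediction | 41.7% | 40.4% | 41.7% | 42.2%
AdaBoost | Topic vector | Top Prediction | 41.6% | 40.8% | 41.6% | 40.7%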
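
Experiment | Evaluation method | Features | Algorithm | Accuracy | F1 | Recall | Precision
1 | All Categories | I: Full-dictionary TF-IDF | NB | 73.6% | 73.5% | 73.6% | 74.6%
2 | All Categories | II: Information-gain TF-IDF | AdaBoost | 74.0% | 73.0% | 74.0% | 76.7%
3 | All Categories | III: Paragraph vector | SVM | 49.4% | 49.1% | 49.4% | 49.6%
4 | All Categories | IV: Topic vector | SVM | 42.0% | 41.3% | 42.0% | 41.6%
5 | All Categories | II, III, IV (direct concatenation) | Gaussian-NB | 31.2% | 30.8% | 31.2% | 31.6%
6 | All Categories | III, IV (direct concatenation) | SVM | 34.4% | 33.2% | 34.4% | 34.1%
7 | All Categories | I, II, III, IV | Voting | 72.2% | 73.5% | 72.2% | 74.0%
8 | All Categories | II, III, IV | MFMCI | 54.1% | 52.0% | 54.1% | 56.6%
9 | All Categories | I, III, IV | MFMCI | 79.4% | 78.8% | 79.4% | 81.7%
10 | All Categories | I, II, III, IV | MFMCI | 80.1% | 79.5% | 80.1% | 82.4%
11 | Top Prediction | I: Full-dictionary TF-IDF | NB | 71.4% | 71.1% | 71.4% | 72.3%
12 | Top Prediction | II: Information-gain TF-IDF | AdaBoost | 71.7% | 71.9% | 71.7% | 72.9%
13 | Top Prediction | III: Paragraph vector | SVM | 48.4% | 48.2% | 48.4% | 48.7%
14 | Top Prediction | IV: Topic vector | SVM | 41.7% | 40.4% | 41.7% | 42.2%
15 | Top Prediction | II, III, IV (direct concatenation) | Gaussian-NB | 31.2% | 30.8% | 31.2% | 31.6%
16 | Top Prediction | I, II, III, IV | MFMCI | 78.9% | 78.2% | 78.9% | 81.2%
17 | Two Guesses | I: Full-dictionary TF-IDF | NB | 88.1% | 88.1% | 88.1% | 88.4%
18 | Two Guesses | II: Information-gain TF-IDF | AdaBoost | 89.4% | 89.2% | 89.4% | 89.8%
19 | Two Guesses | III: Paragraph vector | SVM | 68.6% | 68.5% | 68.6% | 68.7%
20 | Two Guesses | IV: Topic vector | SVM | 61.8% | 61.4% | 61.8% | 61.9%
21 | Two Guesses | I, II, III, IV | MFMCI | 91.2% | 91.0% | 91.2% | 91.7%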
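
The three evaluation methods in the table above can be sketched as follows. The definitions here are assumptions based on the usual convention for IPC categorization: Top Prediction compares the top-ranked class with the patent's main IPC class, Two Guesses accepts a hit if the main class is among the top two ranked classes, and All Categories accepts a hit if the top-ranked class is any of the patent's IPC classes. The source page does not spell these definitions out, and the sample data and function names below are hypothetical.

```python
def top_prediction(ranked, main_class):
    """Hit if the highest-ranked predicted class equals the main IPC class."""
    return ranked[0] == main_class


def two_guesses(ranked, main_class):
    """Hit if the main IPC class is among the two highest-ranked predictions."""
    return main_class in ranked[:2]


def all_categories(ranked, all_classes):
    """Hit if the highest-ranked prediction is any IPC class of the patent."""
    return ranked[0] in all_classes


def accuracy(hits):
    """Fraction of hits over all evaluated patents."""
    hits = list(hits)
    return sum(hits) / len(hits)


# Hypothetical samples: (ranked predictions, main IPC class, all IPC classes).
samples = [
    (["F01L", "F02D", "F01N"], "F01L", {"F01L"}),
    (["F02D", "F02B", "F02M"], "F02B", {"F02B", "F02D"}),
    (["F04B", "F04C", "F04D"], "F04D", {"F04D"}),
    (["F03D", "F04D", "F02C"], "F03D", {"F03D", "F04D"}),
]

top_acc = accuracy(top_prediction(r, m) for r, m, _ in samples)
two_acc = accuracy(two_guesses(r, m) for r, m, _ in samples)
all_acc = accuracy(all_categories(r, a) for r, _, a in samples)
```

Under these definitions Top Prediction is the strictest measure, which is consistent with the table: for every feature and algorithm, the Top Prediction accuracy is the lowest of the three.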
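
IPC class | F1, best classifier on full-dictionary TF-IDF | F1, best classifier on information-gain TF-IDF | Predicted probability for one patent, best classifier on full-dictionary TF-IDF | Predicted probability for one patent, best classifier on information-gain TF-IDF
F01L | 86.1% | 83.4% | 66.5% | 11.342%
F01N | 78.1% | 74.2% | 0.6% | 10.001%
F02B | 59.8% | 53.8% | 10.9% | 10.019%
F02C | 76.0% | 87.2% | 0.6% | 9.588%
F02D | 67.1% | 58.3% | 9.6% | 10.022%
F02M | 57.7% | 50.6% | 3.2% | 10.006%
F03D | 94.1% | 96.4% | 0.3% | 9.035%
F04B | 72.6% | 75.6% | 7.1% | 10.004%
F04C | 74.7% | 77.2% | 1.0% | 9.992%
F04D | 69.0% | 62.7% | 0.3% | 9.989%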
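
One plausible way to combine the per-class F1 values and the predicted probabilities in the table above is an F1-weighted sum per class. The exact combination rule used by MFMCI is not given on this page, so the formula below (score of a class = sum over classifiers of F1 times predicted probability) is an assumption for illustration, restricted to the two TF-IDF classifiers shown in the table. The variable and function names are hypothetical.

```python
# Per-class F1 of each classifier (the F1 weight matrix), taken from the table.
f1_weights = {
    "F01L": (0.861, 0.834), "F01N": (0.781, 0.742), "F02B": (0.598, 0.538),
    "F02C": (0.760, 0.872), "F02D": (0.671, 0.583), "F02M": (0.577, 0.506),
    "F03D": (0.941, 0.964), "F04B": (0.726, 0.756), "F04C": (0.747, 0.772),
    "F04D": (0.690, 0.627),
}

# Probabilities predicted by each classifier for one patent, from the table.
probabilities = {
    "F01L": (0.665, 0.11342), "F01N": (0.006, 0.10001), "F02B": (0.109, 0.10019),
    "F02C": (0.006, 0.09588), "F02D": (0.096, 0.10022), "F02M": (0.032, 0.10006),
    "F03D": (0.003, 0.09035), "F04B": (0.071, 0.10004), "F04C": (0.010, 0.09992),
    "F04D": (0.003, 0.09989),
}


def fuse(weights, probs):
    """Assumed rule: class score = sum over classifiers of F1 * probability."""
    return {
        ipc: sum(w * p for w, p in zip(weights[ipc], probs[ipc]))
        for ipc in weights
    }


scores = fuse(f1_weights, probabilities)
# Rank IPC classes from the highest fused score to the lowest.
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these numbers the information-gain classifier is close to uniform (about 10% per class), so the fused ranking is driven mainly by the full-dictionary classifier, and F01L comes out on top. The first two entries of `ranked` would be the candidates checked under a Two Guesses evaluation.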
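
Algorithm | Evaluation method | Accuracy | F1 | Recall
RBFNN (radial basis function neural network) | Top Prediction | 72.2% | 70.7% | 71.0%
MFMCI (this paper) | Top Prediction | 78.9% | 78.2% | 78.9%
MFMCI (this paper) | All Categories | 80.1% | 79.5% | 80.1%
MFMCI (this paper) | Two Guesses | 91.2% | 91.0% | 91.2%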

IPC class | F1 | Recall | Precision
F01L | 86.21% | 98.8% | 76.5%
F01N | 85.60% | 86.8% | 84.4%
F02B | 66.67% | 53.2% | 89.3%
F02C | 82.93% | 81.6% | 84.3%
F02D | 73.31% | 95.6% | 59.5%
F02M | 64.99% | 51.6% | 87.8%
F03D | 91.01% | 97.2% | 85.6%
F04B | 79.29% | 71.2% | 89.5%
F04C | 84.54% | 90.8% | 79.1%
F04D | 80.87% | 74.4% | 88.6%
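
The F1 column in the table above is consistent with the standard harmonic mean of precision and recall. The short check below recomputes it for two rows. Because the table reports precision and recall rounded to one decimal place, the recomputed values agree with the listed F1 only approximately. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed in the table for two IPC classes.
f1_f01l = f1_score(0.765, 0.988)  # table lists 86.21%
f1_f02b = f1_score(0.893, 0.532)  # table lists 66.67%
```

The two rows also show why F1 is a useful per-class weight: F01L has high recall but lower precision, F02B the reverse, and F1 penalizes the imbalance in both cases.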

[1] Cai Hong, Jiang Renai, Wu Kai. Contribution of Intellectual Property Protection to the Technological Progresses in China[J]. Journal of Systems & Management, 2015, 24(3): 314-320.
[2] Ma Fang. Research of Patent Automatic Classification Based on RBFNN[J]. New Technology of Library and Information Service, 2011(12): 58-63.
[3] Liu Guifeng, Wang Manrong, Liu Haijun. Probabilistic Hypergraph Based Semi-supervised Learning Method for Patent Document Categorization[J]. Journal of Intelligence, 2016, 35(9): 187-191, 173.
doi: 10.3969/j.issn.1002-1965.2016.09.033
[4] Venugopalan S, Rai V. Topic Based Classification and Pattern Identification in Patents[J]. Technological Forecasting and Social Change, 2015, 94: 236-250.
doi: 10.1016/j.techfore.2014.10.006
[5] Liao Liefa, Le Fugang, Zhu Yalan. The Application of LDA Model in Patent Text Classification[J]. Journal of Modern Information, 2017, 37(3): 35-39.
[6] Ma Shuanggang. The Study of Automatic Chinese Patent Classification Based on Deep Learning Theory and Method[D]. Zhenjiang: Jiangsu University, 2016.
[7] Kong Qi. Large-scale Patent Classification Based on Parallel Machine Learning[D]. Shanghai: Shanghai Jiaotong University, 2011.
[8] Miu Jianming, Jia Guangwei, Zhang Yunliang. The Rapid Automatic Categorization of Patent Based on Abstract Text[J]. Information Studies: Theory & Application, 2016, 39(8): 103-105, 91.
[9] Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[OL]. arXiv Preprint, arXiv: 1405.4053.
[10] Mikolov T. Statistical Language Models Based on Neural Networks[D]. Brno University of Technology, 2012.
[11] Turian J, Ratinov L, Bengio Y. Word Representations: A Simple and General Method for Semi-supervised Learning[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 2010: 384-394.
[12] Rosen-Zvi M, Griffiths T, Steyvers M, et al. The Author-topic Model for Authors and Documents[C]//Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 2004: 487-494.
[13] Fall C J, Törcsvári A, Benzineb K, et al. Automated Categorization in the International Patent Classification[J]. ACM SIGIR Forum, 2003, 37(1): 10-25.
doi: 10.1145/945546.945547