Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (8): 76-84     https://doi.org/10.11925/infotech.2096-3467.2017.08.09
Special Issue of the First "Data Analysis and Knowledge Discovery" Academic Symposium (II)
Patent Classification Based on Multi-feature and Multi-classifier Integration*
Jia Shanshan (贾杉杉)1, Liu Chang (刘畅)2, Sun Lianying (孙连英)3, Liu Xiaoan (刘小安)1, Peng Tao (彭涛)2
1 College of Intellectualized City, Beijing Union University, Beijing 100101, China
2 College of Robotics, Beijing Union University, Beijing 100101, China
3 College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
Abstract

[Objective] This paper proposes a patent classification method based on multi-feature and multi-classifier integration (labeled MFMCI in the result tables) to assign IPC codes to patent applications accurately. [Methods] First, we extracted four feature sets from patent applications: TFIDF features over the full dictionary, TFIDF features over an information-gain dictionary, paragraph vector (document vector) features, and topic model vector features. Then, we trained Naive Bayes (NB), support vector machine (SVM), and AdaBoost classifiers on each feature set. Finally, we built a feature-class matrix and combined it with an F1 weight matrix to obtain the final IPC prediction. [Results] We evaluated the method on 10 IPC subclasses in the field of engines or pumps, using patents from 2014 to 2016. The accuracies under the Top Prediction, All Categories, and Two Guesses evaluation methods were 78.9%, 80.1%, and 91.2%, respectively. [Limitations] The training corpus is limited in size: it covers only three years of patent data (2014 to 2016). [Conclusions] In the field of engines or pumps, the proposed method effectively improves the accuracy of patent text classification.

Keywords: Patent Classification; Document Vector; Topic Model Vector; Classifier Integration
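Received: 2017-05-31      Published: 2017-09-28
CLC number (ZTFLH): G250
Funding: *This work is one of the research outcomes of the National Key Research and Development Program of China project "Public Safety Risk Prevention and Control and Emergency Technical Equipment" (Grant No. 2016YFC0802107) and the General Program of the Science and Technology Plan of the Beijing Municipal Education Commission (Grant No. SQKM201411417013).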
Cite this article:
Jia Shanshan, Liu Chang, Sun Lianying, Liu Xiaoan, Peng Tao. Patent Classification Based on Multi-feature and Multi-classifier Integration[J]. Data Analysis and Knowledge Discovery, 2017, 1(8): 76-84.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.08.09      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I8/76
Figure: Overall system design framework
Figure: Schematic of the paragraph vector algorithm [9]
Figure: Evaluation methods [13]
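The three evaluation methods named in the caption above (Top Prediction, All Categories, Two Guesses) can be sketched as follows. The figure itself is not available here, so the definitions are assumptions modeled on the evaluation scheme of Fall et al. [13]: Top Prediction compares the top-ranked class with the main IPC class, Two Guesses accepts the main class anywhere in the top two predictions, and All Categories accepts a top prediction that matches any IPC class assigned to the patent. The function names and the toy samples are hypothetical.

```python
from typing import Sequence, Set


def top_prediction(ranked: Sequence[str], main_class: str) -> bool:
    """Correct if the highest-ranked prediction equals the main IPC class."""
    return ranked[0] == main_class


def two_guesses(ranked: Sequence[str], main_class: str) -> bool:
    """Correct if the main IPC class is among the two highest-ranked predictions."""
    return main_class in ranked[:2]


def all_categories(ranked: Sequence[str], assigned: Set[str]) -> bool:
    """Correct if the highest-ranked prediction is any class assigned to the patent."""
    return ranked[0] in assigned


def accuracy(hits: Sequence[bool]) -> float:
    """Fraction of patents judged correct."""
    return sum(hits) / len(hits)


# Hypothetical toy data: (ranked predictions, main class, all assigned classes).
samples = [
    (["F01L", "F02D"], "F01L", {"F01L"}),
    (["F02B", "F02D"], "F02D", {"F02D", "F02B"}),
    (["F04B", "F04C"], "F04D", {"F04D"}),
]

top_acc = accuracy([top_prediction(r, m) for r, m, _ in samples])
two_acc = accuracy([two_guesses(r, m) for r, m, _ in samples])
all_acc = accuracy([all_categories(r, a) for r, _, a in samples])
```

Under these definitions, Two Guesses and All Categories can never score below Top Prediction on the same predictions, which is consistent with the ordering of the reported accuracies (78.9%, 80.1%, 91.2%).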
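Table: Information gain values of selected words

| Word (after stemming) | Information gain |
|---|---|
| Smoother | 6.64337682087 |
| vesda | 6.64274815818 |
| undamp | 6.64274815818 |
| engin | 6.25488032208 |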
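For orientation, here is a minimal sketch of the textbook information gain of a term with respect to the class labels, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). This is an assumption about the general technique, not the authors' exact formula: with 10 classes the textbook value cannot exceed log2(10), about 3.32 bits, so the values in the table above were evidently computed with a different variant or scaling that is not given here. The toy corpus and all names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """Textbook IG: class entropy minus the entropy remaining once we know
    whether the term occurs in the document."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Hypothetical toy corpus: each document is a set of stemmed words.
docs = [{"engin", "valv"}, {"engin", "fuel"}, {"pump", "rotor"}, {"pump", "blade"}]
labels = ["F01L", "F02M", "F04B", "F04D"]

ig_engin = information_gain(docs, labels, "engin")
ig_valv = information_gain(docs, labels, "valv")
```

Ranking the vocabulary by such a score and keeping the highest-scoring words is one common way to build a reduced dictionary like the information-gain dictionary used for the second TFIDF feature set.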
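Table: Performance of each classifier on different features

| Classifier | Feature | Evaluation method | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|---|---|
| NB | Full-dictionary TFIDF | Top Prediction | 71.4% | 71.1% | 71.4% | 72.3% |
| NB | Information-gain TFIDF | Top Prediction | 43.9% | 44.7% | 43.9% | 46.1% |
| SVM | Information-gain TFIDF | Top Prediction | 64.6% | 64.4% | 64.6% | 68.0% |
| AdaBoost | Information-gain TFIDF | Top Prediction | 71.7% | 71.9% | 71.7% | 72.9% |
| Gaussian-NB | Paragraph vector | Top Prediction | 23.3% | 21.4% | 23.3% | 24.3% |
| SVM | Paragraph vector | Top Prediction | 48.4% | 48.2% | 48.4% | 48.7% |
| AdaBoost | Paragraph vector | Top Prediction | 23.6% | 23.6% | 23.6% | 24.1% |
| Gaussian-NB | Topic vector | Top Prediction | 39.7% | 38.3% | 39.7% | 39.6% |
| SVM | Topic vector | Top Prediction | 41.7% | 40.4% | 41.7% | 42.2% |
| AdaBoost | Topic vector | Top Prediction | 41.6% | 40.8% | 41.6% | 40.7% |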
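Two of the four feature sets in the table above are TFIDF vectors, built over either the full dictionary or a reduced information-gain dictionary. The sketch below shows one plain TFIDF variant (relative term frequency times log(N/df)) with an optional vocabulary filter to mimic the reduced dictionary. The exact weighting and normalization used by the authors are not stated here, so the formula, the `tfidf` function, and the toy corpus are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Set


def tfidf(corpus: List[List[str]],
          vocabulary: Optional[Set[str]] = None) -> List[Dict[str, float]]:
    """Return one sparse TFIDF vector (term -> weight) per document.

    vocabulary=None keeps every term (full dictionary); passing a set keeps
    only those terms (for example, words selected by information gain).
    Assumption: term frequency is normalized by the full document length
    even when the vocabulary is restricted.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
            if vocabulary is None or term in vocabulary
        })
    return vectors


# Hypothetical toy corpus of stemmed tokens.
corpus = [["engin", "valv", "valv"], ["engin", "pump"], ["pump", "rotor"]]

full_vectors = tfidf(corpus)                             # full dictionary
ig_vectors = tfidf(corpus, vocabulary={"valv", "rotor"})  # reduced dictionary
```

A term that is frequent in one document but rare across the corpus ("valv" here) receives a higher weight than a term spread over many documents ("engin").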
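Table: Experimental results for different feature combinations and classifier integration

Feature sets: I = full-dictionary TFIDF; II = information-gain TFIDF; III = paragraph vector; IV = topic vector.

| Experiment | Evaluation method | Features | Algorithm | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|---|---|---|
| 1 | All Categories | I | NB | 73.6% | 73.5% | 73.6% | 74.6% |
| 2 | All Categories | II | AdaBoost | 74.0% | 73.0% | 74.0% | 76.7% |
| 3 | All Categories | III | SVM | 49.4% | 49.1% | 49.4% | 49.6% |
| 4 | All Categories | IV | SVM | 42.0% | 41.3% | 42.0% | 41.6% |
| 5 | All Categories | II, III, IV (direct concatenation) | Gaussian-NB | 31.2% | 30.8% | 31.2% | 31.6% |
| 6 | All Categories | III, IV (direct concatenation) | SVM | 34.4% | 33.2% | 34.4% | 34.1% |
| 7 | All Categories | I, II, III, IV | Voting | 72.2% | 73.5% | 72.2% | 74.0% |
| 8 | All Categories | II, III, IV | MFMCI | 54.1% | 52.0% | 54.1% | 56.6% |
| 9 | All Categories | I, III, IV | MFMCI | 79.4% | 78.8% | 79.4% | 81.7% |
| 10 | All Categories | I, II, III, IV | MFMCI | 80.1% | 79.5% | 80.1% | 82.4% |
| 11 | Top Prediction | I | NB | 71.4% | 71.1% | 71.4% | 72.3% |
| 12 | Top Prediction | II | AdaBoost | 71.7% | 71.9% | 71.7% | 72.9% |
| 13 | Top Prediction | III | SVM | 48.4% | 48.2% | 48.4% | 48.7% |
| 14 | Top Prediction | IV | SVM | 41.7% | 40.4% | 41.7% | 42.2% |
| 15 | Top Prediction | II, III, IV (direct concatenation) | Gaussian-NB | 31.2% | 30.8% | 31.2% | 31.6% |
| 16 | Top Prediction | I, II, III, IV | MFMCI | 78.9% | 78.2% | 78.9% | 81.2% |
| 17 | Two Guesses | I | NB | 88.1% | 88.1% | 88.1% | 88.4% |
| 18 | Two Guesses | II | AdaBoost | 89.4% | 89.2% | 89.4% | 89.8% |
| 19 | Two Guesses | III | SVM | 68.6% | 68.5% | 68.6% | 68.7% |
| 20 | Two Guesses | IV | SVM | 61.8% | 61.4% | 61.8% | 61.9% |
| 21 | Two Guesses | I, II, III, IV | MFMCI | 91.2% | 91.0% | 91.2% | 91.7% |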
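Experiment 7 in the table above combines the four per-feature classifiers by voting. A minimal sketch of plain majority voting is given below as a baseline illustration. How the authors break ties is not stated here, so resolving ties in favor of the earliest-listed classifier is an assumption, and `majority_vote` and the sample votes are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the label predicted by the most classifiers.

    Assumption: ties go to the label that appears first in `predictions`,
    i.e. to the earliest-listed classifier.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label always has the best count")


# Hypothetical votes from the classifiers trained on feature sets I to IV.
clear_winner = majority_vote(["F02B", "F02D", "F02D", "F02M"])
tie_case = majority_vote(["F01L", "F01N"])
```

Plain voting gives every classifier the same say regardless of how reliable it is for a given class, which is the property a per-class F1 weighting is meant to address.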
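Table: Differences between the best full-dictionary TFIDF classifier and the best information-gain TFIDF classifier

| IPC code | F1 (best full-dictionary TFIDF classifier) | F1 (best information-gain TFIDF classifier) | Predicted probability for one patent (best full-dictionary TFIDF classifier) | Predicted probability for one patent (best information-gain TFIDF classifier) |
|---|---|---|---|---|
| F01L | 86.1% | 83.4% | 66.5% | 11.342% |
| F01N | 78.1% | 74.2% | 0.6% | 10.001% |
| F02B | 59.8% | 53.8% | 10.9% | 10.019% |
| F02C | 76.0% | 87.2% | 0.6% | 9.588% |
| F02D | 67.1% | 58.3% | 9.6% | 10.022% |
| F02M | 57.7% | 50.6% | 3.2% | 10.006% |
| F03D | 94.1% | 96.4% | 0.3% | 9.035% |
| F04B | 72.6% | 75.6% | 7.1% | 10.004% |
| F04C | 74.7% | 77.2% | 1.0% | 9.992% |
| F04D | 69.0% | 62.7% | 0.3% | 9.989% |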
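The table above pairs each classifier's per-class F1 with its predicted probabilities for one patent. One way to combine them, shown below, is to weight each classifier's class probability by that classifier's F1 for the class and sum over classifiers. The exact MFMCI combination rule is not given here, so this weighted sum is an assumption, and only the two classifiers from the table are included rather than all four. The numbers are copied from the table; the variable names are hypothetical.

```python
from typing import Dict, List

# Values in percent, copied from the table above.
f1_full = {"F01L": 86.1, "F01N": 78.1, "F02B": 59.8, "F02C": 76.0, "F02D": 67.1,
           "F02M": 57.7, "F03D": 94.1, "F04B": 72.6, "F04C": 74.7, "F04D": 69.0}
f1_ig = {"F01L": 83.4, "F01N": 74.2, "F02B": 53.8, "F02C": 87.2, "F02D": 58.3,
         "F02M": 50.6, "F03D": 96.4, "F04B": 75.6, "F04C": 77.2, "F04D": 62.7}
prob_full = {"F01L": 66.5, "F01N": 0.6, "F02B": 10.9, "F02C": 0.6, "F02D": 9.6,
             "F02M": 3.2, "F03D": 0.3, "F04B": 7.1, "F04C": 1.0, "F04D": 0.3}
prob_ig = {"F01L": 11.342, "F01N": 10.001, "F02B": 10.019, "F02C": 9.588,
           "F02D": 10.022, "F02M": 10.006, "F03D": 9.035, "F04B": 10.004,
           "F04C": 9.992, "F04D": 9.989}


def f1_weighted_scores(f1_tables: List[Dict[str, float]],
                       prob_tables: List[Dict[str, float]]) -> Dict[str, float]:
    """Assumed rule: score(c) = sum over classifiers of F1_k(c) * P_k(c),
    with both quantities converted from percent to fractions."""
    classes = f1_tables[0].keys()
    return {
        c: sum((f1[c] / 100) * (p[c] / 100) for f1, p in zip(f1_tables, prob_tables))
        for c in classes
    }


scores = f1_weighted_scores([f1_full, f1_ig], [prob_full, prob_ig])
predicted = max(scores, key=scores.get)
ranking = sorted(scores, key=scores.get, reverse=True)  # for a Two Guesses style output
```

In this example the information-gain classifier's probabilities are close to uniform (roughly 9% to 11% per class), so the combined score is driven mainly by the full-dictionary classifier.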
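Table: Comparison between this work and other work

| Algorithm | Evaluation method | Accuracy | F1 | Recall |
|---|---|---|---|---|
| RBFNN (radial basis function neural network) | Top Prediction | 72.2% | 70.7% | 71.0% |
| MFMCI (this paper) | Top Prediction | 78.9% | 78.2% | 78.9% |
| MFMCI (this paper) | All Categories | 80.1% | 79.5% | 80.1% |
| MFMCI (this paper) | Two Guesses | 91.2% | 91.0% | 91.2% |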
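Table: MFMCI prediction performance for each class

| IPC code | F1 | Recall | Precision |
|---|---|---|---|
| F01L | 86.21% | 98.8% | 76.5% |
| F01N | 85.60% | 86.8% | 84.4% |
| F02B | 66.67% | 53.2% | 89.3% |
| F02C | 82.93% | 81.6% | 84.3% |
| F02D | 73.31% | 95.6% | 59.5% |
| F02M | 64.99% | 51.6% | 87.8% |
| F03D | 91.01% | 97.2% | 85.6% |
| F04B | 79.29% | 71.2% | 89.5% |
| F04C | 84.54% | 90.8% | 79.1% |
| F04D | 80.87% | 74.4% | 88.6% |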
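As a consistency check on the per-class table above, F1 can be recomputed as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below does this with the rounded values copied from the table, so small differences from the reported F1 are expected; the variable names are hypothetical.

```python
from typing import Dict, Tuple

# (reported F1, recall, precision), all in percent, copied from the table above.
rows: Dict[str, Tuple[float, float, float]] = {
    "F01L": (86.21, 98.8, 76.5),
    "F01N": (85.60, 86.8, 84.4),
    "F02B": (66.67, 53.2, 89.3),
    "F02C": (82.93, 81.6, 84.3),
    "F02D": (73.31, 95.6, 59.5),
    "F02M": (64.99, 51.6, 87.8),
    "F03D": (91.01, 97.2, 85.6),
    "F04B": (79.29, 71.2, 89.5),
    "F04C": (84.54, 90.8, 79.1),
    "F04D": (80.87, 74.4, 88.6),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recomputed = {ipc: f1_score(p, r) for ipc, (_, r, p) in rows.items()}
# Largest absolute difference between reported and recomputed F1, in percentage points.
max_gap = max(abs(recomputed[ipc] - rows[ipc][0]) for ipc in rows)
```

The recomputed values agree with the reported F1 column to within 0.1 percentage points, a gap that is plausibly explained by the precision and recall columns being rounded to one decimal place.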
References
[1] Cai Hong, Jiang Renai, Wu Kai. Contribution of Intellectual Property Protection to the Technological Progresses in China[J]. Journal of Systems & Management, 2015, 24(3): 314-320. (in Chinese)
[2] Ma Fang. Research of Patent Automatic Classification Based on RBFNN[J]. New Technology of Library and Information Service, 2011(12): 58-63. (in Chinese)
[3] Liu Guifeng, Wang Manrong, Liu Haijun. Probabilistic Hypergraph Based Semi-supervised Learning Method for Patent Document Categorization[J]. Journal of Intelligence, 2016, 35(9): 187-191, 173. (in Chinese) doi: 10.3969/j.issn.1002-1965.2016.09.033
[4] Venugopalan S, Rai V. Topic Based Classification and Pattern Identification in Patents[J]. Technological Forecasting and Social Change, 2015, 94: 236-250. doi: 10.1016/j.techfore.2014.10.006
[5] Liao Liefa, Le Fugang, Zhu Yalan. The Application of LDA Model in Patent Text Classification[J]. Journal of Modern Information, 2017, 37(3): 35-39. (in Chinese)
[6] Ma Shuanggang. The Study of Automatic Chinese Patent Classification Based on Deep Learning Theory and Method[D]. Zhenjiang: Jiangsu University, 2016. (in Chinese)
[7] Kong Qi. Large-scale Patent Classification Based on Parallel Machine Learning[D]. Shanghai: Shanghai Jiao Tong University, 2011. (in Chinese)
[8] Miu Jianming, Jia Guangwei, Zhang Yunliang. The Rapid Automatic Categorization of Patent Based on Abstract Text[J]. Information Studies: Theory & Application, 2016, 39(8): 103-105, 91. (in Chinese)
[9] Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[OL]. arXiv Preprint, arXiv:1405.4053.
[10] Mikolov T. Statistical Language Models Based on Neural Networks[D]. Brno University of Technology, 2012.
[11] Turian J, Ratinov L, Bengio Y. Word Representations: A Simple and General Method for Semi-supervised Learning[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 2010: 384-394.
[12] Rosen-Zvi M, Griffiths T, Steyvers M, et al. The Author-topic Model for Authors and Documents[C]//Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 2004: 487-494.
[13] Fall C J, Törcsvári A, Benzineb K, et al. Automated Categorization in the International Patent Classification[J]. ACM SIGIR Forum, 2003, 37(1): 10-25. doi: 10.1145/945546.945547