Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 129-137    DOI: 10.11925/infotech.2096-3467.2021.0930
Multi-label Patent Classification with Pre-training Model
Tong Xinyu, Zhao Ruijie, Lu Yonghe
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  

[Objective] This paper improves automatic patent classification by accurately matching patent applications with one or more suitable IPC classification numbers. [Methods] We constructed a large-scale Chinese patent dataset (CNPatents) and used the first four digits of IPC classification numbers as labels. We then trained and tested BERT, RoBERTa, and RBT3 models. [Results] On the classification task with more than 600 labels, the best model reached an accuracy of 75.6% and a Micro-F1 value of 59.7%. After high-frequency label screening, accuracy and Micro-F1 increased to 91.2% and 71.7%. [Limitations] The patent documents used as the training set suffer from extreme label imbalance, and the high-frequency label screening applied here needs further research and refinement. [Conclusions] This paper realizes automatic multi-label patent classification and further improves classification performance through high-frequency label screening.

Key words: Patent Classification; Pre-Training Model; Patent Text Representation
Received: 30 August 2021      Published: 14 April 2022
Chinese Library Classification (ZTFLH): G350
Fund: Research and Development Program in Key Areas of Guangdong Province of China (2021B0101420004); Regional Joint Fund Key Projects of Guangdong Province of China (2019B1515120085)
Corresponding Author: Lu Yonghe, ORCID: 0000-0002-7758-9365, E-mail: luyonghe@mail.sysu.edu.cn

Cite this article:

Tong Xinyu, Zhao Ruijie, Lu Yonghe. Multi-label Patent Classification with Pre-training Model. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 129-137.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0930     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/129

Diagram of the Transformer Model
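For reference alongside the diagram, the core operation of the Transformer encoder that BERT stacks is scaled dot-product self-attention:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

where Q, K, and V are the query, key, and value matrices projected from the input embeddings, and d_k is the dimension of the keys.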
Word Vector Embedding Processing of Input Text by BERT
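As a companion to the figure, here is a minimal sketch (illustrative, not the authors' code) of how input text is tokenized and encoded into BERT input tensors with the Hugging Face transformers library, assuming the publicly released HFL checkpoint named below:

```python
# Sketch: converting raw Chinese text into BERT input tensors.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")

text = "使用语言模型来预测下一个词的probability。"
encoded = tokenizer(
    text,
    max_length=200,        # matches the MAX_LEN setting used in this paper
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
# input_ids:      token ids, with [CLS] ... [SEP] added automatically
# token_type_ids: segment embedding ids (all 0 for a single sentence)
# attention_mask: 1 for real tokens, 0 for padding
print(encoded["input_ids"].shape)  # torch.Size([1, 200])
```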
Description | Example
Original text | 使用语言模型来预测下一个词的probability。 ("Use a language model to predict the probability of the next word.")
Segmented text | 使用 语言 模型 来 预测 下 一个 词 的 probability 。
Original (character-level) masked input | 使 用 语 言 [MASK] 型 来 [MASK] 测 下 一 个 词 的 pro [MASK] ##lity 。
Whole-word masked input | 使 用 语 言 [MASK] [MASK] 来 [MASK] [MASK] 下 一 个 词 的 [MASK] [MASK] [MASK] 。
Whole Word Masking Generation Example
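The difference between the two masking styles in the table can be expressed in a few lines. Below is a simplified sketch, assuming word segmentation is already available; real whole-word masking operates on WordPiece sub-tokens, so this is illustrative only:

```python
# Simplified whole-word masking: when a word is selected, every
# character of that word is replaced with [MASK], not just one.
import random

def whole_word_mask(words, mask_ratio=0.15, seed=0):
    rng = random.Random(seed)
    tokens = []
    for word in words:
        if rng.random() < mask_ratio:
            tokens.extend(["[MASK]"] * len(word))  # mask the whole word
        else:
            tokens.extend(list(word))              # keep characters unchanged
    return tokens

words = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词", "的", "probability", "。"]
print(" ".join(whole_word_mask(words, seed=1)))
```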
Architecture for Multi-Label Patent Classification Based on Pre-Trained Models
Dataset | Patent texts | Labels | Training set | Test set
CNPatents-Large | 1 033 917 | 654 | 827 134 | 206 783
CNPatents-Small | 398 527 | 638 | 318 822 | 79 705
Dataset Details
Dataset | Patent texts | Labels | Training set | Test set
CNPatents-Large(30) | 685 133 | 30 | 548 106 | 137 027
CNPatents-Small(30) | 314 424 | 30 | 251 539 | 62 885
Details of the Dataset after High-Frequency Label Filtering
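A minimal sketch of the high-frequency label screening step behind these figures: keep only the 30 most frequent IPC labels and discard patents left with no label. The function name and data layout are illustrative assumptions, not the authors' code:

```python
# High-frequency label screening: retain only the top-k IPC labels.
from collections import Counter

def filter_top_labels(samples, k=30):
    """samples: list of (text, labels) pairs; labels are IPC subclasses like 'G06F'."""
    counts = Counter(label for _, labels in samples for label in labels)
    top = {label for label, _ in counts.most_common(k)}
    filtered = []
    for text, labels in samples:
        kept = [l for l in labels if l in top]
        if kept:                       # drop patents with no surviving label
            filtered.append((text, kept))
    return filtered, sorted(top)
```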
Label | Texts | Label | Texts | Label | Texts
G06F | 43 183 | H04W | 12 522 | G01R | 7 086
Y02E | 28 405 | Y02P | 12 126 | H02J | 7 086
G06K | 26 263 | H01M | 9 922 | G01S | 6 809
H04L | 24 516 | Y02A | 9 625 | G05B | 6 506
G06Q | 23 626 | Y02T | 8 473 | C08L | 6 494
G01N | 17 212 | B01D | 8 201 | Y02B | 6 484
G06N | 15 570 | A61K | 8 191 | C04B | 6 410
G06T | 15 059 | C02F | 7 370 | B01J | 6 340
H01L | 13 611 | A61B | 7 215 | C22C | 6 288
H04N | 12 730 | C08K | 7 120 | G02B | 6 189
Details of the CNPatents-Large(30) Dataset after High-Frequency Label Filtering
Label | Texts | Label | Texts | Label | Texts
G06F | 12 344 | H04W | 2 819 | G01R | 1 740
G06K | 9 209 | Y02P | 2 554 | A61K | 1 714
G06Q | 7 293 | H01M | 2 537 | G08G | 1 681
Y02E | 7 054 | Y02T | 2 226 | Y02B | 1 654
G06N | 6 566 | Y02A | 2 196 | H02J | 1 653
H04L | 6 534 | B01D | 1 931 | G05D | 1 637
G06T | 4 885 | G01S | 1 883 | G05B | 1 635
G01N | 4 110 | A61B | 1 825 | B25J | 1 618
H01L | 3 725 | C22C | 1 793 | F24F | 1 599
H04N | 3 616 | B08B | 1 792 | G02B | 1 560
Details of the CNPatents-Small(30) Dataset after High-Frequency Label Filtering
Attribute | BERT-wwm-ext | RoBERTa-wwm-ext | RBT3
Masking strategy | Whole Word Masking | Whole Word Masking | Whole Word Masking
Base model | BERT-base | BERT-base | BERT-base
Training data | Chinese Wikipedia plus other encyclopedia, news, and Q&A data, 5.4 billion words in total (same for all three models)
Training steps | 1M @ max_len 128 + 400K @ max_len 512 | 1M @ max_len 512 | 1M @ max_len 512 + 1M @ max_len 512
Batch size | 2,560 / 384 | 384 | 384
Optimizer | LAMB | AdamW | AdamW
Vocabulary size | 21 128 | 21 128 | 21 128
Specifications of the Pre-trained Models Used in This Paper
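All three models in the table are publicly released by HFL; below is a sketch of loading them for fine-tuning with the transformers library. The checkpoint identifiers are the public Hugging Face names, assumed here rather than taken from the paper:

```python
# Sketch: loading the three pre-trained checkpoints compared in this paper.
from transformers import BertModel, BertTokenizer

CHECKPOINTS = {
    "BERT-wwm-ext":    "hfl/chinese-bert-wwm-ext",
    "RoBERTa-wwm-ext": "hfl/chinese-roberta-wwm-ext",
    "RBT3":            "hfl/rbt3",   # 3-layer compact RoBERTa-wwm-ext variant
}

# All three use the BERT architecture and tokenizer, so BertModel works for each.
tokenizer = BertTokenizer.from_pretrained(CHECKPOINTS["BERT-wwm-ext"])
model = BertModel.from_pretrained(CHECKPOINTS["BERT-wwm-ext"])
print(model.config.hidden_size)  # 768 for all three (RBT3 keeps the base width)
```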
Classification task | Output-layer activation function | Corresponding loss function
Multi-label classification | Sigmoid() | BCEWithLogitsLoss()
Activation Functions and Loss Functions Used for the Classification Task
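With more than 600 labels, each patent receives an independent per-label probability rather than a softmax over classes, which is why Sigmoid() pairs with BCEWithLogitsLoss(): the loss fuses the sigmoid with binary cross-entropy in one numerically stable operation. A minimal PyTorch sketch of such a classification head, assuming the HFL checkpoint named below; this is an illustration, not the authors' implementation:

```python
# Multi-label classification head on top of a pre-trained BERT encoder.
# The model emits raw logits; BCEWithLogitsLoss applies the sigmoid internally.
import torch
from torch import nn
from transformers import BertModel

class MultiLabelClassifier(nn.Module):
    def __init__(self, num_labels, checkpoint="hfl/chinese-bert-wwm-ext"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)   # one logit per label

loss_fn = nn.BCEWithLogitsLoss()                    # sigmoid + binary cross-entropy
# At prediction time, labels are those with sigmoid(logit) above a threshold:
# predicted = torch.sigmoid(logits) > 0.5
```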
Environment | Configuration
CPU | Intel Xeon Gold 6139M (2.3–3.7 GHz)
GPU | NVIDIA GeForce RTX 2080 Ti
Memory | 8 × 11 GB
Operating system | Ubuntu 16.04, 64-bit
Language | Python
Experimental Environment Configuration Parameters
Parameter | Value
MAX_LEN | 200
TRAIN_BATCH_SIZE | 16
VALID_BATCH_SIZE | 16
EPOCHS | 3
LEARNING_RATE | 1e-5
Experimental Parameter Settings
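A sketch of how these settings would plug into a fine-tuning loop, reusing the MultiLabelClassifier and loss_fn sketched earlier; train_dataset and valid_dataset are placeholders for the CNPatents splits, not code from the paper:

```python
# Wiring the hyperparameters from the table into a fine-tuning loop.
import torch
from torch.utils.data import DataLoader

MAX_LEN = 200            # tokenizer truncation/padding length
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 1e-5

train_loader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, shuffle=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = loss_fn(logits, batch["labels"].float())  # multi-hot targets
        loss.backward()
        optimizer.step()
```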
Predicted \ Actual | Positive | Negative
Predicted Positive | TP (positive sample predicted positive) | FP (negative sample predicted positive)
Predicted Negative | FN (positive sample predicted negative) | TN (negative sample predicted negative)
Confusion Matrix for Model Evaluation
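The confusion-matrix counts above feed the two metrics reported below. Accuracy is the share of correct predictions; Micro-F1 pools TP, FP, and FN over all labels before computing precision and recall, so frequent labels carry more weight:

```latex
P_{\mathrm{micro}} = \frac{\sum_{i} TP_i}{\sum_{i} (TP_i + FP_i)}, \qquad
R_{\mathrm{micro}} = \frac{\sum_{i} TP_i}{\sum_{i} (TP_i + FN_i)}, \qquad
F1_{\mathrm{micro}} = \frac{2\, P_{\mathrm{micro}} R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}
```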
Model | CNPatents-Large Accuracy | CNPatents-Large Micro-F1 | CNPatents-Small Accuracy | CNPatents-Small Micro-F1
BERT-wwm-ext | 0.659 | 0.597 | 0.756 | 0.506
RoBERTa-wwm-ext | 0.657 | 0.594 | 0.746 | 0.470
RBT3 | 0.646 | 0.567 | 0.736 | 0.439
Results of Multi-Label Patent Classification
Model | CNPatents-Large Accuracy | CNPatents-Large Micro-F1 | CNPatents-Small Accuracy | CNPatents-Small Micro-F1
BERT-wwm-ext | 0.862 | 0.717 | 0.912 | 0.693
RoBERTa-wwm-ext | 0.863 | 0.717 | 0.912 | 0.696
RBT3 | 0.860 | 0.707 | 0.910 | 0.669
Results after High-Frequency Label Screening