Multi-label Patent Classification with Pre-training Model

doi:10.11925/infotech.2096-3467.2021.0930

Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue (2/3): 129-137



Tong Xinyu, Zhao Ruijie, Lu Yonghe

School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China


Abstract

[Objective] This paper aims to improve automatic patent classification by accurately matching each patent application with one or more suitable IPC classification numbers. [Methods] We constructed a large-scale Chinese patent dataset (CNPatents) and used the first four characters of each IPC classification number as labels. We then trained and tested BERT, RoBERTa, and RBT3 models on this dataset. [Results] On the classification task with more than 600 labels, the best model reached an accuracy of 75.6% and a Micro-F1 of 59.7%. After high-frequency label screening, the accuracy and Micro-F1 increased to 91.2% and 71.7%, respectively. [Limitations] The patent documents used as the training set are extremely imbalanced across labels, and more research is needed to improve the high-frequency label screening used for training. [Conclusions] This paper realizes automatic multi-label patent classification and further improves the performance of the classification model through high-frequency label screening.
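The label construction, high-frequency label screening, and Micro-F1 evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy records, and the `top_k` screening rule (keep the k most frequent labels) are assumptions made for illustration, since the abstract does not state the exact screening criterion.

```python
from collections import Counter


def ipc_subclass(ipc_code: str) -> str:
    """Return the first four characters of an IPC symbol, e.g. 'G06F 17/30' -> 'G06F'."""
    return ipc_code.strip().replace(" ", "")[:4].upper()


def to_label_sets(records):
    """Map each patent's list of IPC symbols to a set of four-character labels."""
    return [{ipc_subclass(code) for code in codes} for codes in records]


def keep_frequent_labels(label_sets, top_k):
    """Keep only the top_k most frequent labels (assumed screening rule).

    Patents left without any label after screening are dropped.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    frequent = {label for label, _ in ranked[:top_k]}
    screened = [labels & frequent for labels in label_sets]
    return [labels for labels in screened if labels], frequent


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy records (hypothetical): each patent carries one or more IPC symbols.
records = [
    ["G06F 17/30", "G06N 3/08"],
    ["G06F 16/35"],
    ["A61K 31/00", "G06F 40/20"],
    ["H04L 9/32"],
]
label_sets = to_label_sets(records)
kept, frequent = keep_frequent_labels(label_sets, top_k=2)

# Hypothetical model predictions for the three patents that survive screening.
predictions = [{"G06F"}, {"G06N"}, {"G06F"}]
score = micro_f1(kept, predictions)
```

Because Micro-F1 pools counts over all labels before computing the score, frequent labels dominate it, which is one reason screening out rare labels can raise the reported value on an imbalanced dataset.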

Key words: Patent Classification; Pre-Training Model; Patent Text Representation

Received: 30 August 2021 Published: 14 April 2022

ZTFLH: G350

Fund: Research and Development Program in Key Areas of Guangdong Province of China (2021B0101420004); Regional Joint Fund Key Projects of Guangdong Province of China (2019B1515120085)

Corresponding Author: Lu Yonghe, ORCID: 0000-0002-7758-9365, E-mail: luyonghe@mail.sysu.edu.cn


Cite this article:

Tong Xinyu, Zhao Ruijie, Lu Yonghe. Multi-label Patent Classification with Pre-training Model. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 129-137.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0930 OR https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/129

Diagram of the Transformer Model

Word Vector Embedding Processing of Input Text by BERT

Full Word Masking Generation Example

Architecture for Multi-Label Patent Classification Based on Pre-Trained Models

Dataset Details

Details of the Dataset after High-Frequency Label Filtering

Details of the CNPatents-Large (30) Dataset after High-Frequency Label Filtering

Details of the CNPatents-Small (30) Dataset after High-Frequency Label Filtering

Specific Data of the Model Used in This Paper

Activation Functions and Loss Functions Used for the Classification Task

Experimental Environment Configuration Parameters

Experimental Parameter Settings

Confusion Matrix for Model Evaluation

Results of Multi-Label Patent Classification

Results after High-Frequency Label Screening

