Multi-label Patent Classification with Pre-training Model
Tong Xinyu, Zhao Ruijie, Lu Yonghe
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract [Objective] This paper aims to improve automatic patent classification so that each patent application is accurately matched with one or more suitable IPC classification numbers. [Methods] We constructed a large-scale Chinese patent dataset (CNPatents) and used the first four characters of each IPC classification number as labels. We then trained and tested BERT, RoBERTa, and RBT3 models on this dataset. [Results] On the classification task with more than 600 labels, the best model reached an accuracy of 75.6% and a Micro-F1 of 59.7%. After high-frequency label screening, accuracy and Micro-F1 rose to 91.2% and 71.7%, respectively. [Limitations] The patent documents used as the training set are extremely imbalanced across labels, and the high-frequency label screening applied to the training data needs further research. [Conclusions] This paper achieves automatic multi-label patent classification and shows that high-frequency label screening further improves the performance of the classification model.
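The labelling, screening, and scoring steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `min_count` threshold, and the sample IPC codes are all assumptions made for illustration, and the abstract does not state which screening criterion was actually used. The sketch derives four-character labels from full IPC codes, keeps only labels that occur at least `min_count` times, and computes Micro-F1 by pooling true positives, false positives, and false negatives over all documents and labels.

```python
from collections import Counter


def ipc_subclass(code: str) -> str:
    """Return the first four characters of an IPC code, e.g. 'G06F 16/35' -> 'G06F'."""
    return code.strip().upper()[:4]


def to_label_sets(ipc_lists):
    """Map each patent's list of full IPC codes to a set of four-character labels."""
    return [{ipc_subclass(code) for code in codes} for codes in ipc_lists]


def screen_high_frequency(label_sets, min_count):
    """Keep only labels seen at least `min_count` times (hypothetical criterion).

    Returns the screened label sets and the set of retained labels.
    Documents left with no label after screening are dropped.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    keep = {label for label, n in counts.items() if n >= min_count}
    screened = [labels & keep for labels in label_sets]
    return [labels for labels in screened if labels], keep


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool TP, FP, and FN over all documents and labels."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (invented for illustration, not taken from CNPatents).
patents = [["G06F 16/35", "G06N 3/08"], ["G06F 40/30"], ["A61K 31/00", "G06F 17/00"]]
labels = to_label_sets(patents)
screened, kept = screen_high_frequency(labels, min_count=2)
print(labels, kept)
```

Because Micro-F1 pools counts over every label, frequent labels dominate the score, which is one reason removing rare labels can raise it on an imbalanced dataset.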
Received: 30 August 2021
Published: 14 April 2022
Fund: Research and Development Program in Key Areas of Guangdong Province of China (2021B0101420004); Regional Joint Fund Key Projects of Guangdong Province of China (2019B1515120085)
Corresponding Author:
Lu Yonghe, ORCID: 0000-0002-7758-9365
E-mail: luyonghe@mail.sysu.edu.cn