[Objective] This paper aims to improve automatic patent classification and to accurately match each patent application with one or more suitable IPC codes. [Methods] We constructed a large-scale Chinese patent dataset (CNPatents) and used the first four characters of each IPC code (the subclass level) as labels. We then trained and tested BERT, RoBERTa, and RBT3 models. [Results] On the full classification task with more than 600 labels, the best model reached an accuracy of 75.6% and a Micro-F1 of 59.7%; after high-frequency label screening, these rose to 91.2% and 71.7%, respectively. [Limitations] The patent documents used for training suffer from extreme label imbalance, and the high-frequency label screening strategy needs further research. [Conclusions] This paper realizes multi-label automatic patent classification and further improves model performance through high-frequency label screening.
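The two preprocessing steps described above, truncating IPC codes to the four-character subclass level and screening out low-frequency labels, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter

def truncate_ipc(codes):
    # Keep only the first four characters of each IPC code,
    # e.g. "G06F16/35" -> "G06F" (section + class + subclass).
    # Deduplicate, since several full codes may share one subclass.
    return sorted({c[:4] for c in codes})

def filter_high_frequency(samples, min_count):
    # samples: list of (text, [labels]) pairs.
    # Keep only labels that occur at least min_count times in the corpus,
    # then drop samples left with no label at all.
    counts = Counter(label for _, labels in samples for label in labels)
    keep = {label for label, n in counts.items() if n >= min_count}
    filtered = []
    for text, labels in samples:
        kept = [label for label in labels if label in keep]
        if kept:
            filtered.append((text, kept))
    return filtered
```

For example, with three documents labeled `["G06F", "H04L"]`, `["G06F"]`, and `["H01M"]` and `min_count=2`, only `G06F` survives the screening, and the third document is removed because it retains no label.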