1 School of Information & Communication Engineering, Beijing Information Science & Technology University, Beijing 100101, China
2 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science & Technology University, Beijing 100101, China
[Objective] This paper proposes a feature fusion method for patent text classification. It aims to address the low recall of existing methods, which make no use of out-of-vocabulary (unregistered) words. [Methods] First, we fused the sentence vector produced by a pre-trained BERT model with a proper noun vector. Then, we used the TF-IDF value of each proper noun as the weight assigned to its vector. [Results] We evaluated the model on a self-built patent text corpus. Its accuracy, recall and F1 values were 84.43%, 82.01% and 81.23%, respectively. The F1 value was about 5.7% higher than that of the other methods compared. [Limitations] The experimental data were collected mainly from the field of new energy vehicles, so the corpus needs to be expanded to other fields. [Conclusions] The proposed method can effectively handle the unbalanced data and out-of-vocabulary words found in patent texts.
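The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two vectors are combined, so concatenation is assumed here, and the smoothed IDF formula, the normalized weighted average, the toy corpus, and the hand-written embeddings (standing in for BERT output) are all hypothetical choices made for illustration.

```python
import math

def tf_idf(term, doc_terms, corpus):
    """TF-IDF of `term` in one tokenized document (smoothed IDF, an assumption)."""
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

def weighted_noun_vector(nouns, doc_terms, corpus, embed):
    """TF-IDF weighted average of the proper noun vectors (assumed combination)."""
    dim = len(next(iter(embed.values())))
    weights = [tf_idf(n, doc_terms, corpus) for n in nouns]
    total = sum(weights)
    vec = [0.0] * dim
    for n, w in zip(nouns, weights):
        for i, x in enumerate(embed[n]):
            vec[i] += w * x / total
    return vec

def fuse(sentence_vec, noun_vec):
    """Fuse by concatenation (assumed; the abstract does not specify the operator)."""
    return list(sentence_vec) + list(noun_vec)

# Toy data: tokenized patent snippets and 2-d stand-in embeddings (hypothetical).
corpus = [
    ["battery", "pack", "cooling", "battery"],
    ["motor", "controller", "cooling"],
    ["battery", "charging", "station"],
]
doc = corpus[0]
nouns = ["battery", "cooling"]
embed = {"battery": [1.0, 0.0], "cooling": [0.0, 1.0]}
sentence_vec = [0.5, 0.5, 0.5]  # placeholder for a BERT sentence vector

noun_vec = weighted_noun_vector(nouns, doc, corpus, embed)
fused = fuse(sentence_vec, noun_vec)
print(fused)
```

In this toy case both nouns occur in the same number of documents, so their weights differ only through term frequency, and the more frequent noun contributes more to the fused vector. A classifier would then be trained on `fused` in place of the sentence vector alone.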
肖悦珺, 李红莲, 张乐, 吕学强, 游新冬. 特征融合的中文专利文本分类方法研究*[J]. 数据分析与知识发现, 2022, 6(4): 49-59.
Xiao Yuejun, Li Honglian, Zhang Le, Lv Xueqiang, You Xindong. Classifying Chinese Patent Texts with Feature Fusion. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59.