Classifying Chinese Patent Texts with Feature Fusion
Xiao Yuejun1,2, Li Honglian1, Zhang Le2, Lv Xueqiang2, You Xindong2
1 School of Information & Communication Engineering, Beijing Information Science & Technology University, Beijing 100101, China
2 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science & Technology University, Beijing 100101, China
Abstract [Objective] This paper proposes a feature fusion method for patent text classification, aiming to address the low recall of existing methods, which fail to utilize unregistered (out-of-vocabulary) words. [Methods] First, we fused the sentence vector pre-trained by BERT with the proper-noun vector. Then, we used the TF-IDF values of the proper nouns as the weights assigned to their vectors. [Results] We evaluated our model on a self-built patent text corpus. Its accuracy, recall and F1 values were 84.43%, 82.01% and 81.23%, respectively; the F1 value was about 5.7% higher than those of other methods. [Limitations] The experimental data were mainly collected from the field of new energy vehicles and need to be expanded. [Conclusions] The proposed method can effectively handle the unbalanced data and unregistered words in patent texts.
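The fusion step described in [Methods] can be illustrated with a minimal sketch. The toy corpus, the 4-dimensional stand-in vectors, and all variable names below are illustrative assumptions, not the paper's actual data or implementation; in the paper the sentence vector comes from pre-trained BERT and the proper-noun vectors are learned embeddings.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of a term within one document of a small corpus (smoothed IDF)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1.0
    return tf * idf

# Toy tokenized patent texts (proper nouns only) -- illustrative, not the paper's corpus.
corpus = [
    ["battery", "charging", "management"],
    ["motor", "control"],
    ["battery", "energy", "recovery"],
]

# Stand-ins for a BERT sentence vector and proper-noun vectors (4-dim for brevity;
# a real BERT sentence vector would be 768-dim).
sentence_vec = [0.2, 0.1, 0.4, 0.3]
noun_vecs = {
    "battery":  [0.5, 0.0, 0.1, 0.2],
    "charging": [0.0, 0.3, 0.2, 0.1],
}

doc = corpus[0]
# Weight each proper-noun vector by its TF-IDF value in this document, then sum.
weighted_sum = [0.0] * len(sentence_vec)
for term, vec in noun_vecs.items():
    if term in doc:
        w = tf_idf(term, doc, corpus)
        weighted_sum = [s + w * x for s, x in zip(weighted_sum, vec)]

# Fuse by concatenation: [BERT sentence vector ; TF-IDF-weighted proper-noun vector].
fused = sentence_vec + weighted_sum
print(len(fused))  # 8
```

The fused vector would then feed a downstream classifier; concatenation is one common fusion choice, assumed here for illustration.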
Received: 19 August 2021
Published: 12 May 2022
Fund: National Natural Science Foundation of China (62171043); "Diligent Talents" Training Scheme Foundation of Beijing Information Science & Technology University (QXTCP B201908)
Corresponding Authors:
Zhang Le,ORCID:0000-0002-9620-511X
E-mail: zhangle@bistu.edu.cn