Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (4): 49-59    DOI: 10.11925/infotech.2096-3467.2021.0852
Classifying Chinese Patent Texts with Feature Fusion
Xiao Yuejun1,2, Li Honglian1, Zhang Le2, Lv Xueqiang2, You Xindong2
1School of Information & Communication Engineering, Beijing Information Science & Technology University, Beijing 100101, China
2Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science & Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes a feature fusion method for patent text classification to address the low recall of existing methods, which do not make use of unregistered (out-of-vocabulary) words. [Methods] First, we fused the sentence vector produced by a pre-trained BERT model with the proper noun vectors. Then, we used the TF-IDF value of each proper noun as the weight assigned to its vector. [Results] We evaluated the model on a self-built patent text corpus. Its accuracy, recall, and F1 value were 84.43%, 82.01%, and 81.23%, respectively. The F1 value was about 5.7 percentage points higher than that of the best compared method. [Limitations] The experimental data were mainly collected from the field of new energy vehicles and need to be expanded to other fields. [Conclusions] The proposed method can effectively handle the unbalanced data and unregistered words in patent texts.
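
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the vectors are combined, so the concatenation below, the smoothed IDF formula, the toy 4-dimensional vectors, and the names `tfidf_weights` and `fuse` are all assumptions made for illustration. In practice the sentence vector would come from a BERT encoder (768 dimensions according to the hyperparameter table).

```python
import math
from collections import Counter


def tfidf_weights(doc_terms, corpus_terms):
    """Return a TF-IDF weight for each distinct term in doc_terms.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, which is an
    assumption; the abstract does not state the exact formula.
    """
    if not doc_terms:
        return {}
    n_docs = len(corpus_terms)
    weights = {}
    for term, count in Counter(doc_terms).items():
        df = sum(1 for doc in corpus_terms if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_terms)) * idf
    return weights


def fuse(sentence_vec, noun_vecs, weights):
    """Concatenate the sentence vector with the weighted sum of noun vectors.

    Concatenation is an assumed fusion operator, and the noun vectors are
    assumed to have the same dimension as the sentence vector.
    """
    dim = len(sentence_vec)
    pooled = [0.0] * dim
    for term, w in weights.items():
        vec = noun_vecs[term]
        for i in range(dim):
            pooled[i] += w * vec[i]
    return list(sentence_vec) + pooled


# Toy data: three "patents", each reduced to its extracted proper nouns.
corpus_terms = [["battery", "electrode"], ["battery", "motor"], ["gearbox"]]
noun_vecs = {
    "battery": [1.0, 0.0, 0.0, 0.0],
    "electrode": [0.0, 1.0, 0.0, 0.0],
}
sentence_vec = [0.1, 0.2, 0.3, 0.4]  # stand-in for a BERT sentence vector

weights = tfidf_weights(corpus_terms[0], corpus_terms)
fused = fuse(sentence_vec, noun_vecs, weights)
```

In this toy corpus "electrode" occurs in fewer documents than "battery", so it receives the larger weight and contributes more to the fused representation, which is the intended effect of weighting proper nouns by TF-IDF.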

Key words: Patent; Text Classification; Feature Fusion; BERT; TF-IDF
Received: 19 August 2021      Published: 12 May 2022
CLC number (ZTFLH): TP391
Fund: National Natural Science Foundation of China (62171043); "Diligent Talents" Training Scheme Foundation of Beijing Information Science Technology University (QXTCP B201908)
Corresponding Author: Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn

Cite this article:

Xiao Yuejun, Li Honglian, Zhang Le, Lv Xueqiang, You Xindong. Classifying Chinese Patent Texts with Feature Fusion. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0852     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I4/49

Example of Original Sentence
Schematic Diagram of BERT Model
Data Example
Patent Text Classification Method Based on Feature Fusion
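| Class | Category | Class | Category |
| --- | --- | --- | --- |
| B01 | Physical or chemical processes and apparatus | B25 | Hand tools; workshop equipment |
| B02 | Crushing, pulverising, or disintegrating | B26 | Hand cutting tools |
| B05 | Spraying or atomising in general | B28 | Working cement, clay, or stone |
| B07 | Separating solids; sorting | B29 | Working of plastics |
| B08 | Cleaning | B60 | Vehicles in general |
| B21 | Mechanical metal-working without cutting | B62 | Land vehicles for travelling otherwise than on rails |
| B22 | Casting; powder metallurgy | B63 | Ships; ship-related equipment |
| B23 | Machine tools; metal-working not otherwise provided for | B65 | Conveying; packing; storing |
| B24 | Grinding; polishing | B66 | Hoisting; lifting; hauling |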
The Specific Categories of Part B
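| Class | Patents | Class | Patents | Class | Patents |
| --- | --- | --- | --- | --- | --- |
| B01 | 751 | B22 | 122 | B29 | 189 |
| B02 | 231 | B23 | 1,007 | B60 | 10,104 |
| B05 | 247 | B24 | 199 | B62 | 1,126 |
| B07 | 146 | B25 | 441 | B63 | 116 |
| B08 | 389 | B26 | 114 | B65 | 1,056 |
| B21 | 361 | B28 | 107 | B66 | 301 |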
The Distribution of Patents in Part B
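| Class | Patents | Log ratio | Target patents | Class | Patents | Log ratio | Target patents |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B01 | 751 | 2.8756 | 912 | B25 | 441 | 2.6444 | 838 |
| B02 | 231 | 2.3636 | 462 | B26 | 114 | 2.0569 | 228 |
| B05 | 247 | 2.3927 | 494 | B28 | 107 | 2.0294 | 214 |
| B07 | 146 | 2.1644 | 292 | B29 | 189 | 2.2765 | 378 |
| B08 | 389 | 2.5899 | 778 | B60 | 10,104 | 4.0045 | 1,269 |
| B21 | 361 | 2.5575 | 722 | B62 | 1,126 | 3.0515 | 967 |
| B22 | 122 | 2.0864 | 244 | B63 | 116 | 2.0645 | 232 |
| B23 | 1,007 | 3.0030 | 952 | B65 | 1,056 | 3.0237 | 959 |
| B24 | 199 | 2.2989 | 398 | B66 | 301 | 2.4786 | 602 |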
Data Set Division Table of Various Categories
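
The logarithm column of the preceding table is consistent with the base-10 logarithm of each class's patent count, rounded to four decimals. The short check below reproduces a few of those values; it makes no claim about how the target patent counts are derived from them, since that rule is not stated on this page. The variable names are illustrative.

```python
import math

# Patent counts copied from the table for four of the eighteen classes.
patent_counts = {"B01": 751, "B02": 231, "B23": 1007, "B60": 10104}

# Base-10 logarithm of each count, rounded to four decimals as in the table.
log_ratio = {cls: round(math.log10(n), 4) for cls, n in patent_counts.items()}
```

On the log scale, B60 (10,104 patents) sits at about 4.0 while B02 (231 patents) sits at about 2.4, which shows how the transform compresses a gap of more than 40 times in raw counts.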
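| Class | Training | Validation | Test | Class | Training | Validation | Test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B01 | 775 | 91 | 46 | B25 | 712 | 84 | 42 |
| B02 | 393 | 46 | 23 | B26 | 194 | 23 | 11 |
| B05 | 420 | 49 | 25 | B28 | 182 | 21 | 11 |
| B07 | 248 | 29 | 15 | B29 | 321 | 38 | 19 |
| B08 | 661 | 78 | 39 | B60 | 1,079 | 127 | 1,039 |
| B21 | 614 | 72 | 36 | B62 | 822 | 97 | 116 |
| B22 | 208 | 24 | 12 | B63 | 197 | 23 | 12 |
| B23 | 809 | 95 | 102 | B65 | 815 | 96 | 109 |
| B24 | 338 | 40 | 20 | B66 | 512 | 60 | 30 |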
Data Set Division of Various Categories
Logarithmic Graph of Original Data and Training Set Data Distribution
| Parameter | Value |
| --- | --- |
| optimizer | adam |
| num_hidden_layers | 12 |
| learning rate | 2e-5 |
| dropout | 0.1 |
| patience | 10 |
| BERT_embedding | 768 |
| max_position_embedding | 512 |
Hyperparameter Settings
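
For readers who want to reuse these settings, the sketch below collects them in a small configuration object. The class name, the field names, and the `should_stop` helper are hypothetical. Reading `patience` as an early-stopping patience counted in epochs is an assumption; the table gives only the value 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    """Values copied from the hyperparameter table; field names are illustrative."""
    optimizer: str = "adam"
    num_hidden_layers: int = 12
    learning_rate: float = 2e-5
    dropout: float = 0.1
    patience: int = 10
    bert_embedding: int = 768
    max_position_embedding: int = 512


def should_stop(val_scores, patience):
    """Return True once the best validation score is at least `patience` epochs old.

    Assumes higher scores are better and that `patience` counts epochs.
    """
    if not val_scores:
        return False
    best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best_epoch >= patience


config = Hyperparameters()
```

With `patience = 10`, training under this reading would stop after ten consecutive epochs without a new best validation score.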
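| No. | Method | A/% | P/% | R/% | F1/% |
| --- | --- | --- | --- | --- | --- |
| 1 | LSTM+Attention | 80.97 | 69.86 | 65.33 | 68.31 |
| 2 | CNN | 80.86 | 68.52 | 66.63 | 67.56 |
| 3 | BERT+Softmax | 82.37 | 76.31 | 74.73 | 75.51 |
| 4 | Proposed method | 84.43 | 80.78 | 82.01 | 81.23 |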
Comparative Experimental Results of Patent Text Classification
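
The "about 5.7" improvement quoted in the abstract can be recovered from the F1 column of the comparison table. The snippet below is plain arithmetic on the reported figures, with illustrative variable names; the last table row is the proposed method.

```python
# F1 values (%) copied from the comparison table.
f1 = {
    "LSTM+Attention": 68.31,
    "CNN": 67.56,
    "BERT+Softmax": 75.51,
    "Proposed method": 81.23,
}

proposed = f1["Proposed method"]

# Gain of the proposed method over each baseline, in percentage points.
gains = {
    method: round(proposed - score, 2)
    for method, score in f1.items()
    if method != "Proposed method"
}

# The smallest gain is the margin over the strongest baseline.
strongest_baseline = min(gains, key=gains.get)
smallest_gain = gains[strongest_baseline]
```

The smallest margin is over BERT+Softmax, so the abstract's figure is best read as a difference in percentage points against the strongest baseline; the margins over the other two baselines are larger.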
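| Ablation experiment | A/% | P/% | R/% | F1/% |
| --- | --- | --- | --- | --- |
| Ablation 1: BERT+Softmax | 82.37 | 76.31 | 74.73 | 75.51 |
| Ablation 2: fused proper noun features | 83.98 | 77.59 | 79.32 | 78.24 |
| Proposed method | 84.43 | 80.78 | 82.01 | 81.23 |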
Results of Patent Text Classification and Ablation
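| Dataset | Size (documents) | Newly added proper nouns | Average newly extracted words per document |
| --- | --- | --- | --- |
| Training set | 9,300 | 28,450 | 3.06 |
| Validation set | 1,093 | 866 | 0.79 |
| Test set | 1,707 | 2,230 | 1.31 |
| Total | 12,100 | 31,546 | 2.61 |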
Proper Noun Extraction
Specific Example