MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features
Deng Na1, He Xinyang1, Chen Weijie1, Chen Xu2
1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
[Objective] This study addresses the low accuracy of classification models for Traditional Chinese Medicine (TCM) patents, which stems from the complexity of TCM and from insufficient extraction of the characteristic information in TCM patents. [Methods] We propose MPMFC (Medicine Patent Multi-feature Fusion Classifier), a classification model for TCM patents. First, we construct a TCM patent similarity network from the similarity of the patents' core fields. Second, we use the Node2Vec algorithm to capture the latent neighborhood structure information of each patent from the global structure of the similarity network and map it to a low-dimensional vector that serves as a supplementary feature. Finally, an attention mechanism fuses each patent's semantic feature vector, produced by the pre-trained RoBERTa-Tiny model, with its supplementary feature to classify TCM patents automatically. [Results] We evaluated MPMFC on a corpus of 7,000 TCM patents. It achieved accuracy, recall, and F1 values of 0.8436, 0.8017, and 0.8221, respectively, which were 1.58%, 2.59%, and 2.11% higher than those of the baseline classification model. [Limitations] The weights assigned when constructing the TCM patent similarity network are subjective, and patents labeled by non-TCM researchers may contain some classification errors. [Conclusions] MPMFC acquires and learns more comprehensive feature representations from multiple perspectives during TCM patent classification, which improves classification accuracy.
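The three steps in the [Methods] part of the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names, field weights, Jaccard similarity, edge threshold, toy patents, and scoring vector are all hypothetical. Only the Node2Vec-style biased random walk is shown (the skip-gram training that turns walks into embeddings is omitted), and the two vectors passed to the fusion step are placeholders for a RoBERTa-Tiny semantic vector and a Node2Vec structural vector, assumed here to share one dimension.

```python
import math
import random

# Hypothetical core fields and weights (the abstract does not list them).
FIELD_WEIGHTS = {"title": 0.3, "abstract": 0.5, "ipc": 0.2}

def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_similarity_network(patents, weights=FIELD_WEIGHTS, threshold=0.2):
    """Return {id: {neighbor_id: weight}}, keeping edges with weight >= threshold."""
    graph = {pid: {} for pid in patents}
    ids = list(patents)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            # Weighted sum of per-field similarities.
            s = sum(w * jaccard(patents[u][f], patents[v][f])
                    for f, w in weights.items())
            if s >= threshold:
                graph[u][v] = graph[v][u] = s
    return graph

def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """One second-order biased walk: p controls returning, q controls exploring."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(graph[cur])
        if not nbrs:
            break  # isolated node: the walk cannot continue
        prev = walk[-2] if len(walk) > 1 else None
        weights = []
        for x in nbrs:
            if prev is None:
                bias = 1.0          # first step: edge weights only
            elif x == prev:
                bias = 1.0 / p      # step back to the previous node
            elif x in graph[prev]:
                bias = 1.0          # stay close to the previous node
            else:
                bias = 1.0 / q      # move farther away
            weights.append(graph[cur][x] * bias)
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

def attention_fuse(semantic_vec, structural_vec, score_vec):
    """Score both feature vectors, softmax the scores, return the weighted sum."""
    feats = [semantic_vec, structural_vec]
    scores = [sum(a * b for a, b in zip(f, score_vec)) for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    fused = [sum(a * f[i] for a, f in zip(alphas, feats))
             for i in range(len(semantic_vec))]
    return fused, alphas

# Toy data, invented for illustration only.
patents = {
    "P1": {"title": ["ginseng", "decoction"],
           "abstract": ["ginseng", "astragalus", "fatigue"], "ipc": ["A61K36"]},
    "P2": {"title": ["ginseng", "capsule"],
           "abstract": ["ginseng", "astragalus", "immunity"], "ipc": ["A61K36"]},
    "P3": {"title": ["massage", "device"],
           "abstract": ["roller", "motor"], "ipc": ["A61H15"]},
}
graph = build_similarity_network(patents)
walk = node2vec_walk(graph, "P1", length=4)
fused, alphas = attention_fuse([1.0, 0.0], [0.0, 1.0], score_vec=[1.0, 0.0])
```

In a real pipeline the scoring vector would be learned jointly with the classifier, and many walks per node would be fed to a skip-gram model to obtain the structural embeddings.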
Deng Na, He Xinyang, Chen Weijie, Chen Xu. MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features[J]. Data Analysis and Knowledge Discovery, 2023, 7(4): 145-158.