Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (4): 145-158    DOI: 10.11925/infotech.2096-3467.2022.0429
MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features
Deng Na1, He Xinyang1, Chen Weijie1, Chen Xu2
1School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
Abstract  

[Objective] This study addresses the low accuracy of classification models for Traditional Chinese Medicine (TCM) patents, which stems from the complexity of TCM and from insufficient extraction of TCM patent features. [Methods] We propose a TCM patent classification model called MPMFC (Medicine Patent Multi-feature Fusion Classifier). First, we construct a TCM patent similarity network from the similarity of the patents' core fields. Then, we use the Node2Vec algorithm to capture each patent's latent neighborhood structure from the global structure of the similarity network and map it to a low-dimensional vector that serves as a supplementary feature. Finally, an attention mechanism fuses the patent semantic feature vector produced by the pre-trained RoBERTa-Tiny model with the corresponding supplementary feature to classify TCM patents automatically. [Results] We evaluated MPMFC on a corpus of 7,000 TCM patents. It achieved accuracy, recall, and F1 values of 0.8436, 0.8017, and 0.8221, respectively, which are 1.58, 2.59, and 2.11 percentage points higher than the strongest baseline classification model. [Limitations] The weights assigned when constructing the TCM patent similarity network are chosen subjectively. Labels assigned by non-TCM researchers may contain some classification errors. [Conclusions] By learning more comprehensive feature representations from multiple perspectives, the MPMFC model improves the accuracy of TCM patent classification.
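
The Node2Vec step mentioned in the abstract can be illustrated with a minimal sketch of its second-order biased random walk. The toy graph, patent IDs, edge weights, and the values of `p` and `q` below are assumptions made for illustration; the abstract does not report the authors' walk settings or implementation.

```python
import random
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def biased_walk(graph: Graph, start: str, length: int,
                p: float, q: float, rng: random.Random) -> List[str]:
    """Generate one second-order biased random walk in the Node2Vec style."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: sample proportionally to edge weight only.
            weights = [graph[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                w = graph[cur][n]
                if n == prev:            # step back to the previous node
                    weights.append(w / p)
                elif n in graph[prev]:   # stays at distance 1 from prev
                    weights.append(w)
                else:                    # moves outward to distance 2
                    weights.append(w / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy similarity network; patent IDs and weights are made up.
toy = {
    "P1": {"P2": 0.8, "P3": 0.4},
    "P2": {"P1": 0.8, "P3": 0.6},
    "P3": {"P1": 0.4, "P2": 0.6, "P4": 0.5},
    "P4": {"P3": 0.5},
}
walks = [biased_walk(toy, node, 6, p=1.0, q=0.5, rng=random.Random(0))
         for node in toy]
```

In Node2Vec, walks like these are treated as sentences and passed to a skip-gram model, which yields one low-dimensional vector per node. A small `q` favors outward exploration, while a small `p` keeps the walk close to its starting neighborhood.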

Key words: TCM Patent Classification; Patent Similarity Network; Feature Fusion; Pre-Training Model; Node2Vec
Received: 05 May 2022      Published: 07 June 2023
ZTFLH:  G35  
Fund: National Natural Science Foundation of China (61902116)
Corresponding Author: He Xinyang, ORCID: 0000-0002-0668-0276, E-mail: hexinyang@foxmail.com

Cite this article:

Deng Na, He Xinyang, Chen Weijie, Chen Xu. MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features. Data Analysis and Knowledge Discovery, 2023, 7(4): 145-158.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0429     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I4/145

Process of TCM Patent Feature Extraction
RoBERTa-Tiny Structure
Feature Fusion Layer Model Structure
The Overall Process of MPMFC Model
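
The feature fusion layer combines a semantic vector with a structural vector through attention. The page does not give the exact formulation, so the following is only a sketch of one common form: each feature vector is scored against a scoring vector, the scores are normalized with softmax, and the vectors are summed with those weights. The function names, the `query` vector, the equal vector sizes, and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Fuse equally sized feature vectors by a softmax-weighted sum.

    Each vector is scored by its dot product with `query` (a hypothetical
    learned parameter); the scores become weights that sum to 1.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]


semantic = [0.2, -0.1, 0.7, 0.4]    # stand-in for a RoBERTa-Tiny text vector
structural = [0.5, 0.3, -0.2, 0.1]  # stand-in for a Node2Vec node vector
query = [0.1, 0.1, 0.1, 0.1]        # hypothetical scoring vector

fused = attention_fuse([semantic, structural], query)
```

In a trained model the scoring vector would be learned jointly with the classifier, and the two inputs would first be projected to a common dimension; here both are simply assumed to have four components.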
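| Category | Count |
|---|---|
| Health products | 292 |
| Pediatrics | 130 |
| Other | 338 |
| Internal medicine | 1,870 |
| Veterinary medicine | 417 |
| Stomatology | 197 |
| Surgery | 548 |
| Dermatology | 722 |
| Ophthalmology | 66 |
| Otorhinolaryngology | 354 |
| Oncology | 222 |
| Tea processing | 626 |
| Liquor processing | 148 |
| Food processing | 811 |
| Gynecology | 259 |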
Patent Dataset Category
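
The category counts can be checked against the corpus size stated in the abstract, and they also show how unevenly the classes are distributed. The sketch below uses English renderings of the category names; the variable names are illustrative only.

```python
counts = {
    "Health products": 292,
    "Pediatrics": 130,
    "Other": 338,
    "Internal medicine": 1870,
    "Veterinary medicine": 417,
    "Stomatology": 197,
    "Surgery": 548,
    "Dermatology": 722,
    "Ophthalmology": 66,
    "Otorhinolaryngology": 354,
    "Oncology": 222,
    "Tea processing": 626,
    "Liquor processing": 148,
    "Food processing": 811,
    "Gynecology": 259,
}

total = sum(counts.values())               # corpus size
largest = max(counts, key=counts.get)      # most frequent category
smallest = min(counts, key=counts.get)     # least frequent category
shares = {name: n / total for name, n in counts.items()}
imbalance_ratio = counts[largest] / counts[smallest]
```

The total comes to 7,000, matching the abstract. Internal medicine is the largest class and ophthalmology the smallest, a ratio of roughly 28 to 1, which is worth keeping in mind when reading the recall and F1 values.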
| Parameter | Value |
|---|---|
| hidden_size | 768 |
| num_attention_heads | 12 |
| num_hidden_layers | 3 |
| hidden_dropout_prob | 0.1 |
| max_position_embeddings | 512 |
| Vocabulary size | 8,021 |
| Activation function | GELU |
RoBERTa-Tiny Model Parameter Configuration
| Environment | Configuration |
|---|---|
| GPU | NVIDIA GeForce RTX 2080 |
| Memory | 32 GB |
| Operating system | Ubuntu 16.04 64-bit |
| Language | Python 3.6.10 |
| Frameworks | TensorFlow 1.15.0, Keras 2.3.1 |
Experimental Parameter Configuration
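| Model type | Model | Accuracy | Recall | F1 |
|---|---|---|---|---|
| Traditional machine learning | TF-IDF + SVM | 0.6969 | 0.6429 | 0.6308 |
| Deep learning with Word2Vec embeddings | DeepPatent[21] | 0.7434 | 0.6757 | 0.6948 |
| Deep learning with Word2Vec embeddings | FastText[33] | 0.7446 | 0.7297 | 0.7554 |
| Deep learning with Word2Vec embeddings | BiLSTM-ATT-CNN[22] | 0.7651 | 0.7538 | 0.7199 |
| Deep learning with Word2Vec embeddings | BiGRU+ATT+TextCNN[20] | 0.7774 | 0.7447 | 0.7474 |
| Deep learning based on the BERT pre-trained model | BERT[24] | 0.7855 | 0.7489 | 0.7730 |
| Deep learning based on the BERT pre-trained model | BERT + CNN | 0.8034 | 0.7568 | 0.7709 |
| Deep learning based on the BERT pre-trained model | BERT + BiGRU-ATT | 0.8278 | 0.7758 | 0.8010 |
| This paper's model | MPMFC | 0.8436 | 0.8017 | 0.8221 |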
Baseline Model Comparison
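
The improvements quoted in the abstract can be reproduced from the comparison table by subtracting the best baseline score for each metric from the MPMFC score. The sketch below copies the table values; the variable names are illustrative.

```python
# (accuracy, recall, F1) for each model, copied from the comparison table.
results = {
    "TF-IDF + SVM": (0.6969, 0.6429, 0.6308),
    "DeepPatent": (0.7434, 0.6757, 0.6948),
    "FastText": (0.7446, 0.7297, 0.7554),
    "BiLSTM-ATT-CNN": (0.7651, 0.7538, 0.7199),
    "BiGRU+ATT+TextCNN": (0.7774, 0.7447, 0.7474),
    "BERT": (0.7855, 0.7489, 0.7730),
    "BERT + CNN": (0.8034, 0.7568, 0.7709),
    "BERT + BiGRU-ATT": (0.8278, 0.7758, 0.8010),
    "MPMFC": (0.8436, 0.8017, 0.8221),
}

proposed = results["MPMFC"]
baselines = {name: v for name, v in results.items() if name != "MPMFC"}

# Best baseline score per metric.
best = tuple(max(v[i] for v in baselines.values()) for i in range(3))

# Margins in percentage points, rounded to two decimals.
margins = [round((proposed[i] - best[i]) * 100, 2) for i in range(3)]
print(margins)  # [1.58, 2.59, 2.11]
```

All three per-metric maxima come from BERT + BiGRU-ATT, so the reported gains are differences in percentage points over that single model rather than relative percentage increases.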
Ablation Experiment Results
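Weight parameters assigned to each field:

| Network | Main IPC | Title | Abstract |
|---|---|---|---|
| CMPSN (this paper's network) | 0.2 | 0.3 | 0.5 |
| CMPSN2 | 0.2 | 0.5 | 0.3 |
| CMPSN3 | 0.4 | 0.3 | 0.3 |
| CMPSN4 | 0.6 | 0.3 | 0.1 |
| CMPSN5 | 0.8 | 0.1 | 0.1 |
| CMPSN6 | 0.1 | 0.1 | 0.8 |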
Weight Distribution of Different Networks
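
As a rough illustration of how field weights like these could drive network construction, the sketch below combines per-field similarities with the CMPSN weights (0.2 for main IPC, 0.3 for title, 0.5 for abstract) and keeps an edge when the combined score reaches a threshold. The per-field similarity measures (exact match for main IPC, token-level Jaccard for text), the threshold, and the toy patent records are all assumptions; the page does not state which measures the authors used.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# CMPSN row of the weight table.
WEIGHTS = {"main_ipc": 0.2, "title": 0.3, "abstract": 0.5}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; a stand-in similarity measure."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def patent_similarity(x: Dict[str, str], y: Dict[str, str]) -> float:
    """Weighted sum of the three per-field similarities."""
    ipc_sim = 1.0 if x["main_ipc"] == y["main_ipc"] else 0.0
    return (WEIGHTS["main_ipc"] * ipc_sim
            + WEIGHTS["title"] * jaccard(x["title"], y["title"])
            + WEIGHTS["abstract"] * jaccard(x["abstract"], y["abstract"]))


def build_network(patents: Dict[str, Dict[str, str]],
                  threshold: float) -> List[Tuple[str, str, float]]:
    """Return weighted edges for patent pairs at or above the threshold."""
    edges = []
    for (i, x), (j, y) in combinations(patents.items(), 2):
        s = patent_similarity(x, y)
        if s >= threshold:
            edges.append((i, j, s))
    return edges


# Made-up records for illustration only.
patents = {
    "P1": {"main_ipc": "A61K36/00",
           "title": "herbal composition for treating cough",
           "abstract": "a herbal composition containing licorice for relieving cough"},
    "P2": {"main_ipc": "A61K36/00",
           "title": "herbal composition for treating asthma",
           "abstract": "a herbal composition containing ephedra for relieving asthma"},
    "P3": {"main_ipc": "A23F3/00",
           "title": "green tea processing method",
           "abstract": "a method for processing green tea leaves"},
}
edges = build_network(patents, threshold=0.5)  # hypothetical threshold
```

With these toy records only P1 and P2 are linked, with a combined similarity of 0.7. Changing the weights to another row of the table shifts which pairs clear the threshold, which is the kind of sensitivity the alternative networks CMPSN2 to CMPSN6 are set up to probe.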
Hyperparameter Sensitivity Test
[1] Zhao Shuaimei, Song Jiangxiu, Du Maobo, et al. Current Situation and Consideration of Patent Protection in Classical Representative Famous Prescriptions in China[J]. China Journal of Chinese Materia Medica, 2019, 44(18): 4067-4071.
doi: 10.19540/j.cnki.cjcmm.20190629.305 pmid: 31872747
[2] Deng N, Fu H, Chen X. Named Entity Recognition of Traditional Chinese Medicine Patents Based on BiLSTM-CRF[J]. Wireless Communications and Mobile Computing, 2021, 2021: 1-12.
[3] Wang Kai, Xie Xiaoli, Hu Xuan, et al. Patent Information Mining and Drug Law Analysis of Traditional Chinese Medicine for the Prevention and Treatment of Diabetes[J]. Chinese Journal of Library and Information Science for Traditional Chinese Medicine, 2022, 46(6): 8-16.
[4] Liu Xiaoling, Tan Zongying. Clustering Technology Topics Based on Patent Multi-Attribute Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 45-54.
[5] Zhou Cheng, Wei Hongqin. Evaluating and Classifying Patent Values Based on Self-Organizing Maps and Support Vector Machine[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 117-124.
[6] Yang Z C, Yang D Y, Dyer C, et al. Hierarchical Attention Networks for Document Classification[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 1480-1489.
[7] Bao Xiang, Liu Guifeng, Cui Jinghua. Application of Multi Instance Multi Label Learning in Chinese Patent Automatic Classification[J]. Library and Information Service, 2021, 65(8): 107-113.
doi: 10.13266/j.issn.0252-3116.2021.08.011
[8] Fu Chuanchuan, Chen Guohua, Yuan Qinjian. Research on Patent Quality Analysis and Classification Forecast Based on Machine Learning: Taking Blockchain as an Example[J]. Journal of Modern Information, 2021, 41(7): 110-120.
doi: 10.3969/j.issn.1008-0821.2021.07.011
[9] Zheng Yongfeng. Patent Collection of Traditional Chinese Medicine[M]. Beijing: China Press of Traditional Chinese Medicine Co., Ltd, 1994.
[10] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. ACM, 2013: 3111-3119.
[11] Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[OL]. arXiv Preprint, arXiv: 1405.4053.
[12] Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014: 1532-1543.
[13] Peters M, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 2227-2237.
[14] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. ACM, 2017: 6000-6010.
[15] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[16] Liu Hongguang, Ma Shuanggang, Liu Guifeng. A Review of Research on Patent Document Classification Algorithms Based on Machine Learning[J]. Library and Information Studies, 2016, 9(3): 79-86.
[17] Su Jinshu, Zhang Bofeng, Xu Xin. Advances in Machine Learning Based Text Categorization[J]. Journal of Software, 2006, 17(9): 1848-1859.
doi: 10.1360/jos171848
[18] Wu C H, Ken Y, Huang T. Patent Classification System Using a New Hybrid Genetic Algorithm Support Vector Machine[J]. Applied Soft Computing, 2010, 10(4): 1164-1177.
doi: 10.1016/j.asoc.2009.11.033
[19] Liao Liefa, Le Fugang, Zhu Yalan. The Application of LDA Model in Patent Text Classification[J]. Journal of Modern Information, 2017, 37(3): 35-39.
doi: 10.3969/j.issn.1008-0821.2017.03.007
[20] Hu Xuegang, Yang Hengyu, Lin Yaojin, et al. Study on Classification of Patents Collaborative Filtering Oriented to TRIZ[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(5): 512-518.
[21] Lyu Lucheng, Han Tao, Zhou Jian, et al. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. Library and Information Service, 2020, 64(10): 75-85.
doi: 10.13266/j.issn.0252-3116.2020.10.009
[22] Li S B, Hu J, Cui Y X, et al. DeepPatent: Patent Classification with Convolutional Neural Networks and Word Embedding[J]. Scientometrics, 2018, 117(2): 721-744.
doi: 10.1007/s11192-018-2905-5
[23] Ma Jianhong, Wang Ruiyang, Yao Shuang, et al. Patent Classification Method Based on Depth Learning[J]. Computer Engineering, 2018, 44(10): 209-214.
[24] Wen Chaodong, Zeng Cheng, Ren Junwei, et al. Patent Text Classification Based on ALBERT and Bidirectional Gated Recurrent Unit[J]. Journal of Computer Applications, 2021, 41(2): 407-412.
doi: 10.11772/j.issn.1001-9081.2020050730
[25] Tong Xinyu, Zhao Ruijie, Lu Yonghe. Multi-Label Patent Classification with Pre-Training Model[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 129-137.
[26] Lee J S, Hsiang J. Patent Classification by Fine-Tuning BERT Language Model[J]. World Patent Information, 2020, 61: 101965.
doi: 10.1016/j.wpi.2020.101965
[27] Cui Y M, Che W X, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
doi: 10.1109/TASLP.2021.3124365
[28] Grover A, Leskovec J. Node2Vec: Scalable Feature Learning for Networks[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 855-864.
[29] Sun Haisheng. Empirical Research Comparison of Bibliographic Coupling Network and Co-Citation Network: A Case Study of Articles Published in Scientometrics[J]. Journal of Modern Information, 2019, 39(4): 134-142.
doi: 10.3969/j.issn.1008-0821.2019.04.016
[30] Xu L, Zhang X W, Dong Q Q. CLUECorpus2020: A Large-Scale Chinese Corpus for Pre-Training Language Model[OL]. arXiv Preprint, arXiv: 2003.01355.
[31] Liu Y H, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv: 1907.11692.
[32] Miech A, Laptev I, Sivic J. Learnable Pooling with Context Gating for Video Classification[OL]. arXiv Preprint, arXiv: 1706.06905.
[33] Yadrintsev V, Bakarov A, Suvorov R, et al. Fast and Accurate Patent Classification in Search Engines[J]. Journal of Physics Conference Series, 2018, 1117(1): 012004.