Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (4): 145-158     https://doi.org/10.11925/infotech.2096-3467.2022.0429
Research Paper
MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features*
Deng Na 1, He Xinyang 1, Chen Weijie 1, Chen Xu 2
1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
Abstract

[Objective] To address the unsatisfactory classification accuracy for Traditional Chinese Medicine (TCM) patents, which stems from the complexity of TCM itself and from existing patent classification models failing to extract sufficient feature information from TCM patents. [Methods] We propose MPMFC (Medicine Patent Multi-feature Fusion Classifier), a multi-feature fusion classification model for TCM patents. First, we construct a TCM patent similarity network from the similarity information of the core patent fields. Then, we use the Node2Vec algorithm to capture latent neighborhood structure information between patents from the global structure of the similarity network and map it to low-dimensional vectors that serve as supplementary features. Finally, an attention mechanism fuses the patent semantic features produced by the pre-trained RoBERTa-Tiny model with the corresponding supplementary features to classify TCM patents automatically. [Results] On a real-world corpus of 7,000 TCM patents, MPMFC achieved accuracy, recall, and F1 values of 0.8436, 0.8017, and 0.8221, respectively, which are 1.58, 2.59, and 2.11 percentage points higher than the baseline classification model. [Limitations] The weights assigned when constructing the TCM patent similarity network are somewhat subjective, and patents labeled by annotators who are not TCM researchers may contain some classification errors. [Conclusions] During TCM patent classification, the MPMFC model can acquire and learn richer feature representations from multiple perspectives, thereby improving classification accuracy.

Keywords: TCM Patent Classification; Patent Similarity Network; Feature Fusion; Pre-Training Model; Node2Vec
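
The Node2Vec step named in the abstract can be illustrated with a minimal sketch of its second-order biased random walk (Grover and Leskovec, reference [28]). The toy graph, the patent identifiers P1 to P4, the edge weights, and the values of p and q are assumptions made for illustration; they are not taken from the paper. In Node2Vec, walks like these are then fed to a skip-gram model to learn the low-dimensional node vectors.

```python
import random

# Hypothetical toy similarity network: node -> {neighbor: edge weight}.
# Patent IDs and weights are illustrative assumptions, not data from the paper.
GRAPH = {
    "P1": {"P2": 0.8, "P3": 0.4},
    "P2": {"P1": 0.8, "P3": 0.6, "P4": 0.3},
    "P3": {"P1": 0.4, "P2": 0.6},
    "P4": {"P2": 0.3},
}


def node2vec_walk(graph, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one biased random walk as defined by Node2Vec.

    p is the return parameter and q is the in-out parameter.
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: sample proportionally to the edge weights only.
            weights = [graph[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                w = graph[cur][n]
                if n == prev:            # distance 0 from prev: return step
                    weights.append(w / p)
                elif n in graph[prev]:   # distance 1 from prev: stay close
                    weights.append(w)
                else:                    # distance 2 from prev: move outward
                    weights.append(w / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


walk = node2vec_walk(GRAPH, "P1", walk_length=6, p=0.5, q=2.0,
                     rng=random.Random(42))
print(walk)
```

A small p makes the walk more likely to return to the previous node, while a large q keeps it near the previous node's neighborhood, so the two parameters trade off local against more global exploration of the network.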
Received: 2022-05-05      Published: 2023-06-07
ZTFLH:  G35
Funding: *This work is one of the research outcomes of a National Natural Science Foundation of China project (61902116)
Corresponding author: He Xinyang, ORCID: 0000-0002-0668-0276, E-mail: hexinyang@foxmail.com
Cite this article:
Deng Na, He Xinyang, Chen Weijie, Chen Xu. MPMFC: A Traditional Chinese Medicine Patent Classification Model Integrating Network Neighborhood Structural Features and Patent Semantic Features[J]. Data Analysis and Knowledge Discovery, 2023, 7(4): 145-158.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0429      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I4/145
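Fig.1  Brief workflow of TCM patent feature extraction
Fig.2  Structure of the RoBERTa-Tiny model
Fig.3  Structure of the feature fusion layer
Fig.4  Overall workflow of the MPMFC model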
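Category                    Count
Health products             292
Pediatrics                  130
Other                       338
Internal medicine           1,870
Veterinary medicine         417
Stomatology                 197
Surgery                     548
Dermatology                 722
Ophthalmology               66
Otorhinolaryngology         354
Oncology                    222
Tea processing              626
Liquor processing           148
Food processing             811
Gynecology                  259
Table 1  Category distribution of the patent dataset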
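
As a quick consistency check, the short sketch below sums the category counts of Table 1 and reports the largest and smallest classes. The English category labels are translations, and the imbalance ratio is a derived figure that the source does not state.

```python
# Category counts transcribed from Table 1 (labels translated to English).
CATEGORY_COUNTS = {
    "Health products": 292, "Pediatrics": 130, "Other": 338,
    "Internal medicine": 1870, "Veterinary medicine": 417,
    "Stomatology": 197, "Surgery": 548, "Dermatology": 722,
    "Ophthalmology": 66, "Otorhinolaryngology": 354,
    "Oncology": 222, "Tea processing": 626, "Liquor processing": 148,
    "Food processing": 811, "Gynecology": 259,
}

total = sum(CATEGORY_COUNTS.values())
largest = max(CATEGORY_COUNTS.items(), key=lambda kv: kv[1])
smallest = min(CATEGORY_COUNTS.items(), key=lambda kv: kv[1])

print(total)                       # 7000, matching the corpus size in the abstract
print(largest, smallest)
print(round(largest[1] / smallest[1], 1))  # ratio of largest to smallest class
```

The counts add up to the 7,000 patents reported in the abstract, and the distribution is clearly imbalanced across the 15 categories.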
Parameter                   Value
hidden_size                 768
num_attention_heads         12
num_hidden_layers           3
hidden_dropout_prob         0.1
max_position_embeddings     512
Vocabulary size             8,021
Activation function         GeLU
Table 2  Parameter configuration of the RoBERTa-Tiny model
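
For readers who want to reuse these settings, the values in Table 2 can be written as a plain configuration dictionary. This is a sketch only: the key names follow common BERT-style configuration files and are an assumption, since the source does not show the actual configuration file. The per-head dimension is derived from the table, not reported in it.

```python
# Values from Table 2; key names are assumed BERT-style config keys.
ROBERTA_TINY_CONFIG = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 3,
    "hidden_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "vocab_size": 8021,
    "hidden_act": "gelu",
}


def head_dim(config):
    """Return the per-head dimension of multi-head attention."""
    hidden, heads = config["hidden_size"], config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


print(head_dim(ROBERTA_TINY_CONFIG))  # 64
```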
Environment                 Configuration
GPU                         NVIDIA GeForce RTX 2080
Memory                      32 GB
Operating system            Ubuntu 16.04 64-bit
Language                    Python 3.6.10
Runtime environment         TensorFlow 1.15.0, Keras 2.3.1
Table 3  Experimental environment configuration
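Model type                                    Model                   Accuracy  Recall  F1
Traditional machine learning                  TF-IDF + SVM            0.6969    0.6429  0.6308
Deep learning with Word2Vec embeddings        DeepPatent[21]          0.7434    0.6757  0.6948
Deep learning with Word2Vec embeddings        FastText[33]            0.7446    0.7297  0.7554
Deep learning with Word2Vec embeddings        BiLSTM-ATT-CNN[22]      0.7651    0.7538  0.7199
Deep learning with Word2Vec embeddings        BiGRU+ATT+TextCNN[20]   0.7774    0.7447  0.7474
Deep learning with BERT pre-trained models    BERT[24]                0.7855    0.7489  0.7730
Deep learning with BERT pre-trained models    BERT+CNN                0.8034    0.7568  0.7709
Deep learning with BERT pre-trained models    BERT + BiGRU-ATT        0.8278    0.7758  0.8010
Proposed model                                MPMFC                   0.8436    0.8017  0.8221
Table 4  Comparison with baseline models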
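
The improvements quoted in the abstract can be reproduced from Table 4 with simple arithmetic. The sketch below assumes that the reference point is the strongest baseline row (BERT + BiGRU-ATT); the source says only "the baseline classification model", but the differences match this row exactly.

```python
# Scores transcribed from Table 4.
BEST_BASELINE = {"accuracy": 0.8278, "recall": 0.7758, "f1": 0.8010}  # BERT + BiGRU-ATT
MPMFC = {"accuracy": 0.8436, "recall": 0.8017, "f1": 0.8221}

# Absolute gains in percentage points (not relative percent changes).
gains = {
    metric: round((MPMFC[metric] - BEST_BASELINE[metric]) * 100, 2)
    for metric in MPMFC
}
print(gains)  # {'accuracy': 1.58, 'recall': 2.59, 'f1': 2.11}
```

These are absolute differences in percentage points. The relative improvement over the baseline would be a different, slightly larger number for each metric.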
Fig.5  Ablation experiment results
Network name                  Weight parameters
                              Main IPC   Title   Abstract
CMPSN (this paper's network)  0.2        0.3     0.5
CMPSN2                        0.2        0.5     0.3
CMPSN3                        0.4        0.3     0.3
CMPSN4                        0.6        0.3     0.1
CMPSN5                        0.8        0.1     0.1
CMPSN6                        0.1        0.1     0.8
Table 5  Weight allocation of the different networks
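
To make the role of these weights concrete, the sketch below combines three per-field similarity scores into one edge weight. The linear weighted sum and the example similarity values are assumptions made for illustration; this page does not give the exact formula or the per-field similarity measures used in the paper.

```python
# Weight settings from Table 5, ordered as (main IPC, title, abstract).
NETWORK_WEIGHTS = {
    "CMPSN": (0.2, 0.3, 0.5),
    "CMPSN2": (0.2, 0.5, 0.3),
    "CMPSN3": (0.4, 0.3, 0.3),
    "CMPSN4": (0.6, 0.3, 0.1),
    "CMPSN5": (0.8, 0.1, 0.1),
    "CMPSN6": (0.1, 0.1, 0.8),
}


def combined_similarity(ipc_sim, title_sim, abstract_sim, weights):
    """Assumed linear combination of per-field similarities in [0, 1]."""
    w_ipc, w_title, w_abstract = weights
    return w_ipc * ipc_sim + w_title * title_sim + w_abstract * abstract_sim


# Every weight setting in Table 5 sums to 1.
for name, weights in NETWORK_WEIGHTS.items():
    assert abs(sum(weights) - 1.0) < 1e-9, name

# Hypothetical pair of patents: same main IPC, moderately similar text.
score = combined_similarity(1.0, 0.4, 0.6, NETWORK_WEIGHTS["CMPSN"])
print(round(score, 2))  # 0.62
```

Under this assumed formula, the CMPSN setting gives the abstract field the largest influence on whether two patents end up strongly connected, whereas CMPSN5 lets the main IPC code dominate.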
Fig.6  Hyperparameter sensitivity experiment
References
[1] Zhao Shuaimei, Song Jiangxiu, Du Maobo, et al. Current Situation and Consideration of Patent Protection in Classical Representative Famous Prescriptions in China[J]. China Journal of Chinese Materia Medica, 2019, 44(18): 4067-4071. (in Chinese)
doi: 10.19540/j.cnki.cjcmm.20190629.305 pmid: 31872747
[2] Deng N, Fu H, Chen X. Named Entity Recognition of Traditional Chinese Medicine Patents Based on BiLSTM-CRF[J]. Wireless Communications and Mobile Computing, 2021, 2021: 1-12.
[3] Wang Kai, Xie Xiaoli, Hu Xuan, et al. Patent Information Mining and Drug Law Analysis of Traditional Chinese Medicine for the Prevention and Treatment of Diabetes[J]. Chinese Journal of Library and Information Science for Traditional Chinese Medicine, 2022, 46(6): 8-16. (in Chinese)
[4] Liu Xiaoling, Tan Zongying. Clustering Technology Topics Based on Patent Multi-Attribute Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 45-54. (in Chinese)
[5] Zhou Cheng, Wei Hongqin. Evaluating and Classifying Patent Values Based on Self-Organizing Maps and Support Vector Machine[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 117-124. (in Chinese)
[6] Yang Z C, Yang D Y, Dyer C, et al. Hierarchical Attention Networks for Document Classification[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 1480-1489.
[7] Bao Xiang, Liu Guifeng, Cui Jinghua. Application of Multi Instance Multi Label Learning in Chinese Patent Automatic Classification[J]. Library and Information Service, 2021, 65(8): 107-113. (in Chinese)
doi: 10.13266/j.issn.0252-3116.2021.08.011
[8] Fu Chuanchuan, Chen Guohua, Yuan Qinjian. Research on Patent Quality Analysis and Classification Forecast Based on Machine Learning: Taking Blockchain as an Example[J]. Journal of Modern Information, 2021, 41(7): 110-120. (in Chinese)
doi: 10.3969/j.issn.1008-0821.2021.07.011
[9] Zheng Yongfeng. Patent Collection of Traditional Chinese Medicine[M]. Beijing: China Press of Traditional Chinese Medicine Co., Ltd, 1994. (in Chinese)
[10] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. ACM, 2013: 3111-3119.
[11] Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[OL]. arXiv Preprint, arXiv: 1405.4053.
[12] Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014: 1532-1543.
[13] Peters M, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 2227-2237.
[14] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. ACM, 2017: 6000-6010.
[15] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[16] Liu Hongguang, Ma Shuanggang, Liu Guifeng. A Review of Research on Patent Document Classification Algorithms Based on Machine Learning[J]. Library and Information Studies, 2016, 9(3): 79-86. (in Chinese)
[17] Su Jinshu, Zhang Bofeng, Xu Xin. Advances in Machine Learning Based Text Categorization[J]. Journal of Software, 2006, 17(9): 1848-1859. (in Chinese)
doi: 10.1360/jos171848
[18] Wu C H, Ken Y, Huang T. Patent Classification System Using a New Hybrid Genetic Algorithm Support Vector Machine[J]. Applied Soft Computing, 2010, 10(4): 1164-1177.
doi: 10.1016/j.asoc.2009.11.033
[19] Liao Liefa, Le Fugang, Zhu Yalan. The Application of LDA Model in Patent Text Classification[J]. Journal of Modern Information, 2017, 37(3): 35-39. (in Chinese)
doi: 10.3969/j.issn.1008-0821.2017.03.007
[20] Hu Xuegang, Yang Hengyu, Lin Yaojin, et al. Study on Classification of Patents Collaborative Filtering Oriented to TRIZ[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(5): 512-518. (in Chinese)
[21] Lyu Lucheng, Han Tao, Zhou Jian, et al. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. Library and Information Service, 2020, 64(10): 75-85. (in Chinese)
doi: 10.13266/j.issn.0252-3116.2020.10.009
[22] Li S B, Hu J, Cui Y X, et al. DeepPatent: Patent Classification with Convolutional Neural Networks and Word Embedding[J]. Scientometrics, 2018, 117(2): 721-744.
doi: 10.1007/s11192-018-2905-5
[23] Ma Jianhong, Wang Ruiyang, Yao Shuang, et al. Patent Classification Method Based on Depth Learning[J]. Computer Engineering, 2018, 44(10): 209-214. (in Chinese)
[24] Wen Chaodong, Zeng Cheng, Ren Junwei, et al. Patent Text Classification Based on ALBERT and Bidirectional Gated Recurrent Unit[J]. Journal of Computer Applications, 2021, 41(2): 407-412. (in Chinese)
doi: 10.11772/j.issn.1001-9081.2020050730
[25] Tong Xinyu, Zhao Ruijie, Lu Yonghe. Multi-Label Patent Classification with Pre-Training Model[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 129-137. (in Chinese)
[26] Lee J S, Hsiang J. Patent Classification by Fine-Tuning BERT Language Model[J]. World Patent Information, 2020, 61: 101965.
doi: 10.1016/j.wpi.2020.101965
[27] Cui Y M, Che W X, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
doi: 10.1109/TASLP.2021.3124365
[28] Grover A, Leskovec J. Node2Vec: Scalable Feature Learning for Networks[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 855-864.
[29] Sun Haisheng. Empirical Research Comparison of Bibliographic Coupling Network and Co-Citation Network: A Case Study of Articles Published in Scientometrics[J]. Journal of Modern Information, 2019, 39(4): 134-142. (in Chinese)
doi: 10.3969/j.issn.1008-0821.2019.04.016
[30] Xu L, Zhang X W, Dong Q Q. CLUECorpus2020: A Large-Scale Chinese Corpus for Pre-Training Language Model[OL]. arXiv Preprint, arXiv: 2003.01355.
[31] Liu Y H, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv: 1907.11692.
[32] Miech A, Laptev I, Sivic J. Learnable Pooling with Context Gating for Video Classification[OL]. arXiv Preprint, arXiv: 1706.06905.
[33] Yadrintsev V, Bakarov A, Suvorov R, et al. Fast and Accurate Patent Classification in Search Engines[J]. Journal of Physics Conference Series, 2018, 1117(1): 012004.
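Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn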