Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (9): 75-84
Classification Model for Medical Entity Relations with Convolutional Neural Network
Fan Shaoping1, Zhao Yuxuan2, An Xinying1, Wu Qingqiang3
1Institute of Medical Information / Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020, China
2School of Finance, Central University of Finance and Economics, Beijing 102206, China
3School of Informatics, Xiamen University, Xiamen 361005, China
[Objective] This paper proposes a convolutional neural network (CNN) model for entity relation classification that fuses multiple feature embeddings, aiming to improve classification performance and reduce the complexity of feature computation. [Methods] Drawing on the main embedding features used in existing studies, the model combines position features and lexical-level features with word representations, and we describe how each feature is computed and represented. These features require no complex algorithms and improve the model's performance. [Results] On the biomedical corpora AIMed, GENIA and ChemProt, the model achieved F1 scores of 0.7342, 0.9764 and 0.8900, respectively, which is the current best performance on GENIA and ChemProt. [Limitations] The model does not yet incorporate domain-specific features such as prior biomedical knowledge. [Conclusions] The proposed model classifies entity relations effectively and can inform research on relation extraction and knowledge base construction in the biomedical field.

Key words: Relation Classification; Convolutional Neural Network; Position Features; Lexical Features
Received: 2021-01-07      Published: 2021-06-29
Chinese Library Classification (CLC) number: G350
Corresponding author: Wu Qingqiang     E-mail:
Fan Shaoping, Zhao Yuxuan, An Xinying, Wu Qingqiang. Classification Model for Medical Entity Relations with Convolutional Neural Network[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 75-84.
Sentence | Entity e1 | Entity e2 | Relation
<e1>1,25D</e1> inhibited <e2>MYC gene</e2> expression and accelerated its protein turnover | 1,25D | MYC gene | inhibit (e1, e2)
Table 1  Example of entity relation classification
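The annotation format in Table 1 can be read mechanically. The following is a minimal sketch, assuming entities are always wrapped in `<e1>…</e1>` and `<e2>…</e2>` as in the example row; the function name `parse_tagged` is hypothetical and is not taken from the paper.

```python
import re

# Matches <e1>...</e1> or <e2>...</e2>; the backreference keeps open/close tags paired.
TAG = re.compile(r"<(e[12])>(.*?)</\1>")

def parse_tagged(sentence):
    """Return the untagged sentence and a dict mapping 'e1'/'e2' to their mentions."""
    entities = {name: mention for name, mention in TAG.findall(sentence)}
    plain = TAG.sub(lambda m: m.group(2), sentence)
    return plain, entities

example = ("<e1>1,25D</e1> inhibited <e2>MYC gene</e2> expression "
           "and accelerated its protein turnover")
plain, entities = parse_tagged(example)
print(plain)
print(entities)
```

A classifier would then receive the plain sentence together with the two mentions and predict a label such as the relation in the last column of Table 1.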
Sentence | Entity e1 | Entity e2
Demethylation experiments further confirmed that loss of <e1>ALX4</e1> expression was regulated by <e2>CpG island</e2> hypermethylation. | ALX4 | CpG
Table 2  Example of lexical-level features
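Table 2 lists only the entity columns here, so the exact lexical-level feature set used in the paper cannot be read off this page. As a hedged illustration, the sketch below assumes a common choice: each entity mention plus its immediate left and right neighbouring tokens. The function `lexical_features`, the `<PAD>` placeholder and the punctuation stripping are all assumptions made for this example.

```python
import re

ENTITY = re.compile(r"<(e[12])>(.*?)</\1>")

def _clean(token):
    # Drop entity tags (if the neighbour is itself an entity) and trailing punctuation.
    return ENTITY.sub(r"\2", token).strip(".,;:!?")

def lexical_features(tagged, pad="<PAD>"):
    """Assumed feature set: mention, left neighbour, right neighbour for e1 and e2."""
    # Keep each tagged entity as one unit; split everything else on whitespace.
    pieces = re.findall(r"<e[12]>.*?</e[12]>|\S+", tagged)
    feats = {}
    for i, piece in enumerate(pieces):
        m = ENTITY.fullmatch(piece)
        if m is None:
            continue
        name, mention = m.groups()
        left = pieces[i - 1] if i > 0 else pad
        right = pieces[i + 1] if i + 1 < len(pieces) else pad
        feats[name] = {"mention": mention, "left": _clean(left), "right": _clean(right)}
    return feats

sentence = ("Demethylation experiments further confirmed that loss of "
            "<e1>ALX4</e1> expression was regulated by "
            "<e2>CpG island</e2> hypermethylation.")
feats = lexical_features(sentence)
print(feats)
```

In an embedding-based model, each of these tokens would be mapped to its word vector and the vectors concatenated into a single lexical-level feature vector.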
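Fig.1  Schematic of the layer structure of the convolutional neural network
Fig.2  CNN structure with word representation only
Fig.3  CNN structure with position features added
Fig.4  CNN structure with lexical-level features added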
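Fig.3 adds position features to the word representation. The figure is not reproduced on this page, so the following is only a sketch of one widely used encoding, assuming each token is described by its relative distance to e1 and to e2, clipped to a fixed window and shifted to a non-negative index for an embedding lookup. The function name, the bound `max_dist` and the use of each entity's first token as its position are illustrative assumptions, not details confirmed by the paper.

```python
def position_features(n_tokens, e1_index, e2_index, max_dist=30):
    """Return one (index_to_e1, index_to_e2) pair per token.

    Each index is the clipped relative distance plus max_dist, so it lies in
    the range 0..2*max_dist and can address a position-embedding table.
    """
    def encode(i, anchor):
        distance = max(-max_dist, min(max_dist, i - anchor))
        return distance + max_dist

    return [(encode(i, e1_index), encode(i, e2_index)) for i in range(n_tokens)]

tokens = "1,25D inhibited MYC gene expression".split()
# e1 starts at token 0 ("1,25D"); e2 starts at token 2 ("MYC").
features = position_features(len(tokens), e1_index=0, e2_index=2, max_dist=3)
print(list(zip(tokens, features)))
```

With `max_dist=3`, the last token is four positions from e1, so its distance is clipped to 3 before shifting. Clipping keeps the position-embedding table small regardless of sentence length.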
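Corpus | Relation | Relation sentences | Training set | Test set
AIMed[35] | False | 4,834 | 4,861 | 973
 | True | 1,000 | |
GENIA[36] | Protein-Component | 1,302 | 1,547 | 310
 | Subunit-Complex | 555 | |
ChemProt[37] | Activator | 2,571 | 5,363 | 1,073
 | Indirect-Downregulator | 446 | |
 | Indirect-Upregulator | 3,225 | |
 | Inhibitor | 194 | |
Table 3  Corpus sizes and relation distribution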
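The counts in Table 3 can be cross-checked with simple arithmetic: for each corpus, the per-relation sentence counts should sum to the training set plus the test set. The sketch below uses only the numbers from the table; the dictionary layout and the function name `check_split` are assumptions made for illustration.

```python
# Per-relation sentence counts and split sizes copied from Table 3.
corpora = {
    "AIMed": {"relations": [4834, 1000], "train": 4861, "test": 973},
    "GENIA": {"relations": [1302, 555], "train": 1547, "test": 310},
    "ChemProt": {"relations": [2571, 446, 3225, 194], "train": 5363, "test": 1073},
}

def check_split(stats):
    """Return (total sentences, whether train + test equals the total, test share)."""
    total = sum(stats["relations"])
    consistent = total == stats["train"] + stats["test"]
    return total, consistent, stats["test"] / total

report = {name: check_split(stats) for name, stats in corpora.items()}
for name, (total, consistent, share) in report.items():
    print(f"{name}: total={total}, consistent={consistent}, test share={share:.3f}")
```

The totals agree for all three corpora, and each test set is roughly one sixth of its corpus, which suggests a split of about 5:1 between training and test data.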
Corpus | Model structure | Accuracy | F1
AIMed | CNN + Word Representation + Position Features + Lexical Features | 0.8561 | 0.7342
GENIA | CNN + Word Representation + Position Features + Lexical Features | 0.9806 | 0.9764
ChemProt | CNN + Word Representation + Position Features + Lexical Features | 0.9236 | 0.8900
Table 4  Accuracy and F1 of the proposed model for semantic relation classification on the AIMed, GENIA and ChemProt corpora
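Table 4 reports accuracy and F1 side by side. This page does not state how F1 is averaged over relation classes, so the sketch below assumes macro-averaging (the unweighted mean of per-class F1) purely for illustration; the function name and the toy labels are hypothetical and are not results from the paper.

```python
from collections import Counter

def accuracy_and_macro_f1(gold, pred):
    """Compute accuracy and macro-averaged F1 (assumed averaging scheme)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    accuracy = sum(tp.values()) / len(gold)
    return accuracy, sum(f1_scores) / len(f1_scores)

# Toy labels in the style of the AIMed True/False relation.
gold = ["True", "False", "False", "True"]
pred = ["True", "False", "True", "True"]
accuracy, macro_f1 = accuracy_and_macro_f1(gold, pred)
print(f"accuracy={accuracy:.4f}, macro F1={macro_f1:.4f}")
```

On imbalanced corpora such as AIMed, where the False class dominates, accuracy and F1 can differ noticeably, which is consistent with the gap between the two columns in Table 4.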
Fig.5  Relation classification performance of different CNN models on the AIMed, GENIA and ChemProt corpora
Corpus | Model | F1
AIMed | Proposed model | 0.7342
 | Zhang et al.[22] (Word, Position, SDP) | 0.6170
 | Peng et al.[19] (Word, Position, POS, Chunk, Dependency Information) | 0.6350
 | Chang et al.[38] (Convolution Tree Kernel) | 0.5670
 | Hsieh et al.[41] (LSTMpre) | 0.7690
 | Yadav et al.[42] (Att-sdpLSTM) | 0.9329
GENIA | Proposed model | 0.9764
 | Ramesh et al.[40] (SVM + CFR) | 0.7610
ChemProt | Proposed model | 0.8900
 | Corbett et al.[13] (RNNs + Word) | 0.6151
 | Lim et al.[43] (Tree-LSTM: Position + Syntactic Parse Tree) | 0.6410
 | Beltagy et al.[44] (SciBERT) | 0.8364
Table 5  Performance comparison between the proposed model and other relation extraction/classification models
