Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (11): 1-14    DOI: 10.11925/infotech.2096-3467.2020.0681
Discovering Subject Knowledge in Life and Medical Sciences with Knowledge Graph
Hu Zhengyin1,2, Liu Leilei1,2, Dai Bing1,2, Qin Xiaochu3,4
1Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu 610041, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510700, China
4Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou 510530, China
[Objective] This paper explores new methods for deep subject knowledge discovery from multi-source heterogeneous data. [Methods] First, we constructed an SPO (subject-predicate-object) semantic network from the literature to form the core domain knowledge graph. Then, we fused multi-source heterogeneous data through entity alignment, concept-level fusion, and relationship fusion to obtain the full domain knowledge graph. Finally, we used this knowledge graph to discover deep subject knowledge. We tested the method on data on Hematopoietic Stem Cell for Cancer Treatment (HSCCT). [Results] We propose a knowledge graph-based framework for subject knowledge discovery (KGSKD), which fuses multi-source heterogeneous data along multiple dimensions and at a fine granularity, enriches the semantic relationships among data, and natively supports knowledge discovery techniques such as knowledge inference, pathfinder, and link prediction. [Limitations] KGSKD's limitations include data supersaturation, poor interpretability of the discovery results, and difficulty in communicating with domain experts. [Conclusions] KGSKD offers richer data types, more comprehensive knowledge linkage, more advanced mining methods, and deeper discovery results, and thus effectively supports research and services for deep knowledge discovery in life sciences and medicine.

Keywords: Subject Knowledge Discovery; Knowledge Graph; SPO Triples; Data Fusion; Entity Alignment
Received: 13 July 2020      Published: 04 December 2020
ZTFLH:  G251  
Corresponding Author: Hu Zhengyin

Hu Zhengyin, Liu Leilei, Dai Bing, Qin Xiaochu. Discovering Subject Knowledge in Life and Medical Sciences with Knowledge Graph. Data Analysis and Knowledge Discovery, 2020, 4(11): 1-14.

A Diagram of Closed Discovery and Open Discovery [17]
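The two discovery modes named in this caption are usually described with an A-B-C model: closed discovery starts from a known A and a known C and looks for the intermediate B terms that connect them, while open discovery starts from A alone and looks for C terms reachable through some B. The following is a minimal sketch of that idea on an undirected toy network. The node names and the helper names (`build_neighbors`, `closed_discovery`, `open_discovery`) are hypothetical placeholders, not taken from the paper.

```python
from collections import defaultdict

# Toy links between terms. Every term here is a hypothetical placeholder.
LINKS = [
    ("A", "B1"), ("A", "B2"),
    ("B1", "C1"), ("B2", "C1"), ("B2", "C2"),
]


def build_neighbors(links):
    """Return an undirected adjacency map: term -> set of linked terms."""
    neighbors = defaultdict(set)
    for left, right in links:
        neighbors[left].add(right)
        neighbors[right].add(left)
    return neighbors


def closed_discovery(neighbors, a, c):
    """Closed discovery: A and C are both given; return the shared B terms."""
    return neighbors[a] & neighbors[c]


def open_discovery(neighbors, a):
    """Open discovery: only A is given.

    Return a map from each candidate C (reachable through some B and not
    already linked to A) to the set of B terms that connect it to A.
    """
    candidates = defaultdict(set)
    for b in neighbors[a]:
        for c in neighbors[b]:
            if c != a and c not in neighbors[a]:
                candidates[c].add(b)
    return dict(candidates)


if __name__ == "__main__":
    graph = build_neighbors(LINKS)
    print(closed_discovery(graph, "A", "C1"))
    print(open_discovery(graph, "A"))
```

In this sketch a candidate C is excluded when it is already directly linked to A, which is one common way to restrict open discovery to connections that are not yet explicit.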
No. | Subject | Subject semantic type | Predicate | Object | Object semantic type
1 | Hemofiltration | Therapeutic or Preventive Procedure | TREATS | Patients | Human
2 | Digoxin overdose | Injury or Poisoning | PROCESS_OF | Patients | Human
3 | Hyperkalemia | Pathologic Function | COMPLICATES | Digoxin overdose | Injury or Poisoning
4 | Hemofiltration | Therapeutic or Preventive Procedure | TREATS (INFER) | Digoxin overdose | Injury or Poisoning
Samples of SPO Triples
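To make the triple structure concrete, the four sample triples can be held as typed records and grouped into a small directed semantic network. This is only an illustrative sketch: the `Triple` type and the `build_network` and `predicates_between` helpers are assumed names, and the paper's own storage format is not specified here.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One SPO triple with the semantic types of its subject and object."""
    subject: str
    subject_type: str
    predicate: str
    obj: str
    object_type: str


# The four sample triples from the table, copied as given.
TRIPLES = [
    Triple("Hemofiltration", "Therapeutic or Preventive Procedure",
           "TREATS", "Patients", "Human"),
    Triple("Digoxin overdose", "Injury or Poisoning",
           "PROCESS_OF", "Patients", "Human"),
    Triple("Hyperkalemia", "Pathologic Function",
           "COMPLICATES", "Digoxin overdose", "Injury or Poisoning"),
    Triple("Hemofiltration", "Therapeutic or Preventive Procedure",
           "TREATS (INFER)", "Digoxin overdose", "Injury or Poisoning"),
]


def build_network(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    network = defaultdict(list)
    for triple in triples:
        network[triple.subject].append((triple.predicate, triple.obj))
    return dict(network)


def predicates_between(triples, subject, obj):
    """Return the sorted predicates that link a given subject to a given object."""
    return sorted({t.predicate for t in triples
                   if t.subject == subject and t.obj == obj})


if __name__ == "__main__":
    for subject, edges in build_network(TRIPLES).items():
        print(subject, "->", edges)
```

Keeping the semantic types on each record means the same list can later be filtered by type, for example to select only triples whose object is of type Human.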
Framework of KGSKD
A Sample of SPO Semantic Network [22]
No. | Mapping type | Source knowledge entity (term) | Target CUI* | Target concept name | Target semantic type (STY)
1 | One-to-one | Abnormality of neutrophils | C0427515 | Neutrophil abnormality | Finding
2 | Many-to-one | Central Nervous System Neoplasms | C0085136 | Central Nervous System Neoplasms | Neoplastic Process
3 | One-to-many | RUNX1 | C1335654 | RUNX1 gene | Gene or Genome
3 | One-to-many | RUNX1 | C1435548 | RUNX1 protein, human | Amino Acid, Peptide, or Protein
4 | One-to-none | Conjunctival icterus | (none) | (none) | (none)
Mapping Types of Knowledge Entities to UMLS [30,31]
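The four mapping types can be told apart mechanically once each source term has a list of candidate UMLS concept identifiers (CUIs). The sketch below assumes working definitions that the table only implies: a term with no CUI is one-to-none, a term with several CUIs is one-to-many, a term whose single CUI is shared with another term is many-to-one, and anything else is one-to-one. The function name `classify_mappings` and the extra synonym "CNS Neoplasms" are hypothetical, added only so the many-to-one case has two source terms.

```python
from collections import Counter

# Source term -> candidate UMLS CUIs. CUIs are taken from the table;
# "CNS Neoplasms" is a hypothetical synonym used for illustration only.
TERM_TO_CUIS = {
    "Abnormality of neutrophils": ["C0427515"],
    "Central Nervous System Neoplasms": ["C0085136"],
    "CNS Neoplasms": ["C0085136"],  # hypothetical second source term
    "RUNX1": ["C1335654", "C1435548"],
    "Conjunctival icterus": [],
}


def classify_mappings(term_to_cuis):
    """Label each source term with its mapping type (assumed definitions)."""
    # Count how many distinct source terms point at each CUI.
    usage = Counter(cui for cuis in term_to_cuis.values() for cui in set(cuis))
    labels = {}
    for term, cuis in term_to_cuis.items():
        distinct = set(cuis)
        if not distinct:
            labels[term] = "one-to-none"
        elif len(distinct) > 1:
            labels[term] = "one-to-many"
        elif usage[cuis[0]] > 1:
            labels[term] = "many-to-one"
        else:
            labels[term] = "one-to-one"
    return labels


if __name__ == "__main__":
    for term, label in classify_mappings(TERM_TO_CUIS).items():
        print(f"{label:12s} {term}")
```

A classification like this is mainly useful for routing: one-to-one terms can be aligned automatically, while one-to-many and one-to-none terms would typically need disambiguation or manual review.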
The Knowledge Graph-based Knowledge Discovery Techniques
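Link prediction is one of the techniques the abstract lists. As a hedged illustration of the general idea, and not of the specific algorithm used in the paper, the sketch below scores each unlinked node pair by the Jaccard similarity of their neighbor sets, a standard neighborhood-based heuristic. The graph, node names, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical undirected toy graph: node -> set of neighbors.
ADJACENCY = {
    "n1": {"n2", "n3"},
    "n2": {"n1", "n3", "n4"},
    "n3": {"n1", "n2", "n4"},
    "n4": {"n2", "n3"},
}


def jaccard(adjacency, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0
    return len(adjacency[u] & adjacency[v]) / len(union)


def rank_missing_links(adjacency):
    """Score every currently unlinked pair; highest score first."""
    scores = []
    for u, v in combinations(sorted(adjacency), 2):
        if v not in adjacency[u]:
            scores.append((jaccard(adjacency, u, v), u, v))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    for score, u, v in rank_missing_links(ADJACENCY):
        print(f"{u} - {v}: {score:.2f}")
```

Pairs with a high score share most of their neighbors, so they are treated as likely candidates for a link that is missing from the graph.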
Type: Papers
Database: PubMed
Search strategy: (((((((stem cells) OR stem cell)) AND (((((stem cellulose) OR stem. Cellular) OR cello) OR cellar) OR cellphone))) OR ((((((((((((ESC) OR ASC) OR iPS) OR PGC) OR MSC) OR CSC) OR LSC) OR TSC) OR ADSC) OR HSC)) near ((cell) OR cells)))) AND ((Hematopoiet*) AND stem cell*)
Number of records: 24,051 articles

Type: Patents
Database: Derwent
Search strategy:
((((ALLD=(("stem cells" OR "stem cell") NOT ("stem cellulose" or "stem. Cellular" or "cello" or "cellar" or "cellphone")) OR ALLD=((ESC or ASC or iPS or PGC or MSC or CSC or LSC or TSC or ADSC or HSC) near (cells OR cell)) OR ALLD=(("totipotent" or "pluripotent" or "multipotent" or "unipotent" or "progenitor" or "precursor") ADJ (cells OR cell)) OR ALLD=("tissue engineer*" OR "tissue scaffolding " OR "tissue regenerat*of regenerative medicine" OR "tissue expansion of regenerative medicine" OR "tissue therapy of regenerative medicine" OR "tissue culture of regenerative medicine" OR "tissue construction of regenerative medicine" OR "biological material*" OR "animal seed cells") OR ABD=(("skin" OR "cartilage" OR "bone" OR "tendon" OR "myocardiac" OR "cardiac" OR "vascular" OR "nerve" OR "cornea" OR "dental" OR "periodontal") ADJ ("tissue engineer*" or "regenerat*")) OR ALLD=("tissue engineer*" AND biomaterial*) OR SSTO=("regenerative medicine") OR ICR=("C12N0050735" OR "C12N005074" OR "C12N0050789" OR "C12N0050797" OR "C12N005095")) NOT ALLD=("seed*" or "herbicide insect hybrid" or "hybrid" or "root bud seeding" or "hybrid corn " or "plant tissue seed") NOT ALLD=(("fuel cell" or "in-plane switching" or "Intrusion Prevention System") NOT (("non-pluripotent") ADJ (CELL*))) NOT ICR=(H or D or E or F or A01B or A01C or A01H or A01G or A21 or A22 or A23 or A46 or A24 or A47 or A63 or A62 or A44 or A45 or C02 or C03C or C05or OR C06 or C10 or C21 or C07B or C07C or C07D or C07F or C07J))) AND (CC=((WO OR US OR EP OR JP)))) AND (ALLD=(Hematopoiet* and stem cell*));
Number of records: 3,986 patents
Search Strategies and Results for HSC Literature
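Semantic type (identifier) | Description (translated from Chinese)
Chemicals_Drug | Chemicals and drugs
Disorder | Disease
Genes_Molecular_Sequence | Genes and molecular sequences
Phenotype | Phenotype
Mutation | Mutation
Hallmark | Cancer hallmark
Phenomena | Phenomena
Procedure | Procedures and activities
Device | Device
Physiology | Physiology
Concepts (including gene, cell, virus, etc.) | Concepts (including genes, cells, viruses, etc.)
Living_Being | Living beings
PN | Patent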
Semantic Types of HSCCT Knowledge Entities [30]
Semantic relation object | Semantic group | Semantic relationship
 | Interaction | ASSOCIATED_WITH(mutation_to_disease, mutation_to_phenotype, gene_to_mutation, gene_to_disease, gene_to_phenotype, gene Related);
 | Co-occurrence | cooccurrence
 | Membership | belong_to_PMID
Semantic Relations in HSCCT Knowledge Graph [30]
LinkPaths Between Vaccines and Placental Growth Factor
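Finding link paths between two entities amounts to a path search over the knowledge graph. The sketch below uses a plain breadth-first search to return one shortest path, which is a simple stand-in and not necessarily the path-finding method used in the paper. The two endpoints are named after the caption. The intermediate nodes `X1`, `X2`, and `X3` are hypothetical placeholders with no biomedical meaning, and the function names are assumptions.

```python
from collections import deque

# Hypothetical toy edges. X1, X2 and X3 are placeholders, not real findings.
EDGES = [
    ("Vaccines", "X1"), ("X1", "X2"), ("X2", "Placental Growth Factor"),
    ("Vaccines", "X3"), ("X3", "Placental Growth Factor"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; return one shortest node path or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        # Sorted iteration keeps the result deterministic.
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(shortest_path(graph, "Vaccines", "Placental Growth Factor"))
```

In a real knowledge graph the edges would carry predicates and semantic types, so a path could also be filtered or ranked by the relations along it.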
