Research on Semantic Similarity Estimation Algorithm of Medical Terminology Based on Medical Ontology<sup>*</sup>

Fan Xuexue<sup>1</sup>, Wang Zhirong<sup>1</sup>, Xu Wu<sup>1</sup>, Liang Yin<sup>2</sup>, Ma Xiaohu<sup>3</sup>
<sup>1</sup>(Clinical Medical School, Xuzhou Medical College, Xuzhou 221004, China)
<sup>2</sup>(School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China)
<sup>3</sup>(School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
Abstract

[Objective] This paper proposes a new algorithm, built on comprehensive medical ontologies, to improve the precision of semantic similarity estimation for medical terminology. [Methods] Semantic parameters such as depth and distance are extracted from the concept hierarchy and semantic relationships of SNOMED CT and MeSH. A depth factor and a distance factor, each weighted by concept density, are then derived and combined into a semantic similarity function. [Results] The algorithm is applicable to both of these structurally distinct medical ontologies, and the experimental results show that it correlates more strongly with human scoring than conventional algorithms do. [Limitations] The algorithm depends on the hierarchical structure of the ontology. [Conclusions] The new algorithm improves the precision of semantic similarity estimation for medical terminology.

Keywords: Semantic similarity; Medical terminology; Medical ontology; SNOMED CT; MeSH
1 Introduction

2 Related Work

3 Similarity Algorithm Based on Medical Ontologies
3.1 SNOMED CT and MeSH

3.2 Ontology-Based Similarity Computation

 (1)

Figure 1 Ontology hierarchy structure

 (2)

 (3)

 (4)

 (5)

 (6)

 (7)

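3.3 Description of the Similarity Algorithm

① Check whether concepts c1 and c2 exist in the ontology. If both exist, go to step ②; otherwise stop and report the problem.

② Find the n nearest common ancestor nodes (LCS) of c1 and c2, and obtain the largest of the shortest distances, max[path(c1, c2)], and the maximum depth of the LCS nodes, max[dept(lcs)].

③ If max[path(c1, c2)] = 0, set sim = 4, meaning the two concepts are identical, and end the algorithm. Otherwise go to step ④.

④ Check max[dept(lcs)] for c1 and c2. If it equals 1, go to step ⑤; otherwise go to step ⑥.

⑤ Check whether c1 and c2 are related terms, that is, terms that are related but not similar. If they are, set sim = 3; otherwise the two concepts are considered neither similar nor related, so set sim = 1. The algorithm ends.

⑥ Compute the depth factor and the distance factor of the concepts, and compute the similarity between the two concepts according to formula (7).

⑦ The algorithm ends.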
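The control flow of steps ① to ⑦ can be sketched in Python as follows. This is only one possible reading of the steps, under several assumptions that are not from the paper: the ontology is a toy `parents` mapping with a single root, "nearest common ancestors" are taken to be common ancestors that are not proper ancestors of another common ancestor, depth counts the root as 1, and the names `up_distances`, `depth`, `similarity`, `is_related` and `formula7` are hypothetical. Formula (7) is not reproduced in this section, so `formula7` is a caller-supplied stand-in; its two-argument signature is also an assumption, since the actual formula uses density-weighted factors.

```python
from collections import deque

def up_distances(parents, node):
    """Shortest upward distance from node to itself and to each ancestor (BFS)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for p in parents.get(cur, ()):
            if p not in dist:
                dist[p] = dist[cur] + 1
                queue.append(p)
    return dist

def depth(parents, node):
    """Depth of node, with the root at depth 1 (assumed convention for step 4)."""
    dist = up_distances(parents, node)
    return 1 + min(d for a, d in dist.items() if not parents.get(a))

def similarity(parents, c1, c2, is_related, formula7):
    # Step 1: both concepts must exist in the ontology.
    if c1 not in parents or c2 not in parents:
        raise KeyError("concept not found in ontology")
    # Step 2: nearest common ancestors, then max path and max LCS depth.
    d1, d2 = up_distances(parents, c1), up_distances(parents, c2)
    common = set(d1) & set(d2)  # non-empty when the ontology has a single root
    anc = {b: set(up_distances(parents, b)) for b in common}
    lcs_nodes = [a for a in common
                 if not any(a != b and a in anc[b] for b in common)]
    max_path = max(d1[a] + d2[a] for a in lcs_nodes)
    max_depth = max(depth(parents, a) for a in lcs_nodes)
    # Step 3: identical concepts.
    if max_path == 0:
        return 4.0
    # Steps 4 and 5: the only common ancestor is the root.
    if max_depth == 1:
        return 3.0 if is_related(c1, c2) else 1.0
    # Step 6: delegate to the stand-in for formula (7).
    return formula7(max_path, max_depth)

# Toy ontology (hypothetical labels, child -> list of parents).
TOY = {
    "root": [],
    "disease": ["root"],
    "heart disease": ["disease"],
    "myocardial infarction": ["heart disease"],
    "heart failure": ["heart disease"],
    "drug": ["root"],
    "aspirin": ["drug"],
}

def never_related(a, b):
    """Stand-in relatedness check that always answers no."""
    return False

def echo_inputs(max_path, max_depth):
    """Stand-in for formula (7): returns its inputs instead of a score."""
    return (max_path, max_depth)

print(similarity(TOY, "heart failure", "heart failure", never_related, echo_inputs))
print(similarity(TOY, "myocardial infarction", "heart failure", never_related, echo_inputs))
print(similarity(TOY, "aspirin", "heart failure", never_related, echo_inputs))
```

Passing the relatedness check and formula (7) as callables keeps the branching logic separate from the parts this section does not specify; a real implementation would replace both stand-ins and load the hierarchy from SNOMED CT or MeSH.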

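4 Experimental Results and Analysis
4.1 Experimental Design

4.2 Experimental Results

(1) Comparison with the Pedersen benchmark

The Pedersen benchmark contains 30 term pairs, of which SNOMED CT 2014 covers 29 and MeSH 2014 covers 25. For terms that an ontology does not cover, reference [34] substitutes the closest concept in the ontology and then computes the similarity. Following this practice, this paper computed the similarity of 29 term pairs. Because the two ontologies differ greatly in structure, the parameters α and β were tuned experimentally: the results came closest to the human scores with α = 1, β = 1 in SNOMED CT and α = 0.8, β = 0.8 in MeSH. The Pearson correlation coefficient is commonly used to measure how well such algorithms perform. Table 1 compares the correlation coefficients with the Pedersen benchmark for the algorithms evaluated in references [20-21, 32, 34] and for the algorithm proposed here.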
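As a rough illustration of the evaluation metric, the Pearson correlation coefficient between algorithm scores and human scores can be computed as in the sketch below. The function name `pearson` and the four score pairs are made up for illustration and are not data from the paper or from the Pedersen benchmark.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for four term pairs (illustrative only).
human_scores = [4.0, 3.0, 1.0, 2.0]
algorithm_scores = [3.8, 2.9, 1.4, 2.5]
r = pearson(human_scores, algorithm_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the algorithm ranks and spaces the term pairs much as the human raters do, which is the sense in which the tables compare algorithms.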

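(2) Comparison with the Hliaoutakis benchmark

The Hliaoutakis benchmark contains 36 pairs of concept terms selected from MeSH, each scored by human raters on a scale from 0 to 1. Reference [36] selected 32 of these pairs and reported results for the Dice, Jaccard, Rodriguez & Egenhofer, and Cosine algorithms with MeSH as the ontology. The authors use the same 32 pairs and first compute similarity with MeSH as the ontology. They also compute similarity with SNOMED CT as the ontology; because two of the pairs are not covered there and have no similar concept to substitute, only 30 pairs were computed with SNOMED CT. Table 2 lists the correlation coefficients between each algorithm and the Hliaoutakis benchmark.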

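4.3 Analysis of the Experimental Results

(1) The results of almost all algorithms are closer to the coders' scores. This is because medical coders are trained professionals with knowledge of medical classification, so they classify medical vocabulary more objectively and accurately. Reference [34] accordingly compares its results only with the coders' scores.

(2) The classic information-content-based algorithms (rows 1-3) and the corpus-based algorithms (rows 22-25) are strongly affected by the size and domain specificity of the corpus.

(3) The classic distance-based methods and the hybrid algorithms (rows 7-18) perform very differently across ontologies, and perform poorly in SNOMED CT in particular.

(4) The improved algorithms based on information content derived purely from the ontology (rows 4-6 and 19-21) outperform the classic information-content algorithms. This suggests, in one respect, that methods based on a domain ontology are more precise than corpus-based methods.

(5) The proposed algorithm (rows 26-27) achieves higher correlation coefficients in both ontologies, and the two sets of results are close: the correlation coefficient between the results on the two ontologies is 0.978 (see the supporting data in the online version of this paper). This indicates that the proposed algorithm is both more precise and more generally applicable.

5 Conclusion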

References

[1] Chen M Y, Chu H C, Chen Y M. Developing a Semantic-Enable Information Retrieval Mechanism[J]. Expert Systems with Applications, 2010, 37(1): 322-340.
[2] Kimtani D K, Choudhury J, Chakrabarty A. Improvement in Word Sense Disambiguation by Introducing Enhancements in English WordNet Structure[J]. International Journal on Computer Science and Engineering, 2012, 4(7): 1366-1370.
[3] Leroy G, Rindflesch T C. Effects of Information and Machine Learning Algorithms on Word Sense Disambiguation with Small Datasets[J]. International Journal of Medical Informatics, 2005, 74(7-8): 573-585.
[4] Cilibrasi R L, Vitanyi P M B. The Google Similarity Distance[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(3): 370-383.
[5] Stevenson M, Greenwood M A. A Semantic Approach to IE Pattern Induction[C]. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005: 379-386.
[6] Aseervatham S, Bennani Y. Semi-Structured Document Categorization with a Semantic Kernel[J]. Pattern Recognition, 2009, 42(9): 2067-2076.
[7] Batet M, Valls A, Gibert K. Improving Classical Clustering with Ontologies[C]. In: Proceedings of the 4th World Conference of the IASC, Yokohama, Japan. 2008: 137-146.
[8] Lu H M, Chen H, Zeng D, et al. Multilingual Chief Complaint Classification for Syndromic Surveillance: An Experiment with Chinese Chief Complaints[J]. International Journal of Medical Informatics, 2009, 78(5): 308-320.
[9] Papachristoudis G, Diplaris S, Mitkas P A. SoFoCles: Feature Filtering for Microarray Classification Based on Gene Ontology[J]. Journal of Biomedical Informatics, 2010, 43(1): 1-14.
[10] Sheng Qiuyan. Research on the Measuring of Semantic Similarity Based Ontology[J]. Information Science, 2012, 30(8): 1238-1241. (in Chinese)
[11] Liu Hongzhe, Xu De. Ontology Based Semantic Similarity and Relatedness Measures Review[J]. Computer Science, 2012, 39(2): 8-13. (in Chinese)
[12] Qin Chunxiu, Zhu Ting, Zhao Pengwei, et al. Research Review on Semantics Analysis of Natural Language[J]. Library and Information Service, 2014, 58(22): 130-137. (in Chinese)
[13] Landauer T K, Foltz P W, Laham D. An Introduction to Latent Semantic Analysis[J]. Discourse Processes, 1998, 25(2-3): 259-284.
[14] Chen Haiyan. Measuring Semantic Similarity Between Words Using Web Search Engines[J]. Computer Science, 2015, 42(1): 261-267. (in Chinese)
[15] Li Yun. Mining Semantic Knowledge from Chinese Wikipedia[D]. Beijing: Beijing University of Posts and Telecommunications, 2009. (in Chinese)
[16] Lord P W, Stevens R D, Brass A, et al. Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation[J]. Bioinformatics, 2003, 19(10): 1275-1283.
[17] Resnik P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy[C]. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI95). 1995: 448-453.
[18] Lin D. An Information-Theoretic Definition of Similarity[C]. In: Proceedings of the 15th International Conference on Machine Learning (ICML98). 1998: 296-304.
[19] Jiang J J, Conrath D W. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy[C]. In: Proceedings of the 10th International Conference on Research in Computational Linguistics. 1997: 19-33.
[20] Batet M, Sanchez D, Valls A. An Ontology-Based Measure to Compute Semantic Similarity in Biomedicine[J]. Journal of Biomedical Informatics, 2011, 44(1): 118-125.
[21] Sanchez D, Batet M. Semantic Similarity Estimation in the Biomedical Domain: An Ontology-Based Information-Theoretic Perspective[J]. Journal of Biomedical Informatics, 2011, 44(5): 749-759.
[22] You Bin, Yan Yuesong, Sun Yingge, et al. Method of Information Content Evaluating Semantic Similarity on HowNet[J]. Computer Systems & Applications, 2013, 22(1): 129-133. (in Chinese)
[23] Rada R, Mili H, Bicknell E, et al. Development and Application of a Metric on Semantic Nets[J]. IEEE Transactions on Systems, Man and Cybernetics, 1989, 19(1): 17-30.
[24] Leacock C, Chodorow M. Combining Local Context and WordNet Similarity for Word Sense Identification[A]. // WordNet: An Electronic Lexical Database[M]. MIT Press, 1998: 265-283.
[25] Wu Z, Palmer M. Verb Semantics and Lexical Selection[C]. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1994: 133-138.
[26] Tversky A. Features of Similarity[J]. Psychological Review, 1977, 84(4): 327-352.
[27] Patwardhan S, Pedersen T. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts[C]. In: Proceedings of the EACL Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, Trento, Italy. 2006: 1-8.
[28] Banerjee S, Pedersen T. Extended Gloss Overlaps as a Measure of Semantic Relatedness[C]. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI). 2003: 805-810.
[29] Wan S, Angryk R A. Measuring Semantic Similarity Using Wordnet-Based Context Vectors[C]. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics. 2007: 908-913.
[30] Li Y, Bandar Z A, McLean D. An Approach for Measuring Semantic Similarity Between Words Using Multiple Information Sources[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4): 871-882.
[31] Wu Jian, Wu Zhaohui, Li Ying, et al. Web Service Discovery Based on Ontology and Similarity of Words[J]. Chinese Journal of Computers, 2005, 28(4): 595-602. (in Chinese)
[32] Pedersen T, Pakhomov S, Patwardhan S, et al. Measures of Semantic Similarity and Relatedness in the Biomedical Domain[J]. Journal of Biomedical Informatics, 2007, 40(3): 288-299.
[33] Hliaoutakis A, Varelas G, Voutsakis E, et al. Information Retrieval by Semantic Similarity[J]. International Journal on Semantic Web and Information Systems, 2006, 2(3): 55-73.
[34] Al-Mubaid H, Nguyen H A. A Cluster-Based Approach for Semantic Similarity in the Biomedical Domain[C]. In: Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. New York: IEEE Computer Society, 2006: 2713-2717.
[35] Li Wenqing, Xie Hongwei. Semantic Similarity Estimation Method Based on Medical Ontology[J]. Computer Engineering and Design, 2013, 34(4): 1287-1291. (in Chinese)
[36] Sun Haixia, Qian Qing, Wu Yingjie, et al. Research on Semantic Similarity Measuring of MeSH[J]. New Technology of Library and Information Service, 2010(6): 12-16. (in Chinese)