Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (11): 15-25     https://doi.org/10.11925/infotech.2096-3467.2020.0299
Research Paper
Automatically Identifying Hypernym-Hyponym Relations of Domain Concepts with Patterns and Projection Learning*
Wang Sili1,2, Zhu Zhongming1,2, Yang Heng1, Liu Wei1
1Literature and Information Center of Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China
2University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] To automatically identify hypernym-hyponym relations between domain concepts, addressing the automatic acquisition and establishment of semantic relations between domain concepts in automated domain ontology construction. [Methods] The traditional unsupervised pattern-based method is combined with the current advanced supervised projection-learning method, applied to the automatic identification of hypernym-hyponym relations between domain concepts, and evaluated experimentally. [Results] The method identifies hypernym sets for domain concepts. Its identification precision is 0.88 in the medical domain and 0.83 in the general domain, and its average precision on the BLESS benchmark is 0.85. [Limitations] Owing to syntactic ambiguity, corpus quality and similar factors, the model's precision has not yet reached its peak, and some relations are misidentified. [Conclusions] The method can find hypernyms for different senses of the same concept word, and it also performs well on low-frequency words and named entities. Future work could down-weight high-frequency top-level hypernyms and improve the quality of the supervised corpus.

Keywords: Hearst Pattern; Projection Learning; Word Embedding; Hypernym-Hyponym Relations; Domain Concept
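Received: 2020-04-09      Published: 2020-12-04
ZTFLH:  TP391  
Funding: *This work is one of the research outcomes of the National Key R&D Program of the Ministry of Science and Technology of China project "Construction of an Integrated Sharing Platform for Climate Change Science Data and Knowledge" (2018YFC1509007); the Chinese Academy of Sciences 2019 "Light of West China" project "Research on Contextualized Organization and Service of Open Academic Resources" (Y9AX011001); and the 2018 Literature and Information Innovation Capacity Building project of the Literature and Information Center, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, "Research on Automatic Construction Methods of Domain Ontology Based on Deep Learning" (Y8AJ012005).
Corresponding author: Wang Sili     E-mail: wangsl@llas.ac.cn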
Cite this article:
Wang Sili, Zhu Zhongming, Yang Heng, Liu Wei. Automatically Identifying Hypernym-Hyponym Relations of Domain Concepts with Patterns and Projection Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(11): 15-25.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0299      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I11/15
Fig.1  Research framework for automatic identification of hypernym-hyponym relations based on patterns and projection learning
English pattern | Chinese pattern
Y such as X | Y例如/比如X
Y other than X | 除了Y之外的X/ Y不仅是X
Y including X | Y包含X
Y especially X | Y尤其/特别是X
not all Y are X | 不全是/并不是所有的Y都是X
Y like X | Y类似X
Y for example X | Y例如/比如/示例X
Y which includes X | Y是那些包含X
X are also Y | X也是Y
X are all Y | X都是Y
not Y so much as X | 没有Y而是X
Y is a X | Y是一种/个/只…X
Table 1  Examples of hypernym identification patterns based on extended Hearst patterns
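As a rough illustration of how patterns like those in Table 1 might be applied, the sketch below matches a few of the English patterns with regular expressions. It is not the authors' implementation: the `TERM` expression, the `PATTERNS` list and the `extract_pairs` function are hypothetical names, and terms are restricted to single words, whereas a real system would more likely match parsed noun phrases.

```python
import re

# Assumption: X is the hyponym and Y the hypernym, as in Table 1.
# Assumption: terms are single words; multi-word concepts would need a parser.
TERM = r"[A-Za-z][A-Za-z-]*"

# Hypothetical subset of the English patterns listed in Table 1.
PATTERNS = [
    re.compile(rf"(?P<Y>{TERM}) such as (?P<X>{TERM})"),
    re.compile(rf"(?P<Y>{TERM}) including (?P<X>{TERM})"),
    re.compile(rf"(?P<Y>{TERM}),? especially (?P<X>{TERM})"),
    re.compile(rf"(?P<X>{TERM}) are also (?P<Y>{TERM})"),
]


def extract_pairs(sentence):
    """Return lowercased (hyponym, hypernym) pairs matched by any pattern."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("X").lower(), match.group("Y").lower()))
    return pairs


pairs = extract_pairs("Proteins such as thymosin are studied; enzymes are also proteins.")
print(pairs)  # [('thymosin', 'proteins'), ('enzymes', 'proteins')]
```

Pairs collected this way are noisy on their own, which is consistent with the low pattern-only scores the article reports.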
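Method | Settings | Results (average precision, AP)
① Patterns | Extended Hearst patterns: distributional hypothesis + co-hyponym identification patterns | General domain: 0.38; medical domain: 0.41
② Projection learning | Word2Vec 100 dimensions, 10 training iterations, single projection (1), no negative sampling, no subsampling of high-frequency words | General domain: 0.54; medical domain: 0.60
③ Projection learning | Word2Vec 200 dimensions, 20 training iterations, multiple projections (24), negative sampling 15, high-frequency word subsampling threshold 1e-5 | General domain: 0.66; medical domain: 0.72
④ Patterns + projection learning | Extended Hearst patterns + 20 training iterations, Word2Vec 200 dimensions, multiple projections (24), negative sampling 15, high-frequency word subsampling threshold 1e-5 | General domain: 0.83; medical domain: 0.88; BLESS evaluation set: 0.85
Table 2  Comparison of hypernym-hyponym relation identification experiments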
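Table 2 reports average precision (AP). The page does not state the exact formula used, so the sketch below assumes the common information-retrieval definition: the precision at each rank where a correct hypernym appears, summed and divided by the number of gold hypernyms. The function name and the toy labels are made up for illustration.

```python
def average_precision(ranked, gold):
    """AP of one ranked candidate list against a set of gold hypernyms.

    Assumption: normalisation by the number of gold items (standard IR form).
    """
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(gold) if gold else 0.0


# Toy example: correct items sit at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
print(ap)  # (1/1 + 2/3) / 2 = 5/6
```

Averaging this value over all query concepts would give one figure per domain, which is one plausible reading of the per-domain numbers in Table 2.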
Medical domain concept | Hypernym set (Top 5)
Aneurysm | procedure; clinical finding; soft tissue lesion; anatomical structure; disease
Diagnostic lumbar puncture | clinical finding; disease; procedure; sickness; illness
Vertebra | body region; bone; body structure; fracture; anatomical structure
Thymosin | protein; biopolymer; enzyme; hydrolase; lyase
Pain assessment | pain; sickness; disease; illness; practice of medicine
Table 3  Examples of hypernym-hyponym relation identification results for medical domain concepts
General domain concept | Hypernym set (Top 5)
Miscreant | person; bad person; wrongdoer; actor; politician
Queen Elizabeth | person; king; monarch; aristocrat; patrician
Microcontroller | electronic circuit; circuitry; pc board; computer chip; electrical device
Business concern (a commercial company / a business matter of concern) | corporation; business organization; government agency; business firm; written agreement
Vegetarian (a vegetarian person / meat-free) | dessert; dish; recipe; food product; person
Table 4  Examples of hypernym-hyponym relation identification results for general domain concepts
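Top-5 hypernym sets like those in Tables 3 and 4 could be produced by projecting a concept's embedding and ranking candidate hypernyms by similarity. The sketch below assumes a single learned projection matrix and cosine similarity as the ranking score. The page does not confirm either detail, and the matrix `PHI`, the 2-D vectors and all function names are invented for illustration only.

```python
import math


def project(matrix, vec):
    """Apply a projection matrix (a list of rows) to an embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_hypernyms(hyponym_vec, matrix, candidates, k=5):
    """Rank candidate hypernyms by similarity to the projected hyponym vector."""
    projected = project(matrix, hyponym_vec)
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(projected, item[1]),
        reverse=True,
    )
    return [word for word, _ in scored[:k]]


# Toy data, not taken from the article: PHI simply swaps the two coordinates.
PHI = [[0.0, 1.0], [1.0, 0.0]]
CANDIDATES = {
    "bone": [0.1, 1.0],
    "fracture": [1.0, 0.2],
    "body structure": [0.5, 0.9],
}
top = top_k_hypernyms([1.0, 0.0], PHI, CANDIDATES, k=2)
print(top)  # ['bone', 'body structure']
```

In a trained system the matrix would be learned from known hyponym-hypernym pairs, and the multiple-projection setting in Table 2 suggests several such matrices rather than one.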