New Technology of Library and Information Service (现代图书情报技术), 2015, Vol. 31, Issue 4: 41-49. DOI: 10.11925/infotech.1003-3513.2015.04.06
Research Paper
The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set
Duan Yufeng (段宇锋)¹, Zhu Wenjing (朱雯晶)², Chen Qiao (陈巧)¹, Liu Wei (刘伟)³, Liu Fenghong (刘凤红)⁴
¹ Business School, East China Normal University, Shanghai 200241, China;
² Shanghai Library, Shanghai 200031, China;
³ School of Public Economics and Administration, Shanghai University of Finance and Economics, Shanghai 200433, China;
⁴ Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
Abstract

[Objective] To build an out-of-vocabulary (OOV) word identification model that improves the ability to find OOV words in natural-science texts while reducing the cost of manual intervention. [Methods] Starting from a research hypothesis, an OOV word identification model is constructed that combines conditional random fields (CRFs) with a domain ontology element set. Using biodiversity texts as samples, the hypothesis is tested and the soundness of the model is verified by comparing the performance of different models. [Results] The experiments show that OOV word identification performance improves step by step as the features of the CRFs model change from characters alone, to mixed character-word sequences, to mixed character-word sequences with default part-of-speech (POS) tags, to mixed character-word sequences with POS tags that include custom semantic function tags. This result confirms the research hypothesis and shows that the model built in this paper is scientifically sound and reasonable. [Limitations] The accuracy with which the model tags OOV words still needs improvement. [Conclusions] The model has stronger OOV word identification capability and can greatly reduce the cost of manually building a training set.

Keywords: Conditional random fields (CRFs); Domain ontology; Out-of-vocabulary word identification
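
As a rough illustration of the four feature configurations compared in the abstract, the Python sketch below builds training rows in the column layout used by CRF++-style toolkits (one token per line, tab-separated columns, label last) for a toy sentence. Everything specific here is an assumption made for illustration and is not taken from the paper: the B/I/O labeling of OOV spans, the sample POS tags, the semantic tag name `v_dist`, and the helper names `Unit`, `build_rows`, and `to_crfpp`. Reading a "mixed character-word sequence" as known words kept whole with unrecognized text left as single characters is likewise only one possible interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    """One unit of a mixed character-word sequence (hypothetical structure)."""
    text: str      # a known word, or a single character of unrecognized text
    pos: str       # default POS tag from a segmenter (illustrative values)
    label: str     # assumed B/I/O label: B/I mark an OOV span, O is outside
    sem: str = ""  # custom semantic function tag, empty when none applies


# Toy sentence "珙桐分布于中国"; "珙桐" is treated as the OOV word.
SAMPLE = [
    Unit("珙", "ng", "B"),
    Unit("桐", "ng", "I"),
    Unit("分布", "v", "O", "v_dist"),  # hypothetical semantic tag
    Unit("于", "p", "O"),
    Unit("中国", "ns", "O"),
]


def build_rows(units: List[Unit], config: str) -> List[List[str]]:
    """Return feature rows for one of the four assumed configurations."""
    rows: List[List[str]] = []
    for u in units:
        if config == "char":
            # Characters only: split every unit into single characters.
            for i, ch in enumerate(u.text):
                # Only the first character of a B unit keeps B; the rest are I.
                lab = u.label if (i == 0 or u.label != "B") else "I"
                rows.append([ch, lab])
        elif config == "mixed":
            rows.append([u.text, u.label])
        elif config == "mixed_pos":
            rows.append([u.text, u.pos, u.label])
        elif config == "mixed_sem":
            # The semantic tag replaces the default POS tag where one exists.
            rows.append([u.text, u.sem or u.pos, u.label])
        else:
            raise ValueError(f"unknown config: {config}")
    return rows


def to_crfpp(rows: List[List[str]]) -> str:
    """Join rows into tab-separated lines; a blank line ends the sentence."""
    return "\n".join("\t".join(r) for r in rows) + "\n\n"


if __name__ == "__main__":
    for cfg in ("char", "mixed", "mixed_pos", "mixed_sem"):
        print(f"# {cfg}")
        print(to_crfpp(build_rows(SAMPLE, cfg)), end="")
```

Changing the `config` argument alters only the observation columns; the label column is derived from the same annotations, so the four settings can be compared on identical data.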
Received: 2014-09-19
CLC number: TP391.1
Funding: This work is one of the research outcomes of the National Social Science Fund of China General Project "Research on Web-based Chinese Academic Information Extraction Based on Unsupervised Semantic Annotation" (Grant No. 11BTQ024).

Corresponding author: Duan Yufeng, ORCID: 0000-0002-4319-2837, E-mail: yfduan@infor.ecnu.edu.cn
Author contributions: Duan Yufeng: proposed the research idea, designed the research plan, drafted the paper and revised the final version; Zhu Wenjing, Chen Qiao: developed the programs, ran the experiments, organized the data, and drafted the paper; Liu Wei: analyzed the data; Liu Fenghong: built the PBO.
Cite this article:
Duan Yufeng, Zhu Wenjing, Chen Qiao, Liu Wei, Liu Fenghong. The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set [J]. New Technology of Library and Information Service, 2015, 31(4): 41-49. DOI: 10.11925/infotech.1003-3513.2015.04.06.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.04.06

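Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn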