New Technology of Library and Information Service  2015, Vol. 31 Issue (4): 41-49    DOI: 10.11925/infotech.1003-3513.2015.04.06
The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set
Duan Yufeng1, Zhu Wenjing2, Chen Qiao1, Liu Wei3, Liu Fenghong4
1 Business School, East China Normal University, Shanghai 200241, China;
2 Shanghai Library, Shanghai 200031, China;
3 School of Public Economics and Administration, Shanghai University of Finance and Economics, Shanghai 200433, China;
4 Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
Abstract  

[Objective] To establish a model that improves out-of-vocabulary (OOV) word identification and reduces the cost of manual intervention. [Methods] Starting from a stated hypothesis, an OOV identification model is built that combines conditional random fields (CRFs) with a set of domain ontology elements. Using biodiversity texts as samples, the model is validated by comparing its performance with that of other models and by testing the hypothesis. [Results] In the experiments, the proposed model shows the best identification capability among the models compared. The results support the hypothesis and indicate that the model is sound. [Limitations] The tagging accuracy of the model still needs improvement. [Conclusions] The proposed model identifies OOV words better than the compared models while greatly reducing the cost of manually building the training dataset.
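
The abstract does not spell out how the ontology elements enter the CRF, so the following is only a minimal sketch of one common way to combine the two, not the authors' implementation. It assumes the ontology elements are available as a flat set of term strings, that the text is already segmented and POS-tagged, and that tokens are Chinese and can be joined without spaces. The function names `ontology_features` and `to_crf_rows`, the `B-ONT`/`I-ONT`/`O` flags, and the toy sentence are all hypothetical.

```python
from typing import List, Set


def ontology_features(tokens: List[str], ontology_terms: Set[str],
                      max_len: int = 4) -> List[str]:
    """Flag each token as B-ONT / I-ONT / O by greedy longest match.

    A run of up to `max_len` consecutive tokens is joined (no separator,
    which assumes Chinese text) and looked up in the ontology term set.
    """
    flags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if "".join(tokens[i:i + n]) in ontology_terms:
                matched = n
                break
        if matched:
            flags[i] = "B-ONT"
            for j in range(i + 1, i + matched):
                flags[j] = "I-ONT"
            i += matched
        else:
            i += 1
    return flags


def to_crf_rows(tokens: List[str], pos_tags: List[str],
                ontology_terms: Set[str]) -> List[str]:
    """Emit one tab-separated row per token: token, POS tag, ontology flag."""
    flags = ontology_features(tokens, ontology_terms)
    return ["\t".join(row) for row in zip(tokens, pos_tags, flags)]


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    tokens = ["珙桐", "属于", "蓝果树", "科"]
    pos_tags = ["n", "v", "n", "n"]
    terms = {"珙桐", "蓝果树科"}
    for row in to_crf_rows(tokens, pos_tags, terms):
        print(row)
```

Rows in this one-token-per-line, column-per-feature layout are the kind of input that CRF toolkits such as CRF++ read; for training, a gold label would be appended as the last column. Encoding the ontology match as a feature column, rather than as a hard rule, lets the CRF weigh it against context, which is one plausible reading of "combining" the two components.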

Key words: CRFs; Domain Ontology; Out-of-vocabulary identification
Received: 19 September 2014      Published: 21 May 2015
CLC Number: TP391.1

Cite this article:

Duan Yufeng, Zhu Wenjing, Chen Qiao, Liu Wei, Liu Fenghong. The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set. New Technology of Library and Information Service, 2015, 31(4): 41-49.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.04.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I4/41

