The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set
Duan Yufeng1, Zhu Wenjing2, Chen Qiao1, Liu Wei3, Liu Fenghong4
1 Business School, East China Normal University, Shanghai 200241, China;
2 Shanghai Library, Shanghai 200031, China;
3 School of Public Economics and Administration, Shanghai University of Finance and Economics, Shanghai 200433, China;
4 Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
[Objective] To establish a model that improves out-of-vocabulary word identification and reduces the cost of manual intervention. [Methods] Based on the study's hypothesis, an out-of-vocabulary identification model is built by combining Conditional Random Fields (CRFs) with a domain ontology element set. Using biodiversity texts as samples, the soundness of the model is verified by comparing performance across models and by testing the hypothesis. [Results] The experimental results show that the proposed model has the best identification capability among the models compared. The results confirm the hypothesis and indicate that the model is sound. [Limitations] The tagging accuracy of the model remains to be improved. [Conclusions] The proposed model identifies out-of-vocabulary words better while greatly reducing the cost of manually building the training dataset.
Duan Yufeng, Zhu Wenjing, Chen Qiao, Liu Wei, Liu Fenghong. The Study on Out-of-Vocabulary Identification on a Model Based on the Combination of CRFs and Domain Ontology Elements Set [J]. New Technology of Library and Information Service, 2015, 31(4): 41-49.