Chinese Term Recognition Based on Hidden Markov Model

doi:10.11925/infotech.1003-3513.2008.12.10

New Technology of Library and Information Service

2008, Vol. 24, Issue 12: 54-58. https://doi.org/10.11925/infotech.1003-3513.2008.12.10

Information Analysis and Research

Cen Yonghua¹,², Han Zhe², Ji Peipei³,⁴

¹ School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
² Department of Information Management, Nanjing University, Nanjing 210093, China
³ National Science Library, Chinese Academy of Sciences, Beijing 100190, China
⁴ Graduate University of Chinese Academy of Sciences, Beijing 100049, China

Abstract

Based on an analysis of the probabilistic characteristics of the syntactic composition of Chinese text, in particular part-of-speech (POS) collocation, this paper proposes an approach and a system framework for recognizing and extracting general Chinese terms with a dual-layer hidden Markov model (HMM), and implements the corresponding system. Using training corpora, term extraction is tested on texts from several domains. The results indicate that the proposed method performs well, and that the terms recognized and extracted by the system can serve as candidate terms for subsequent false-term elimination and optimization in combination with mutual information, log-likelihood, and domain-dependency parameters.

Keywords: Chinese term recognition and extraction; hidden Markov model; HMM
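As an illustration of the general idea summarized in the abstract (recognizing terms from the probabilistic patterns of POS sequences), the following is a minimal single-layer sketch, not the authors' dual-layer model. It assumes hidden labels B/I/O (term begin, term inside, outside) and POS tags as observations. All probabilities, the three-tag POS inventory, the example tokens, and the function names `viterbi` and `extract_terms` are hypothetical values chosen for demonstration only; in practice the parameters would be estimated from an annotated training corpus.

```python
import math

# Hidden labels: O = outside a term, B = term begins, I = inside a term.
STATES = ("O", "B", "I")

# Hypothetical toy parameters (NOT taken from the paper).
START = {"O": 0.6, "B": 0.4, "I": 0.0}
TRANS = {
    "O": {"O": 0.6, "B": 0.4, "I": 0.0},
    "B": {"O": 0.3, "B": 0.0, "I": 0.7},
    "I": {"O": 0.5, "B": 0.1, "I": 0.4},
}
# Emission probabilities over a tiny POS inventory: n (noun), v (verb), u (auxiliary).
EMIT = {
    "O": {"n": 0.2, "v": 0.5, "u": 0.3},
    "B": {"n": 0.7, "v": 0.2, "u": 0.1},
    "I": {"n": 0.8, "v": 0.1, "u": 0.1},
}


def _log(p):
    """Log probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(pos_tags):
    """Return the most probable B/I/O label sequence for a POS tag sequence."""
    if not pos_tags:
        return []
    # scores[t][s]: best log probability of any path ending in state s at step t.
    scores = [{s: _log(START[s]) + _log(EMIT[s][pos_tags[0]]) for s in STATES}]
    backpointers = []
    for tag in pos_tags[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + _log(TRANS[r][s]))
            cur[s] = prev[best] + _log(TRANS[best][s]) + _log(EMIT[s][tag])
            ptr[s] = best
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_terms(tokens, labels):
    """Join tokens labelled B followed by I... into candidate term strings."""
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                terms.append("".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


if __name__ == "__main__":
    # Hypothetical segmented and POS-tagged input.
    tokens = ["研究", "信息", "系统", "的", "发展"]
    pos_tags = ["v", "n", "n", "u", "v"]
    labels = viterbi(pos_tags)
    print(labels)                         # ['O', 'B', 'I', 'O', 'O']
    print(extract_terms(tokens, labels))  # ['信息系统']
```

With these toy parameters the two adjacent nouns are grouped into one candidate term. Candidates produced this way would still need the filtering step the abstract mentions (mutual information, log-likelihood, domain dependency), which this sketch does not cover.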

Received: 2008-08-13; Published: 2008-12-25

CLC number: TP391; G358; H031

Corresponding author: Cen Yonghua, E-mail: yhcen@163.com

About the authors: Cen Yonghua, Han Zhe, Ji Peipei

Cite this article:

Cen Yonghua, Han Zhe, Ji Peipei. Chinese Term Recognition Based on Hidden Markov Model[J]. New Technology of Library and Information Service, 2008, 24(12): 54-58.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.12.10 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I12/54
