New Technology of Library and Information Service, 2013, Vol. 29, Issue 2: 24-29     https://doi.org/10.11925/infotech.1003-3513.2013.02.04
Knowledge Organization and Knowledge Management
Chinese Term Extraction Based on Improved C-value Method
Hu Apei, Zhang Jing, Liu Junli
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract: This paper proposes IC-value, an improved version of the C-value method for term extraction. The text is first preprocessed with a stop-word list, and an extraction algorithm based on string-frequency statistics then produces candidate terms, which are filtered by linguistic rules. IC-value improves on C-value in three respects (inverse document frequency, broken, i.e. meaningless, substrings, and term length) and is used to compute the termhood of each candidate term. An empirical study on 1,000 abstracts of papers on hepatitis B shows that IC-value outperforms C-value, TF-IDF and V-value in both precision and recall, finds long terms well, and is markedly effective at identifying broken substrings.
Key words: Term extraction; String-frequency statistics; Linguistic rules; Termhood
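The abstract does not state the IC-value formula, so it is not reproduced here. As background only, the baseline it improves on, the C-value measure of Frantzi, Ananiadou and Mima (2000), can be sketched as follows. A candidate that is not nested in any longer candidate scores log2|a| · f(a); a nested candidate scores log2|a| · (f(a) − the mean frequency of the longer candidates containing it). The function names, the tuple-of-tokens representation, and the toy frequencies are assumptions made for illustration and do not come from the paper.

```python
import math
from collections import defaultdict


def is_nested(a, b):
    """Return True if token sequence a occurs contiguously inside b."""
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))


def c_value(freq):
    """Baseline C-value for every candidate term.

    freq maps a candidate term (a tuple of tokens) to its corpus frequency.
    Term length |a| is counted in tokens (an assumption; for Chinese text
    it could equally be counted in characters or segmented words).
    """
    terms = list(freq)

    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and is_nested(a, b):
                containers[a].append(b)

    scores = {}
    for a in terms:
        weight = math.log2(len(a))  # note: 0 for single-token candidates
        longer = containers[a]
        if not longer:
            scores[a] = weight * freq[a]
        else:
            nested_mean = sum(freq[b] for b in longer) / len(longer)
            scores[a] = weight * (freq[a] - nested_mean)
    return scores


# Toy frequencies, invented for this example (not data from the paper).
toy_freq = {
    ("hepatitis", "b"): 10,
    ("hepatitis", "b", "virus"): 6,
}
scores = c_value(toy_freq)
# ("hepatitis", "b") is nested in one longer candidate:
# log2(2) * (10 - 6) = 4.0
```

Because the weight is log2 of the term length, single-token candidates always score zero under this baseline, and a substring that appears almost only inside longer candidates is pushed toward zero.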
Received: 2013-01-04      Published: 2013-04-24
CLC number: TP391.1
Corresponding author: Hu Apei     E-mail: huap2011@istic.ac.cn
Cite this article:
Hu Apei, Zhang Jing, Liu Junli. Chinese Term Extraction Based on Improved C-value Method. New Technology of Library and Information Service, 2013, 29(2): 24-29.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.02.04      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V29/I2/24