New Technology of Library and Information Service  2013, Vol. 29 Issue (2): 24-29    DOI: 10.11925/infotech.1003-3513.2013.02.04
Chinese Term Extraction Based on Improved C-value Method
Hu Apei, Zhang Jing, Liu Junli
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract  This paper introduces an improved C-value method for term extraction. First, the domain-specific text corpus is preprocessed with a stop word list. Second, a term extraction algorithm based on the co-occurrence frequency of multi-character strings is applied to obtain candidate terms. Finally, terms are selected according to termhood computed by IC-value, which improves on C-value with respect to inverse document frequency, meaningless substrings, and term length. An empirical study is conducted on 1,000 abstracts of articles about hepatitis B. The results indicate that the proposed IC-value performs much better than C-value, TF-IDF, and V-value in both precision and recall. IC-value also performs well in extracting long terms and is very effective at filtering out meaningless substrings.
Key words: Term extraction; Statistics of string frequency; Linguistic rules; Termhood
Received: 04 January 2013      Published: 24 April 2013
CLC Number: TP391.1
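
The abstract does not give the IC-value formula, so it is not reproduced here. As background only, the following is a minimal sketch of the baseline C-value score of Frantzi et al. that IC-value is said to improve on. It assumes candidate length is counted in characters (the original C-value counts words), and the function name `c_value` and the toy frequencies are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def c_value(freq):
    """Baseline C-value for each candidate string.

    freq maps a candidate string to its corpus frequency.
    Length is measured in characters (an assumption for Chinese text).
    """
    candidates = list(freq)

    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in candidates:
        for b in candidates:
            if a != b and a in b:
                containers[a].append(b)

    scores = {}
    for a in candidates:
        length_weight = math.log2(len(a))
        nested_in = containers[a]
        if not nested_in:
            # Not nested in any longer candidate: weight the raw frequency.
            scores[a] = length_weight * freq[a]
        else:
            # Nested: subtract the mean frequency of the containing candidates.
            mean_container_freq = sum(freq[b] for b in nested_in) / len(nested_in)
            scores[a] = length_weight * (freq[a] - mean_container_freq)
    return scores


# Toy frequencies, invented for illustration (not data from the paper).
toy = {"乙型肝炎": 10, "乙型肝炎病毒": 6, "型肝炎": 10}
scores = c_value(toy)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2))
```

In this toy input the fragment "型肝炎" is penalized because it mostly occurs inside longer candidates. This is the nesting effect that baseline C-value already captures; according to the abstract, IC-value further adjusts for inverse document frequency, meaningless substrings, and term length.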

Cite this article:

Hu Apei, Zhang Jing, Liu Junli. Chinese Term Extraction Based on Improved C-value Method. New Technology of Library and Information Service, 2013, 29(2): 24-29.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.02.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I2/24
