Abstract: This paper introduces an improved C-value method for term extraction. First, the domain-specific corpus is preprocessed with a stop-word list. Second, a term extraction algorithm based on the co-occurrence frequency of multi-character strings is applied to obtain candidate terms. Finally, terms are selected by termhood, computed with IC-value, which improves on C-value with respect to inverse document frequency, meaningless substrings, and term length. An empirical study is conducted on 1,000 abstracts of articles about hepatitis B. The results indicate that the proposed IC-value considerably outperforms C-value, TF-IDF, and V-value in both precision and recall. IC-value also performs well in extracting long terms and is effective at filtering out meaningless substrings.
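For orientation, the baseline that IC-value builds on can be sketched in a few lines. The abstract does not give the IC-value formula, so the sketch below covers only the standard C-value of Frantzi et al. in its simple closed form; it is not the authors' method. The function names `is_sublist` and `c_value`, the tuple-of-tokens representation, and the toy frequencies are assumptions made for illustration.

```python
import math
from collections import defaultdict


def is_sublist(short, long):
    """Return True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Baseline C-value (simple form), NOT the paper's IC-value.

    `freq` maps each candidate term (a tuple of tokens) to its corpus
    frequency. Candidates of length 1 score 0 because log2(1) == 0.
    """
    # For every candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and is_sublist(a, b):
                containers[a].append(b)

    scores = {}
    for a in freq:
        length_weight = math.log2(len(a))
        nested_in = containers[a]
        if not nested_in:
            # Not nested: length weight times raw frequency.
            scores[a] = length_weight * freq[a]
        else:
            # Nested: subtract the mean frequency of the containing terms.
            mean_container_freq = sum(freq[b] for b in nested_in) / len(nested_in)
            scores[a] = length_weight * (freq[a] - mean_container_freq)
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("hepatitis", "b", "virus"): 5,
    ("hepatitis", "b"): 8,
    ("b", "virus"): 5,
}
scores = c_value(freq)
```

In this toy input, `("b", "virus")` never occurs outside the longer term, so its score drops to zero, while `("hepatitis", "b")` keeps a positive score because it also occurs on its own. This is the nested-substring behaviour of plain C-value; how IC-value changes the treatment of inverse document frequency, meaningless substrings, and term length is described in the paper itself.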
胡阿沛, 张静, 刘俊丽. 基于改进C-value方法的中文术语抽取[J]. 现代图书情报技术, 2013, 29(2): 24-29.
Hu Apei, Zhang Jing, Liu Junli. Chinese Term Extraction Based on Improved C-value Method. New Technology of Library and Information Service, 2013, 29(2): 24-29.