Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model
Jiang Lin1,2, Wang Dongbo3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract [Objective] This study aims to extract domain terms more accurately and conveniently. [Methods] First, we used the continuous bag-of-words (CBOW) model to build word vectors for each component of the candidate terms. Then, we applied cosine similarity to calculate the degree of internal correlation among each term's components. To obtain more representative terms, we used the PageRank algorithm to rank the candidates. [Results] We obtained high recall and precision using paper abstracts in the field of natural language processing as the training corpus. [Limitations] The training corpus was relatively small, which might influence the results. [Conclusions] This study shows that the CBOW model is a more appropriate method for extracting terminology.
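The two scoring steps named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word vectors are hand-written stand-ins for CBOW-trained vectors, the internal correlation is assumed to be the mean pairwise cosine similarity among a candidate's components, and the candidate graph passed to PageRank is hypothetical, since the abstract does not say how the graph is built. All function and variable names are introduced here for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def internal_correlation(components, vectors):
    """Assumed definition: mean pairwise cosine similarity of the components."""
    pairs = list(combinations(components, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each node to its out-neighbours.

    Every neighbour must also appear as a key of graph.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy vectors standing in for CBOW output (hypothetical values).
vectors = {
    "natural": [0.9, 0.1, 0.0],
    "language": [0.8, 0.2, 0.1],
    "processing": [0.7, 0.3, 0.1],
    "the": [0.0, 0.1, 0.9],
}

# Candidate terms split into their components.
candidates = {
    "natural language processing": ["natural", "language", "processing"],
    "the language": ["the", "language"],
}
scores = {term: internal_correlation(parts, vectors)
          for term, parts in candidates.items()}

# Hypothetical graph linking candidate terms to one another.
graph = {
    "word vector": ["language model"],
    "term extraction": ["language model"],
    "language model": ["word vector"],
}
ranks = pagerank(graph)
```

In this toy setup, a candidate whose components point in similar directions receives a higher internal correlation than one that includes an unrelated word, and the candidate with the most incoming links receives the highest PageRank score. With real data the vectors would come from a trained CBOW model rather than a hand-written dictionary.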
Received: 06 September 2015
Published: 08 March 2016