Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model
Jiang Lin1,2, Wang Dongbo3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This study aims to extract domain terms more accurately and conveniently. [Methods] First, we used the continuous bag-of-words (CBOW) model to build a word vector for each component of a candidate term. Then, we applied cosine similarity to measure the internal correlation among each term's individual components. To obtain more representative terms, we used the PageRank algorithm to rank the candidates. [Results] Using paper abstracts in the field of natural language processing as the training corpus, the method achieved high recall and precision. [Limitations] The training corpus was relatively small, which might influence the results. [Conclusions] This study shows that the CBOW model is an appropriate method for extracting domain terminology.
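The pipeline in the abstract (component vectors, cosine similarity among components, PageRank over candidates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three-dimensional vectors in `VECTORS` are hand-written stand-ins for trained CBOW embeddings, the internal correlation is taken as the mean pairwise cosine similarity, and the candidate graph weights edges by the cosine similarity of averaged component vectors. The authors' actual scoring and graph construction may differ.

```python
import math
from itertools import combinations

# Hypothetical toy vectors standing in for trained CBOW embeddings (assumption).
VECTORS = {
    "machine": [0.9, 0.1, 0.2],
    "translation": [0.8, 0.3, 0.1],
    "language": [0.7, 0.2, 0.3],
    "model": [0.6, 0.4, 0.2],
    "this": [0.1, 0.9, 0.1],
    "paper": [0.2, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def internal_correlation(term):
    """Mean pairwise cosine similarity among a term's components (assumed measure)."""
    pairs = list(combinations(term.split(), 2))
    if not pairs:
        return 0.0
    return sum(cosine(VECTORS[a], VECTORS[b]) for a, b in pairs) / len(pairs)


def term_vector(term):
    """Average of the component vectors, used only to link candidates (assumption)."""
    words = term.split()
    dim = len(VECTORS[words[0]])
    return [sum(VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank over a weighted graph given as a dict of dicts."""
    nodes = list(weights)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_sum = {n: sum(weights[n].values()) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(
                rank[m] * weights[m].get(n, 0.0) / out_sum[m]
                for m in nodes
                if out_sum[m] > 0
            )
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


candidates = ["machine translation", "language model", "this paper"]

# Step 1: score how tightly each candidate's components cohere.
scores = {t: internal_correlation(t) for t in candidates}

# Step 2: link candidates by similarity and rank them with PageRank.
graph = {
    a: {b: cosine(term_vector(a), term_vector(b)) for b in candidates if b != a}
    for a in candidates
}
ranks = pagerank(graph)
ranking = sorted(candidates, key=ranks.get, reverse=True)
```

With these toy vectors, the non-term "this paper" receives both the lowest internal correlation and the lowest rank. In practice the vectors would come from a CBOW model trained on the domain corpus rather than from a hand-written table.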
姜霖,王东波. 采用连续词袋模型(CBOW)的领域术语自动抽取研究*[J]. 现代图书情报技术, 2016, 32(2): 9-15.
Jiang Lin, Wang Dongbo. Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model[J]. New Technology of Library and Information Service, 2016, 32(2): 9-15.