New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 9-15    DOI: 10.11925/infotech.1003-3513.2016.02.02
Original Article
Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model
Jiang Lin1,2, Wang Dongbo3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This study aims to extract domain terms more accurately and conveniently. [Methods] First, we used the CBOW model to build word vectors for the components of each candidate term. Then, we applied cosine similarity to measure the internal correlation among each term's components. To obtain more representative terms, we used the PageRank algorithm to rank the candidates. [Results] We obtained high recall and precision using paper abstracts in the field of natural language processing as the training pool. [Limitations] The training pool was relatively small, which might influence the results. [Conclusions] This study shows that the CBOW model is a more appropriate method for terminology extraction.
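The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the CBOW word vectors are already trained and replaces them with hand-written toy vectors. The cohesion measure (mean pairwise cosine similarity), the 0.5 threshold, and the rule linking two candidates that share a component are all hypothetical choices made only for illustration; the abstract does not specify them.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cohesion(components, vectors):
    """Mean pairwise cosine similarity among a candidate's components
    (an assumed stand-in for the 'internal correlation degree')."""
    pairs = list(combinations(components, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency-list graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    new_rank[nb] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Toy vectors standing in for trained CBOW embeddings (hypothetical values).
vectors = {
    "machine": [0.9, 0.1, 0.0],
    "translation": [0.8, 0.2, 0.1],
    "language": [0.7, 0.3, 0.0],
    "model": [0.6, 0.4, 0.1],
    "this": [0.0, 0.1, 0.9],
    "paper": [0.1, 0.9, 0.0],
}

# Candidate terms, each given as a tuple of its components.
candidates = [
    ("machine", "translation"),
    ("translation", "model"),
    ("language", "model"),
    ("this", "paper"),
]

THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper
kept = [c for c in candidates if cohesion(c, vectors) >= THRESHOLD]

# Assumed graph construction: link candidates that share a component.
graph = {" ".join(c): [] for c in kept}
for a, b in combinations(kept, 2):
    if set(a) & set(b):
        graph[" ".join(a)].append(" ".join(b))
        graph[" ".join(b)].append(" ".join(a))

ranks = pagerank(graph)
for term in sorted(ranks, key=ranks.get, reverse=True):
    print(term, round(ranks[term], 3))
```

In practice the toy dictionary would be replaced by vectors trained on the domain corpus, and the graph construction would follow whatever link definition the full paper adopts.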

Key words: Terminology extraction; Neural network; Continuous Bag-of-Words Model
Received: 06 September 2015      Published: 08 March 2016

Cite this article:

Jiang Lin, Wang Dongbo. Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model. New Technology of Library and Information Service, 2016, 32(2): 9-15.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.02.02     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I2/9
