New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 9-15    DOI: 10.11925/infotech.1003-3513.2016.02.02
Original Article
Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model
Jiang Lin1,2, Wang Dongbo3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This study aims to extract domain terms more accurately and conveniently. [Methods] First, we used the CBOW model to build a word vector for each component of a candidate term. Then, we applied cosine similarity to calculate the degree of internal correlation among each term's individual components. To obtain more representative terms, we used the PageRank algorithm to rank the candidates. [Results] We obtained high recall and precision using paper abstracts in the field of natural language processing as the training corpus. [Limitations] The training corpus was relatively small, which might influence the results. [Conclusions] This study shows that the CBOW model is well suited to terminology extraction.
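
The pipeline in the Methods statement can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy three-dimensional vectors stand in for CBOW-trained word vectors, and averaging pairwise cosine similarities into one score is an assumption. So is the rule for linking candidates in the PageRank graph (two candidates are linked when they share a component), since the abstract does not say how the graph is built. All function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def internal_correlation(components, vectors):
    """Mean pairwise cosine similarity among a candidate's components (assumed aggregation)."""
    pairs = list(combinations(components, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def build_graph(candidates):
    """Link two candidates when they share a component (assumed linking rule)."""
    graph = {name: {} for name in candidates}
    for a, b in combinations(candidates, 2):
        shared = len(set(candidates[a]) & set(candidates[b]))
        if shared:
            graph[a][b] = shared
            graph[b][a] = shared
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph {node: {neighbor: weight}}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            total = sum(graph[src].values())
            if total == 0:
                # A node with no links spreads its rank evenly over all nodes.
                for dst in nodes:
                    new_rank[dst] += damping * rank[src] / n
                continue
            for dst, weight in graph[src].items():
                new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return rank

# Toy vectors standing in for CBOW output (illustrative values only).
vectors = {
    "language": [0.9, 0.1, 0.0],
    "model": [0.8, 0.3, 0.1],
    "processing": [0.7, 0.2, 0.2],
    "this": [0.0, 0.2, 0.9],
    "paper": [0.1, 0.9, 0.3],
}
candidates = {
    "language model": ["language", "model"],
    "language processing": ["language", "processing"],
    "this paper": ["this", "paper"],
}

scores = {name: internal_correlation(parts, vectors) for name, parts in candidates.items()}
ranks = pagerank(build_graph(candidates))
for name in sorted(candidates, key=ranks.get, reverse=True):
    print(f"{name}: correlation={scores[name]:.3f}, rank={ranks[name]:.3f}")
```

In this toy setup, a candidate whose components point in similar directions gets a higher internal correlation score, and a candidate that shares components with others gets a higher rank. A real run would replace `vectors` with vectors trained on the domain corpus.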

Key words: Terminology extraction; Neural network; Continuous Bag-of-Words Model
Received: 06 September 2015      Published: 08 March 2016

Cite this article:

Jiang Lin, Wang Dongbo. Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model. New Technology of Library and Information Service, 2016, 32(2): 9-15.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.02.02     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I2/9
