Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model<sup>*</sup>

doi:10.11925/infotech.1003-3513.2016.02.02

New Technology of Library and Information Service (现代图书情报技术), 2016, Vol. 32, Issue 2: 9-15

Research Paper

Jiang Lin¹,², Wang Dongbo³

¹School of Information Management, Nanjing University, Nanjing 210023, China
²Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
³College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China

Abstract

[Objective] This study aims to extract domain terms automatically with greater accuracy and convenience. [Methods] We use the CBOW model to compute vector representations of the word components that make up a term, and measure the degree of association among a term's components by the cosine similarity of their word vectors. We then apply the PageRank algorithm to score and rank the domain representativeness of the candidates, and set a threshold to extract the more representative terms. [Results] In experiments on a dataset of paper abstracts from the field of natural language processing, the method achieved relatively high precision and recall. [Limitations] The training set was relatively small, and the quality of training directly affects the experimental results. [Conclusions] The results indicate that using the CBOW model for term extraction is a reasonable and feasible approach.

Key words: Terminology extraction, Neural network, Continuous Bag-of-Words (CBOW) model
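The pipeline summarized in the abstract (component vectors, cosine similarity between components, PageRank ranking, threshold) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for trained CBOW embeddings, and the cohesion filter, the centroid-based candidate graph, the threshold values, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math

# Toy 3-dimensional vectors standing in for trained CBOW embeddings.
# The words and numbers are hypothetical; real vectors would come from a
# word2vec model trained on the domain corpus.
VECTORS = {
    "machine": [0.9, 0.1, 0.2],
    "translation": [0.8, 0.3, 0.1],
    "language": [0.7, 0.2, 0.3],
    "model": [0.6, 0.4, 0.2],
    "this": [0.1, 0.9, 0.1],
    "paper": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cohesion(components):
    """Mean cosine similarity between adjacent components of a candidate.

    Single-component candidates have no adjacent pair and score 0.0.
    """
    pairs = list(zip(components, components[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(VECTORS[a], VECTORS[b]) for a, b in pairs) / len(pairs)


def centroid(components):
    """Average the component vectors to represent a whole candidate."""
    vecs = [VECTORS[c] for c in components]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a dense weight matrix."""
    n = len(weights)
    out = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * weights[i][j] / out[i] for i in range(n) if out[i])
            for j in range(n)
        ]
    return rank


def extract_terms(candidates, cohesion_min=0.9, rank_ratio=0.9):
    """Keep cohesive candidates, rank them with PageRank, apply a threshold.

    Assumptions: edges between candidates are weighted by the cosine of their
    centroids, and the cutoff is rank_ratio times the mean PageRank score.
    """
    kept = [c for c in candidates if cohesion(c) >= cohesion_min]
    if not kept:
        return []
    cents = [centroid(c) for c in kept]
    weights = [
        [0.0 if i == j else max(cosine(a, b), 0.0) for j, b in enumerate(cents)]
        for i, a in enumerate(cents)
    ]
    ranks = pagerank(weights)
    cutoff = rank_ratio * sum(ranks) / len(ranks)
    scored = sorted(zip(kept, ranks), key=lambda pair: pair[1], reverse=True)
    return [(cand, score) for cand, score in scored if score >= cutoff]
```

Calling `extract_terms([("machine", "translation"), ("language", "model"), ("this", "paper")])` returns the surviving candidates with their scores, highest first. In this toy data the pair `("this", "paper")` has low internal cohesion and is dropped before ranking. Both thresholds would need tuning on real data.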

Received: 2015-09-06  Published: 2016-03-08

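Funding: *This work is one of the research outcomes of the Nanjing Agricultural University Humanities and Social Sciences Research Fund project "Construction of a Chunk-Level Chinese-English Parallel Corpus for the Humanities and Social Sciences and Research on Knowledge Mining" (No. SK2013023) and the National Natural Science Foundation of China project "Construction of a CSSCI-Based Syntax-Level Chinese-English Parallel Corpus and Research on Knowledge Mining" (No. 71303120).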

Cite this article:

Jiang Lin, Wang Dongbo. Automatic Extraction of Domain Terms Using Continuous Bag-of-Words Model[J]. New Technology of Library and Information Service, 2016, 32(2): 9-15.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.02.02 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I2/9
