Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 76-82     https://doi.org/10.11925/infotech.2096-3467.2018.0684
Research Paper
Vocabulary Optimization of Neural Machine Translation for Scientific and Technical Document
Qingmin Liu1, Changqing Yao1, Chongde Shi1, Xiaojie Wen2, Yueying Sun1
1Institute of Scientific and Technical Information of China, Beijing 100038, China
2Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100032, China
Abstract

[Objective] This paper addresses the limited-vocabulary problem in Neural Machine Translation (NMT) of scientific and technical documents by proposing a vocabulary optimization method that improves translation quality. [Methods] Drawing on the word-formation patterns of scientific and technical terms and on Point-wise Mutual Information (PMI), the method optimizes the NMT vocabulary while keeping the sememes of each word intact, thereby reducing the number of out-of-vocabulary words. [Results] Experiments were run on the NTCIR-2010 patent corpus and on a corpus of journal article abstracts from the automation and computer domains. Compared with ordinary word segmentation and sub-word segmentation, the proposed method proved effective. [Limitations] Only the optimization of Chinese characters was considered. [Conclusions] For Chinese scientific and technical documents, vocabulary optimization based on the word formation of scientific terms improves translation performance.

Keywords: Neural Machine Translation; Scientific and Technical Document; Out of Vocabulary
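The abstract names Point-wise Mutual Information (PMI) as one ingredient of the method but does not give the procedure. As a rough illustration only, the sketch below computes the textbook quantity PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) over adjacent units and splits a word wherever the PMI of two neighbours falls below a cut-off. The function names `pmi_table` and `split_by_pmi`, the character-level units, and the threshold value are all assumptions made for this example; they are not taken from the paper, and the paper's word-formation rules are not modelled here.

```python
import math
from collections import Counter


def pmi_table(sequences):
    """Return PMI (base 2) for every adjacent pair of units in the corpus.

    `sequences` is a list of unit lists, e.g. one list of characters per word.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    unigrams = Counter()
    bigrams = Counter()
    for units in sequences:
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def split_by_pmi(units, table, threshold):
    """Merge neighbours whose PMI reaches `threshold`; cut everywhere else.

    Pairs never seen in the corpus are treated as having PMI of -infinity,
    so they are always cut. The threshold is an assumed tuning parameter.
    """
    if not units:
        return []
    pieces = [units[0]]
    for prev, cur in zip(units, units[1:]):
        if table.get((prev, cur), float("-inf")) >= threshold:
            pieces[-1] += cur
        else:
            pieces.append(cur)
    return pieces


# Toy corpus: three words given as character lists.
corpus = [list("神经网络"), list("神经元"), list("网络层")]
table = pmi_table(corpus)
pieces = split_by_pmi(list("神经网络"), table, threshold=2.0)
```

On this toy corpus the strongly associated pairs stay together and the weaker boundary between them is cut. A real system would estimate the counts from a large domain corpus and would need smoothing for rare pairs, which this sketch omits.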
Received: 2018-06-28      Published: 2019-04-17
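Funding: This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Entity Recognition and Relation Extraction for Science and Technology Monitoring" (Grant No. 71403257), the Institute of Scientific and Technical Information of China key work project "Construction of Bilingual Resources and Development of a Translation Engine for Japanese-Chinese Machine Translation" (Grant No. ZD2017-4), and the Institute of Scientific and Technical Information of China Innovation Research Fund project "Analysis of Out-of-Vocabulary Words in Neural Machine Translation Based on Context Information" (Grant No. QN2018-06).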
Cite this article:
Qingmin Liu, Changqing Yao, Chongde Shi, Xiaojie Wen, Yueying Sun. Vocabulary Optimization of Neural Machine Translation for Scientific and Technical Document. Data Analysis and Knowledge Discovery, 2019, 3(3): 76-82.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0684      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I3/76
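Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn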