Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 76-82    DOI: 10.11925/infotech.2096-3467.2018.0684
Vocabulary Optimization of Neural Machine Translation for Scientific and Technical Document
Qingmin Liu1, Changqing Yao1, Chongde Shi1, Xiaojie Wen2, Yueying Sun1
1Institute of Scientific and Technical Information of China, Beijing 100038, China
2Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100032, China
Abstract  

[Objective] This paper optimizes the vocabulary of Neural Machine Translation (NMT) for the scientific and technical domain, addressing the limited-vocabulary problem and improving translation performance. [Methods] Based on word formation and Point-wise Mutual Information (PMI), the paper proposes a method that optimizes the vocabulary while preserving the semantic integrity of lexical units, thereby reducing the number of unknown words. [Results] Experiments used the NTCIR-2010 corpus and abstracts of journal articles in the automation and computer domains. Compared with the word segmentation method and the sub-word method, the results demonstrated the effectiveness of the proposed method. [Limitations] This paper does not cover the optimization of non-Chinese characters. [Conclusions] The experiments show that, in the scientific and technical domain, the vocabulary optimization algorithm based on scientific word formation achieves better translation performance.
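
The abstract names PMI as one ingredient of the method but does not spell out how it is applied. As a rough illustration only, the sketch below computes the standard PMI, log2(p(x, y) / (p(x) p(y))), for adjacent token pairs in a pre-segmented corpus. The function name `pmi_scores` and the toy English tokens are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return the PMI of every adjacent token pair in pre-segmented sentences.

    Hypothetical helper for illustration; the paper's actual procedure
    (including how word formation rules are combined with PMI) is not shown.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[x] / n_uni    # marginal probability of x
        p_y = unigrams[y] / n_uni    # marginal probability of y
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (assumed data, not from the paper).
corpus = [["neural", "network"], ["neural", "network"], ["deep", "model"]]
scores = pmi_scores(corpus)
```

A pair with high PMI co-occurs more often than chance would predict, which is one plausible signal for keeping the two units together as a single vocabulary entry rather than splitting them; any cut-off value would have to be tuned on real data.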

Key words: Neural Machine Translation; Scientific and Technical Document; Out of Vocabulary
Received: 28 June 2018      Published: 17 April 2019

Cite this article:

Qingmin Liu, Changqing Yao, Chongde Shi, Xiaojie Wen, Yueying Sun. Vocabulary Optimization of Neural Machine Translation for Scientific and Technical Document. Data Analysis and Knowledge Discovery, 2019, 3(3): 76-82.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0684     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I3/76
