Vocabulary Optimization of Neural Machine Translation for Scientific and Technical Document
Qingmin Liu¹, Changqing Yao¹, Chongde Shi¹, Xiaojie Wen², Yueying Sun¹
¹ Institute of Scientific and Technical Information of China, Beijing 100038, China
² Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100032, China
Abstract [Objective] This paper optimizes the vocabulary of neural machine translation (NMT) for the scientific and technical domain, addressing the limited-vocabulary problem and improving translation performance. [Methods] Drawing on word formation and point-wise mutual information (PMI), the paper proposes a vocabulary optimization method that preserves the semantic integrity of lexical units and reduces the number of unknown words. [Results] Experiments were run on the NTCIR-2010 corpus and on abstracts of journal articles in automation and computer science. Comparison with the word segmentation method and the subword method demonstrated the effectiveness of the proposed method. [Limitations] The optimization of non-Chinese characters is not covered. [Conclusions] The experiments show that, in the scientific and technical domain, the vocabulary optimization algorithm based on scientific word formation achieves better translation performance than the word segmentation and subword baselines.
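The abstract names PMI as one ingredient of the method but does not spell out how it is used. As a minimal sketch of the standard definition only, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), the snippet below scores adjacent token pairs in pre-segmented text. The function name `pmi_scores`, the toy sentences, and the choice of adjacent-pair counting are assumptions made for illustration; this is not the authors' algorithm, and it omits the word-formation rules the paper relies on.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return the PMI of every adjacent token pair in pre-segmented sentences.

    Hypothetical helper: probabilities are plain relative frequencies,
    with no smoothing and no frequency cutoff.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (assumed data): 神经 "neural", 网络 "network",
# 模型 "model", 翻译 "translation".
sentences = [
    ["神经", "网络", "模型"],
    ["神经", "网络", "翻译"],
    ["翻译", "模型"],
]
scores = pmi_scores(sentences)
# Pairs that co-occur more often than chance score higher,
# which makes them candidates for merging into one vocabulary unit.
best_pair = max(scores, key=scores.get)
```

In this toy corpus the pair (神经, 网络) receives the highest score because the two tokens always appear together. A real system would apply a threshold and combine the score with other evidence before merging units.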
Received: 28 June 2018
Published: 17 April 2019