1 Institute of Scientific and Technical Information of China, Beijing 100038, China
2 Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100032, China
[Objective] This paper optimizes the vocabulary of Neural Machine Translation (NMT) in the scientific and technical domain to address the limited-vocabulary problem and improve translation performance. [Methods] Drawing on word formation and Point-wise Mutual Information (PMI), the paper proposes a vocabulary optimization method that preserves the semantic integrity of words while reducing the number of unknown words. [Results] Experiments were run on the NTCIR-2010 corpus and on abstracts of journal articles in the automation and computer domains. Compared with the word segmentation method and the sub-word method, the proposed method proved effective. [Limitations] The optimization of non-Chinese characters is not covered. [Conclusions] The experiments show that, in the scientific and technical domain, the vocabulary optimization algorithm based on scientific word formation achieves better translation performance.