Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge
Wei Tingxin1,2, Bai Wenlei3, Qu Weiguang2,4
1 International College for Chinese Studies, Nanjing Normal University, Nanjing 210097, China
2 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
3 State Grid Nari Group Corporation, Nanjing 210003, China
4 School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China
[Objective] This paper combines word embedding with word-internal semantic knowledge to improve sense prediction for Chinese out-of-vocabulary (OOV) words. [Methods] First, we crawled webpages containing OOV words. Then, we trained Word2Vec and other embedding models on the retrieved corpus. Finally, we improved the precision of OOV sense prediction by applying semantic knowledge of word formation, such as head (center) filtering and part-of-speech (POS) filtering. [Results] We evaluated the method on datasets from the People’s Daily, where it achieved 87.5% precision on OOV sense prediction. This result was much better than those of models using only word embedding or only semantic knowledge. [Limitations] The proposed model could not effectively predict the senses of semantically opaque OOV words. [Conclusions] Combining external and internal information (i.e., word embedding and semantic knowledge) substantially improves sense prediction for OOV words.
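The pipeline in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors, POS tags, and sense labels below are toy values invented for illustration, the head is assumed to be the final character of the word, and the helper names `cosine` and `predict_sense` are hypothetical. In practice the vectors would come from an embedding model trained on the crawled corpus, and the sense labels from a thesaurus.

```python
import math
from collections import Counter

# Toy embeddings (invented values). The first entry plays the role of the OOV word.
VECTORS = {
    "网约车": [0.9, 0.1, 0.0],    # OOV word: ride-hailing car
    "出租车": [0.8, 0.2, 0.1],    # taxi
    "公交车": [0.7, 0.3, 0.0],    # bus
    "打车":   [0.85, 0.1, 0.05],  # to hail a cab
    "司机":   [0.6, 0.1, 0.5],    # driver
}
# Assumed POS tags (n = noun, v = verb); the OOV tag would come from a tagger.
POS = {"网约车": "n", "出租车": "n", "公交车": "n", "打车": "v", "司机": "n"}
# Hypothetical sense labels standing in for thesaurus categories.
SENSE = {"出租车": "vehicle", "公交车": "vehicle", "打车": "take_ride", "司机": "person"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_sense(oov, k=2, use_filters=True):
    """Vote over the senses of the k known words closest to the OOV word."""
    candidates = [w for w in SENSE if w != oov]
    if use_filters:
        # Assumed filters: same POS and same head (final) character as the OOV word.
        filtered = [w for w in candidates
                    if POS[w] == POS[oov] and w[-1] == oov[-1]]
        # Fall back to the unfiltered list if the filters remove everything.
        candidates = filtered or candidates
    ranked = sorted(candidates,
                    key=lambda w: cosine(VECTORS[oov], VECTORS[w]),
                    reverse=True)
    votes = Counter(SENSE[w] for w in ranked[:k])
    return votes.most_common(1)[0][0]


print(predict_sense("网约车", k=1, use_filters=False))
print(predict_sense("网约车", k=2, use_filters=True))
```

In this toy setup the nearest neighbor by embedding alone is a verb, so the unfiltered prediction picks up the wrong sense category, while the POS and head filters restrict the vote to nouns that share the head character. This mirrors, under the stated assumptions, why combining the two information sources can help.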
Wei Tingxin, Bai Wenlei, Qu Weiguang. Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 109-117.