Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge
Wei Tingxin1,2, Bai Wenlei3, Qu Weiguang2,4
1 International College for Chinese Studies, Nanjing Normal University, Nanjing 210097, China
2 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
3 State Grid Nari Group Corporation, Nanjing 210003, China
4 School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China
Abstract [Objective] This paper combines word embeddings with semantic knowledge of word formation to improve sense prediction for Chinese out-of-vocabulary (OOV) words. [Methods] First, we crawled webpages containing OOV words. Then, we trained Word2Vec and other embedding models on the retrieved corpus. Finally, we improved the precision of OOV sense prediction by applying semantic knowledge of word formation, such as centro filtering and part-of-speech (POS) filtering. [Results] We evaluated our method on datasets from the People’s Daily, where it achieved 87.5% precision on OOV sense prediction, much better than models that use only word embeddings or only semantic knowledge. [Limitations] The proposed model could not effectively predict the senses of semantically opaque OOV words. [Conclusions] Combining external and internal information (i.e., word embeddings and semantic knowledge) could remarkably improve sense prediction for OOV words.
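
The pipeline summarized in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the toy vectors stand in for trained embeddings, the lexicon and its sense labels are hypothetical, and "centro filtering" is assumed here to mean keeping candidates that share the OOV word's final character. All function and variable names are illustrative.

```python
import math
from collections import Counter

# Toy vectors standing in for embeddings trained on a crawled corpus (assumption).
EMBEDDINGS = {
    "网民": [0.90, 0.10, 0.00],  # the OOV word in this illustration
    "市民": [0.80, 0.20, 0.10],
    "居民": [0.85, 0.15, 0.05],
    "网络": [0.70, 0.10, 0.60],
    "上网": [0.60, 0.00, 0.70],
}

# Hypothetical lexicon of known words: word -> (sense class, POS tag).
LEXICON = {
    "市民": ("person", "n"),
    "居民": ("person", "n"),
    "网络": ("artifact", "n"),
    "上网": ("activity", "v"),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_sense(oov, oov_pos, top_k=3):
    """Predict a sense class for an OOV word from its nearest known neighbours."""
    target = EMBEDDINGS[oov]
    # Step 1: rank known words by embedding similarity and keep the top k.
    ranked = sorted(LEXICON, key=lambda w: cosine(target, EMBEDDINGS[w]), reverse=True)
    candidates = ranked[:top_k]
    # Step 2: POS filtering; fall back to all candidates if none match.
    pos_ok = [w for w in candidates if LEXICON[w][1] == oov_pos] or candidates
    # Step 3: assumed centro filtering (same final character), with the same fallback.
    head_ok = [w for w in pos_ok if w[-1] == oov[-1]] or pos_ok
    # Step 4: majority vote over the sense classes of the surviving candidates.
    votes = Counter(LEXICON[w][0] for w in head_ok)
    return votes.most_common(1)[0][0]


print(predict_sense("网民", "n"))
```

The fallbacks in steps 2 and 3 are a design choice of this sketch: when a filter would discard every candidate, the embedding neighbours alone decide the vote.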
Received: 25 March 2019
Published: 07 July 2020
Corresponding Author: Qu Weiguang
E-mail: wgqu_nj@163.com