Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 109-117    DOI: 10.11925/infotech.2096-3467.2019.0321
Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge
Wei Tingxin (1,2), Bai Wenlei (3), Qu Weiguang (2,4)
(1) International College for Chinese Studies, Nanjing Normal University, Nanjing 210097, China
(2) School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
(3) State Grid Nari Group Corporation, Nanjing 210003, China
(4) School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China
Abstract  

[Objective] This paper applies word embeddings and word-formation semantic knowledge to improve sense prediction for Chinese out-of-vocabulary (OOV) words. [Methods] First, we crawled webpages containing the OOV words. Then, we trained Word2Vec and other embedding models on the retrieved corpus. Finally, we improved the precision of OOV sense prediction using semantic knowledge of word formation, namely head-character ("centro") filtering and part-of-speech (POS) filtering. [Results] Evaluated on datasets from the People's Daily, the method achieved 87.5% precision on OOV sense prediction, much better than models that use only word embeddings or only semantic knowledge. [Limitations] The proposed model could not effectively predict the senses of semantically opaque OOV words. [Conclusions] Combining external and internal information (i.e., word embeddings and semantic knowledge) substantially improves sense prediction for OOV words.

Keywords: OOV; Word Embedding; Semantic Knowledge; Sense Prediction
Received: 25 March 2019      Published: 07 July 2020
ZTFLH:  TP391  
Corresponding Authors: Qu Weiguang     E-mail: wgqu_nj@163.com

Cite this article:

Wei Tingxin, Bai Wenlei, Qu Weiguang. Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge. Data Analysis and Knowledge Discovery, 2020, 4(6): 109-117.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0321     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I6/109

Flow Diagram of the Experiment
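| Code position | Example symbol | Symbol type | Level |
|---|---|---|---|
| 1 | B | Major class | Level 1 |
| 2 | i | Medium class | Level 2 |
| 3 | 1 | Minor class | Level 3 |
| 4 | 8 | | |
| 5 | A | Word group | Level 4 |
| 6 | 0 | Word set | Level 5 |
| 7 | 1 | | |
| 8 | = # @ | | |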
Code Format of Cilin
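The code layout in the table above can be illustrated with a small parser. This is a minimal sketch, not part of the original article: the names `CilinCode`, `parse_cilin`, and `level_prefix` are hypothetical, and the sample code `Bi18A01=` is simply assembled from the example symbols in the table.

```python
from typing import NamedTuple


class CilinCode(NamedTuple):
    """An 8-character Cilin code split into its five levels plus the marker."""
    major: str       # position 1, level 1
    medium: str      # position 2, level 2
    minor: str       # positions 3-4, level 3
    word_group: str  # position 5, level 4
    word_set: str    # positions 6-7, level 5
    marker: str      # position 8, one of '=', '#', '@'


# Number of leading characters that identify a class at each level.
_PREFIX_LENGTH = {1: 1, 2: 2, 3: 4, 4: 5, 5: 7}


def parse_cilin(code: str) -> CilinCode:
    """Split a full Cilin code into its fields, validating length and marker."""
    if len(code) != 8:
        raise ValueError(f"expected 8 characters, got {len(code)}: {code!r}")
    if code[7] not in "=#@":
        raise ValueError(f"unexpected marker {code[7]!r} in {code!r}")
    return CilinCode(code[0], code[1], code[2:4], code[4], code[5:7], code[7])


def level_prefix(code: str, level: int) -> str:
    """Return the part of the code that names the class at the given level."""
    return code[:_PREFIX_LENGTH[level]]


parsed = parse_cilin("Bi18A01=")
print(parsed)
print(level_prefix("Bi18A01=", 3))  # the four-character, level-3 form
```

The four-character labels that appear in the candidate tables on this page (for example `Bo30` or `Dl01`) have the same shape as the level-3 prefix returned by `level_prefix(code, 3)`.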
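| Candidate word | Semantic class | Candidate word | Semantic class |
|---|---|---|---|
| 航母 (aircraft carrier) | Bo30 | 海军 (navy) | Di11 |
| | Dn08 | 服役 (serve, be in service) | Hj22 |
| 驱逐舰 (destroyer) | Bo30 | 一艘 (one [vessel]) | Null |
| | Bo30 | 舰艇 (naval vessels) | Bo30 |
| 富池 (Fuchi) | Null | 两栖舰 (amphibious ship) | Bo30 |
| 富池级 (Fuchi class) | Null | 该舰 (this ship) | Null |
| 指挥舰 (command ship) | Null | 潜艇 (submarine) | Bo30 |
| 远洋 (ocean-going) | Be05 | 补给船 (supply vessel) | Bo22 |
| 编队 (formation) | Dd07 | 舾装 (outfitting) | Ba05 |
| 护卫舰 (frigate) | Bo30 | 舰队 (fleet) | Di11 |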
20 Most Similar Words to ‘Supply Ship’
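To illustrate how head-character (centro) filtering might narrow such a top-20 list, here is a minimal sketch that keeps only candidates sharing the target's final character. Several things are assumptions rather than facts from the page: the target string `补给舰` (the page gives only the English gloss 'Supply Ship'), the choice of the final character as the head, and the function name `filter_by_last_char`. The two rows whose candidate word is missing in the table are left out.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, Optional[str]]  # (word, Cilin class or None for "Null")

# Rows copied from the table above, read left to right, top to bottom.
# Two rows have no candidate word on the page and are omitted here.
TOP_SIMILAR: List[Candidate] = [
    ("航母", "Bo30"), ("海军", "Di11"), ("服役", "Hj22"),
    ("驱逐舰", "Bo30"), ("一艘", None), ("舰艇", "Bo30"),
    ("富池", None), ("两栖舰", "Bo30"), ("富池级", None),
    ("该舰", None), ("指挥舰", None), ("潜艇", "Bo30"),
    ("远洋", "Be05"), ("补给船", "Bo22"), ("编队", "Dd07"),
    ("舾装", "Ba05"), ("护卫舰", "Bo30"), ("舰队", "Di11"),
]


def filter_by_last_char(target: str, candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates whose final character equals the target's final character."""
    head = target[-1]
    return [(word, sense) for word, sense in candidates if word.endswith(head)]


# ASSUMPTION: the OOV target is written 补给舰; the page does not show it.
kept = filter_by_last_char("补给舰", TOP_SIMILAR)
for word, sense in kept:
    print(word, sense)
```

Under these assumptions the filter discards loosely related words such as 海军 or 编队 and leaves five candidates ending in 舰, three of which carry the class Bo30.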
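| Candidate word | Semantic class | Candidate word | Semantic class |
|---|---|---|---|
| 癌症 (cancer) | Dl01 | 综合症 (syndrome) | Dl01 |
| 老年痴呆症 (senile dementia) | Dl01 | 躁郁症 (bipolar disorder) | Null |
| 抑郁症 (depression) | Dl01 | 失智症 (dementia) | Null |
| 阿兹海默症 (Alzheimer's disease) | Null | 能症 | Null |
| 阿尔兹海默氏症 (Alzheimer's disease) | Null | | Dl01 |
| 精神分裂症 (schizophrenia) | Null | 抽动症 (tic disorder) | Null |
| 尿毒症 (uremia) | Dl01 | | |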
The Candidate Set of Similar Words to ‘Dementia’
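One simple way to turn a candidate set like this into a single predicted class is a majority vote over the candidates that have a Cilin label. The page does not say how the final class is chosen, so the voting rule and the name `vote_sense` are assumptions made only for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple

Candidate = Tuple[Optional[str], Optional[str]]  # (word, Cilin class or None)

# Rows copied from the table above. One row has a class but no word on the
# page, so its word is recorded as None.
DEMENTIA_CANDIDATES: List[Candidate] = [
    ("癌症", "Dl01"), ("综合症", "Dl01"),
    ("老年痴呆症", "Dl01"), ("躁郁症", None),
    ("抑郁症", "Dl01"), ("失智症", None),
    ("阿兹海默症", None), ("能症", None),
    ("阿尔兹海默氏症", None), (None, "Dl01"),
    ("精神分裂症", None), ("抽动症", None),
    ("尿毒症", "Dl01"),
]


def vote_sense(candidates: List[Candidate]) -> Optional[str]:
    """Return the most frequent class among labeled candidates, or None."""
    labels = [sense for _, sense in candidates if sense is not None]
    if not labels:
        return None  # every candidate is itself unlabeled ("Null")
    return Counter(labels).most_common(1)[0][0]


print(vote_sense(DEMENTIA_CANDIDATES))
```

In this table every labeled candidate carries the same class, so any reasonable aggregation rule would give the same answer; the vote only matters when the labeled candidates disagree.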
| Candidate word | Semantic class |
|---|---|
| 善款 (charitable funds) | Dj08 |
| 筹款 (raise funds) | Hj31 |
| 捐款 (donation) | Dj08 |
The Candidate Set of Similar Words to ‘Fundraising’
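The candidates above split between two classes, which is where a part-of-speech filter could help. The sketch below assumes the commonly described organization of Cilin, in which major classes A to D hold nouns, E holds adjectives, and F to J hold verbs; this mapping, the tags `n`/`a`/`v`, and the names `coarse_pos` and `filter_by_pos` are not taken from the page and should be read as assumptions.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, str]  # (word, Cilin class)

# Rows copied from the table above.
FUNDRAISING_CANDIDATES: List[Candidate] = [
    ("善款", "Dj08"),
    ("筹款", "Hj31"),
    ("捐款", "Dj08"),
]


def coarse_pos(cilin_class: str) -> Optional[str]:
    """Map a Cilin class to a coarse POS tag via its major-class letter.

    ASSUMPTION: A-D are nouns, E adjectives, F-J verbs; other letters
    are left unmapped.
    """
    letter = cilin_class[0]
    if letter in "ABCD":
        return "n"
    if letter == "E":
        return "a"
    if letter in "FGHIJ":
        return "v"
    return None


def filter_by_pos(candidates: List[Candidate], target_pos: str) -> List[Candidate]:
    """Keep candidates whose class is compatible with the target's POS tag."""
    return [(w, c) for w, c in candidates if coarse_pos(c) == target_pos]


# If the OOV word were tagged as a verb, only the verb-class candidate survives.
print(filter_by_pos(FUNDRAISING_CANDIDATES, "v"))
# If it were tagged as a noun, the two Dj08 candidates survive instead.
print(filter_by_pos(FUNDRAISING_CANDIDATES, "n"))
```

Which tag applies to the actual target is not stated on the page, so both outcomes are shown.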
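| Model | Words returned with a sense | Correct | Precision (%) |
|---|---|---|---|
| GloVe100 | 2,133 | 520 | 24.4 |
| GloVe100 + Centro | 1,448 | 1,135 | 78.4 |
| GloVe200 + Centro | 1,516 | 1,181 | 77.9 |
| CBOW100 | 2,542 | 593 | 23.3 |
| CBOW100 + Centro | 1,617 | 1,328 | 82.1 |
| CBOW200 + Centro | 1,657 | 1,353 | 81.7 |
| Skip-Gram100 | 2,389 | 712 | 29.8 |
| Skip-Gram100 + Centro | 1,607 | 1,353 | 84.2 |
| Skip-Gram200 + Centro | 1,622 | 1,339 | 82.6 |
| ☆ Baseline model | 2,971 | 1,995 | 67.1 |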
Prediction Results of Embedding+Centro Model and Baseline Models
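| Model | Embedding training | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|---|
| Embedding + last character + POS | GloVe100 | 1,095 | 927 | 84.7 | 39.1 | 53.5 |
| | GloVe200 | 1,174 | 970 | 82.6 | 40.9 | 54.7 |
| | CBOW100 | 1,409 | 1,202 | 85.3 | 50.7 | 63.6 |
| | CBOW200 | 1,417 | 1,220 | 86.1 | 51.5 | 64.4 |
| | Skip-Gram100 | 1,371 | 1,224 | 89.3 | 51.6 | 65.4 |
| | Skip-Gram200 | 1,371 | 1,214 | 88.5 | 51.2 | 64.9 |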
Results on Noun Prediction with Different Models
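| Model | Centro | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|---|
| Skip-Gram + Centro + POS | None | 364 | 107 | 29.4 | 18.6 | 22.8 |
| | First character | 148 | 52 | 35.1 | 9.1 | 14.4 |
| | First/last character | 240 | 144 | 60.0 | 25.1 | 35.4 |
| | Last character | 190 | 143 | 75.3 | 24.9 | 37.4 |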
Results on Verb Prediction with Different Models
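| Model | Centro | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|---|
| Skip-Gram + Centro + POS | None | 33 | 5 | 15.2 | 8.9 | 11.2 |
| | First character | 19 | 10 | 52.6 | 17.9 | 26.7 |
| | First/last character | 25 | 12 | 48.0 | 21.4 | 29.6 |
| | Last character | 12 | 8 | 66.7 | 14.3 | 23.5 |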
Results on Adjective Prediction with Different Models
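| Model | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| Embedding + Centro | 1,607 | 1,353 | 84.2 | 45.1 | 58.7 |
| Embedding + Centro + POS | 1,573 | 1,376 | 87.5 | 45.9 | 60.2 |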
Results of Embedding+Centro Model and Embedding+Centro+POS Model
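The precision, recall, and F-score columns can be recomputed from the counts in the table. In the sketch below, precision is correct divided by returned, recall is correct divided by the test-set size, and the F-score is their harmonic mean. The test-set size of 3,000 is an assumption: it is not stated on this page and is only inferred from the reported recall values. The function name `precision_recall_f` is likewise hypothetical.

```python
from typing import Tuple


def precision_recall_f(correct: int, returned: int, total: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) as fractions between 0 and 1."""
    precision = correct / returned if returned else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# ASSUMPTION: 3,000 test words, inferred from the reported recall figures.
ASSUMED_TEST_SIZE = 3000

rows = {
    "Embedding + Centro": (1353, 1607),
    "Embedding + Centro + POS": (1376, 1573),
}
for name, (correct, returned) in rows.items():
    p, r, f = precision_recall_f(correct, returned, ASSUMED_TEST_SIZE)
    print(f"{name}: P={100 * p:.1f} R={100 * r:.1f} F={100 * f:.1f}")
```

With that assumed test-set size, the recomputed percentages round to the values reported in the table, which also shows why adding the POS filter raises precision (fewer words returned, more of them correct) while recall stays low.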
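| Model | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| ☆ Baseline model | 2,971 | 1,995 | 67.1 | 66.5 | 66.8 |
| Our cascade model | 2,975 | 2,186 | 73.5 | 72.9 | 73.2 |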
Results of Our Cascade Model and Baseline Model
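| Category | Count | Share (%) | Examples |
|---|---|---|---|
| Named entities | 24 | 15.3 | 曾侯乙, 鸸鹋 |
| Classical Chinese vocabulary | 27 | 17.2 | 夕曛, 杲杲 |
| Dialect words | 11 | 7.0 | 饸饹, 包谷糁 |
| Contracted words | 20 | 12.7 | 固氦, 冷拼 |
| Letter words and transliterations | 11 | 7.0 | 激肽B, 桑拿 |
| Domain-specific terms | 25 | 15.9 | 胸腺肽, 氧哌嗪 |
| Ad hoc compounds | 22 | 14.0 | 救命楼, 精品展 |
| Other | 17 | 10.8 | 水困, 超凡琦 |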
The OOV Categories that cannot be Predicted by Our Model
[1] Chen H, Lin C. Sense-Tagging Chinese Corpus [C]//Proceedings of the 2nd Workshop on Chinese Language Processing. 2000: 7-14.
[2] 苑春法, 黄昌宁. 基于语素数据库的汉语语素及构词研究[J]. 世界汉语教学, 1998(2):8-13.
[2] ( Yuan Chunfa, Huang Changning. Study on Chinese Morphemes and