Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 109-117    DOI: 10.11925/infotech.2096-3467.2019.0321
Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge
Wei Tingxin1,2, Bai Wenlei3, Qu Weiguang2,4
1International College for Chinese Studies, Nanjing Normal University, Nanjing 210097, China
2School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
3State Grid Nari Group Corporation, Nanjing 210003, China
4School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China
Abstract  

[Objective] This paper applies word embedding and word semantic knowledge to improve sense prediction for Chinese out-of-vocabulary (OOV) words. [Methods] First, we crawled web pages containing the OOV words. Then, we trained Word2Vec and other embedding models on the retrieved corpus. Finally, we improved the precision of OOV sense prediction using semantic knowledge of word formation, namely head-word (centro) filtering and part-of-speech (POS) filtering. [Results] On datasets from the People’s Daily, our method achieved 87.5% precision on OOV sense prediction, substantially better than models that use only word embedding or only semantic knowledge. [Limitations] The proposed model cannot effectively predict the senses of semantically opaque OOV words. [Conclusions] Combining external and internal information (i.e., word embedding and semantic knowledge) markedly improves sense prediction for OOV words.

Key words: OOV; Word Embedding; Semantic Knowledge; Sense Prediction
Received: 25 March 2019      Published: 07 July 2020
CLC Number: TP391
Corresponding Authors: Qu Weiguang     E-mail: wgqu_nj@163.com

Cite this article:

Wei Tingxin, Bai Wenlei, Qu Weiguang. Sense Prediction for Chinese OOV Based on Word Embedding and Semantic Knowledge. Data Analysis and Knowledge Discovery, 2020, 4(6): 109-117.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0321     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I6/109

Flow Diagram of the Experiment
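Code position | Example symbol | Symbol type | Level
1 | B | Major class | Level 1
2 | i | Medium class | Level 2
3 | 1 | Minor class | Level 3
4 | 8 | |
5 | A | Word group | Level 4
6 | 0 | Word set | Level 5
7 | 1 | |
8 | = / # / @ | |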
Code Format of Cilin
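The code format in the table above can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the sample code "Bi18A01=" is simply assembled from the example symbols in the table, and the names CilinCode and parse_cilin_code are hypothetical.

```python
from typing import NamedTuple


class CilinCode(NamedTuple):
    """Prefixes of an 8-position Cilin code, one per level (hypothetical helper)."""
    major: str     # level 1: position 1
    medium: str    # level 2: positions 1-2
    minor: str     # level 3: positions 1-4
    group: str     # level 4: positions 1-5
    word_set: str  # level 5: positions 1-7
    flag: str      # position 8: one of "=", "#", "@"


def parse_cilin_code(code: str) -> CilinCode:
    """Split a full Cilin code into its level prefixes and trailing flag."""
    if len(code) != 8:
        raise ValueError(f"expected 8 positions, got {len(code)}: {code!r}")
    if code[7] not in "=#@":
        raise ValueError(f"unexpected flag symbol: {code[7]!r}")
    return CilinCode(code[0], code[:2], code[:4], code[:5], code[:7], code[7])


# Sample code assembled from the table's example symbols (an assumption).
parsed = parse_cilin_code("Bi18A01=")
print(parsed.minor)  # the level-3 prefix, the same granularity as labels such as Bo30
```

The sense categories listed in the candidate tables (for example Bo30 or Dl01) are four characters long, which corresponds to the level-3 prefix returned in the minor field.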
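Candidate | Sense category | Candidate | Sense category
航母 (aircraft carrier) | Bo30 | 海军 (navy) | Di11
 | Dn08 | 服役 (serve) | Hj22
驱逐舰 (destroyer) | Bo30 | 一艘 (one vessel) | Null
 | Bo30 | 舰艇 (naval vessel) | Bo30
富池 (Fuchi) | Null | 两栖舰 (amphibious ship) | Bo30
富池级 (Fuchi class) | Null | 该舰 (this ship) | Null
指挥舰 (command ship) | Null | 潜艇 (submarine) | Bo30
远洋 (ocean-going) | Be05 | 补给船 (supply ship) | Bo22
编队 (formation) | Dd07 | 舾装 (outfitting) | Ba05
护卫舰 (frigate) | Bo30 | 舰队 (fleet) | Di11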
20 Most Similar Words to ‘Supply Ship’
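The head-word (centro) filtering step named in the abstract can be sketched on the candidates listed above. This is an illustrative sketch under stated assumptions: the page does not give the target word, so 补给舰 is assumed here; the last character is taken as the head; the two rows whose candidate word is missing from the table are left out; and filter_by_last_char is a hypothetical name.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, Optional[str]]  # (word, Cilin category or None for "Null")

# Candidates copied from the table above, row by row (rows with a missing word omitted).
CANDIDATES: List[Candidate] = [
    ("航母", "Bo30"), ("海军", "Di11"), ("服役", "Hj22"), ("驱逐舰", "Bo30"),
    ("一艘", None), ("舰艇", "Bo30"), ("富池", None), ("两栖舰", "Bo30"),
    ("富池级", None), ("该舰", None), ("指挥舰", None), ("潜艇", "Bo30"),
    ("远洋", "Be05"), ("补给船", "Bo22"), ("编队", "Dd07"), ("舾装", "Ba05"),
    ("护卫舰", "Bo30"), ("舰队", "Di11"),
]


def filter_by_last_char(target: str, candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates whose last character equals the target's last character."""
    head = target[-1]
    return [(word, cat) for word, cat in candidates if word and word[-1] == head]


# Assumed target word; the page only gives the English gloss 'Supply Ship'.
kept = filter_by_last_char("补给舰", CANDIDATES)
for word, category in kept:
    print(word, category)
```

Under these assumptions the filter discards distributionally similar but semantically different neighbours such as 海军 (Di11) and 编队 (Dd07), and every remaining candidate that has a category carries Bo30.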
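Candidate | Sense category | Candidate | Sense category
癌症 (cancer) | Dl01 | 综合症 (syndrome) | Dl01
老年痴呆症 (senile dementia) | Dl01 | 躁郁症 (bipolar disorder) | Null
抑郁症 (depression) | Dl01 | 失智症 (dementia) | Null
阿兹海默症 (Alzheimer's disease) | Null | 能症 | Null
阿尔兹海默氏症 (Alzheimer's disease) | Null | | Dl01
精神分裂症 (schizophrenia) | Null | 抽动症 (tic disorder) | Null
尿毒症 (uremia) | Dl01 | |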
The Candidate Set of Similar Words to ‘Dementia’
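One simple way to turn a candidate set like the one above into a single predicted sense is a majority vote over the candidates that have a Cilin category. This page does not state how the authors select the final category, so the voting rule here is an assumption, and predict_sense is a hypothetical name. The row whose candidate word is missing from the table is omitted.

```python
from collections import Counter
from typing import List, Optional, Tuple

Candidate = Tuple[str, Optional[str]]  # (word, Cilin category or None for "Null")

# Candidates copied from the table above (the row with a missing word is omitted).
DEMENTIA_CANDIDATES: List[Candidate] = [
    ("癌症", "Dl01"), ("综合症", "Dl01"), ("老年痴呆症", "Dl01"), ("躁郁症", None),
    ("抑郁症", "Dl01"), ("失智症", None), ("阿兹海默症", None), ("能症", None),
    ("阿尔兹海默氏症", None), ("精神分裂症", None), ("抽动症", None), ("尿毒症", "Dl01"),
]


def predict_sense(candidates: List[Candidate]) -> Optional[str]:
    """Return the most frequent category among candidates, ignoring Null entries.

    Returns None when no candidate has a category, i.e. no sense is predicted.
    Ties are resolved by first occurrence, which is an arbitrary choice.
    """
    votes = Counter(cat for _, cat in candidates if cat is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(predict_sense(DEMENTIA_CANDIDATES))
```

Returning None when every candidate is Null mirrors the distinction the results tables draw between words for which a sense is returned and words for which none is.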
Candidate | Sense category
善款 (charitable donation) | Dj08
筹款 (raise funds) | Hj31
捐款 (donate money) | Dj08
The Candidate Set of Similar Words to ‘Fundraising’
Model | Words returned with a sense | Correct | Precision (%)
GloVe100 | 2,133 | 520 | 24.4
GloVe100 + Centro | 1,448 | 1,135 | 78.4
GloVe200 + Centro | 1,516 | 1,181 | 77.9
CBOW100 | 2,542 | 593 | 23.3
CBOW100 + Centro | 1,617 | 1,328 | 82.1
CBOW200 + Centro | 1,657 | 1,353 | 81.7
Skip-Gram100 | 2,389 | 712 | 29.8
Skip-Gram100 + Centro | 1,607 | 1,353 | 84.2
Skip-Gram200 + Centro | 1,622 | 1,339 | 82.6
☆Baseline model | 2,971 | 1,995 | 67.1
Prediction Results of Embedding+Centro Model and Baseline Models
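Model | Embedding training | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%)
Embedding + last character + POS | GloVe100 | 1,095 | 927 | 84.7 | 39.1 | 53.5
 | GloVe200 | 1,174 | 970 | 82.6 | 40.9 | 54.7
 | CBOW100 | 1,409 | 1,202 | 85.3 | 50.7 | 63.6
 | CBOW200 | 1,417 | 1,220 | 86.1 | 51.5 | 64.4
 | Skip-Gram100 | 1,371 | 1,224 | 89.3 | 51.6 | 65.4
 | Skip-Gram200 | 1,371 | 1,214 | 88.5 | 51.2 | 64.9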
Results on Noun Prediction with Different Models
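Model | Centro | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%)
Skip-Gram + Centro + POS | None | 364 | 107 | 29.4 | 18.6 | 22.8
 | First character | 148 | 52 | 35.1 | 9.1 | 14.4
 | First/last character | 240 | 144 | 60.0 | 25.1 | 35.4
 | Last character | 190 | 143 | 75.3 | 24.9 | 37.4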
Results on Verb Prediction with Different Models
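Model | Centro | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%)
Skip-Gram + Centro + POS | None | 33 | 5 | 15.2 | 8.9 | 11.2
 | First character | 19 | 10 | 52.6 | 17.9 | 26.7
 | First/last character | 25 | 12 | 48.0 | 21.4 | 29.6
 | Last character | 12 | 8 | 66.7 | 14.3 | 23.5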
Results on Adjective Prediction with Different Models
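Model | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%)
Embedding + Centro | 1,607 | 1,353 | 84.2 | 45.1 | 58.7
Embedding + Centro + POS | 1,573 | 1,376 | 87.5 | 45.9 | 60.2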
Results of Embedding+Centro Model and Embedding+Centro+POS Model
Model | Words returned with a sense | Correct | Precision (%) | Recall (%) | F-score (%)
☆Baseline model | 2,971 | 1,995 | 67.1 | 66.5 | 66.8
Our cascade model | 2,975 | 2,186 | 73.5 | 72.9 | 73.2
Results of Our Cascade Model and Baseline Model
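The precision, recall and F-score columns in the tables above can be reproduced from the raw counts. The sketch below assumes the standard definitions (precision = correct / returned, recall = correct / total, F = harmonic mean of the two) and a test set of 3,000 OOV words. That total is not stated on this page; it is inferred from the reported recall values, so treat it as an assumption. The function name prf is hypothetical.

```python
from typing import Tuple

# Assumed test-set size, inferred from the reported recall values (not stated on the page).
TOTAL_OOV_WORDS = 3000


def prf(returned: int, correct: int, total: int = TOTAL_OOV_WORDS) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) as fractions in [0, 1]."""
    precision = correct / returned if returned else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts taken from the table above: (words returned with a sense, correct).
for name, returned, correct in [("Baseline model", 2971, 1995),
                                ("Cascade model", 2975, 2186)]:
    p, r, f = prf(returned, correct)
    print(f"{name}: P={p * 100:.1f}% R={r * 100:.1f}% F={f * 100:.1f}%")
```

The same function applied to the Embedding + Centro + POS row (1,573 returned, 1,376 correct) gives the 87.5% precision quoted in the abstract, alongside a much lower recall, since that model returns a sense for only about half of the words.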
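Category | Count | Share (%) | Examples
Named entities | 24 | 15.3 | 曾侯乙, 鸸鹋
Classical Chinese words | 27 | 17.2 | 夕曛, 杲杲
Dialect words | 11 | 7.0 | 饸饹, 包谷糁
Contracted words | 20 | 12.7 | 固氦, 冷拼
Letter words and transliterations | 11 | 7.0 | 激肽B, 桑拿
Domain-specific terms | 25 | 15.9 | 胸腺肽, 氧哌嗪
Ad hoc compounds | 22 | 14.0 | 救命楼, 精品展
Other | 17 | 10.8 | 水困, 超凡琦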
The OOV Categories that cannot be Predicted by Our Model