Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (1): 113-121    DOI: 10.11925/infotech.2096-3467.2021.0684
Discovering Chinese New Words Based on Multi-sense Word Embedding
Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes a method for discovering Chinese new words based on multi-sense word embedding, aiming to improve word segmentation of social media texts. [Methods] First, we trained the MWEC model on social media texts together with data from Chinese HowNet and a Chinese character stroke database to reduce semantic confusion. Then, we used n-gram frequent string mining to identify highly related sub-word sets and build the candidate set of new words. Finally, we evaluated the candidates with the semantic similarity of multi-sense word embeddings and identified the new words. [Results] We examined the model on datasets from the finance, sports, tourism, and music domains. MWEC improved the F1 score by 2.0, 3.0, 2.6, and 11.3 percentage points respectively compared with existing methods. [Limitations] We generated candidate words based on the frequency of sub-words, which makes it difficult to identify low-frequency new words. [Conclusions] The multi-sense word embedding algorithm can effectively discover new words in Chinese social media texts.
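The candidate-generation step described above (n-gram frequent string mining over segmented sub-words) can be illustrated with a minimal Python sketch. The function name, the n-gram range, and the frequency threshold here are illustrative assumptions, not the paper's exact parameters.

```python
from collections import Counter

def frequent_ngrams(segmented_sentences, max_n=4, min_freq=5):
    """Count contiguous sub-word n-grams and keep the frequent ones.

    `segmented_sentences` is a list of token lists (the output of a
    word segmenter). Frequent n-grams become new-word candidates,
    to be pruned later by semantic similarity.
    """
    counts = Counter()
    for tokens in segmented_sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy corpus: "奢侈品消费" (luxury-goods consumption) co-occurs often,
# so it survives the frequency threshold as a candidate.
corpus = [["奢侈品", "消费"], ["奢侈品", "消费", "后劲"]] * 3
cands = frequent_ngrams(corpus, min_freq=5)
```

Because the threshold is frequency-based, sub-words that rarely co-occur never enter the candidate set, which matches the limitation noted in the abstract.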

Key words: Word Embedding; New Word; Word Segmentation; N-gram; Multi-sense Word Embedding; Semantic Similarity
Received: 07 July 2021      Published: 22 February 2022
CLC Number: TP391
Fund: Natural Science Foundation of Beijing (4212020); Open Project Fund of the Provincial Key Laboratory of Tibetan Intelligent Information Processing / the MOE Key Laboratory of Tibetan Information Processing (2019Z002); National Natural Science Foundation of China (61671070)
Corresponding Author: Lv Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: icddtxyx@163.com

Cite this article:

Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong. Discovering Chinese New Words Based on Multi-sense Word Embedding. Data Analysis and Knowledge Discovery, 2022, 6(1): 113-121.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0684     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I1/113

New Word Discovery Flow Chart
Dataset | Domain  | URL                                       | Size (MB) | Sentences | Sub-words
DF      | Finance | http://finance.sina.com.cn/chanjing/      | 20.1      | 155,168   | 3,475,459
DS      | Sports  | http://sports.sohu.com/guojizuqiu_a.shtml | 4.0       | 34,300    | 724,525
DT      | Tourism | http://www.mafengwo.cn                    | 62.8      | 553,958   | 11,285,076
DM      | Music   | http://music.163.com/                     | 35.0      | 662,640   | 6,818,760
Dataset Statistics
Finance-domain text: 奢侈品消费,将决战于“90后”一代。不可否认,目前中国消费者的奢侈品购买力,虽然仍集中于千万以上资产的人群,但奢侈品消费的“后劲”,则看千禧一代。
Sentence 1: 奢侈品消费,将决战于“90后”一代。
  Segmentation: 奢侈品 / 消费 / 将 / 决战 / 于 / 90 / 后 / 一代
  Annotated new words: 奢侈品消费
Sentence 2: 不可否认,目前中国消费者的奢侈品购买力,虽然仍集中于千万以上资产的人群,但奢侈品消费的“后劲”,则看千禧一代。
  Segmentation: 不可否认 / 目前 / 中国 / 消费者 / 的 / 奢侈品 / 购买力 / 虽然 / 仍 / 集中 / 于 / 千万 / 以上 / 资产 / 的 / 人群 / 但 / 奢侈品 / 消费 / 的 / 后劲 / 则 / 看 / 千禧 / 一代
  Annotated new words: 中国消费者, 奢侈品消费, 千禧一代
Examples of Data Annotation
Frequency Distribution of N-gram String
Domain  | Cosine similarity | Euclidean distance | Manhattan distance
Finance | 0.702             | 0.659              | 0.670
Sports  | 0.692             | 0.603              | 0.628
Tourism | 0.480             | 0.473              | 0.473
Music   | 0.531             | 0.441              | 0.476
F1 Scores of the Candidate-Pruning Experiment
Comparing the Performance of Different Similarity Metrics in Candidate Pruning
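The three metrics compared above can be sketched in plain Python. Converting the two distances into a (0, 1] similarity score via 1 / (1 + d) is an assumption for illustration; the paper does not specify its exact normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclidean_sim(u, v):
    """Euclidean distance mapped into (0, 1] via 1 / (1 + d) (assumed)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)

def manhattan_sim(u, v):
    """Manhattan distance mapped into (0, 1] via 1 / (1 + d) (assumed)."""
    d = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + d)
```

Cosine similarity depends only on the angle between vectors, which is one common reason it outperforms the distance-based scores when vector magnitudes vary with word frequency.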
Domain  | Dataset size | Candidates | MWEC | Annotated new words
Finance | 2,000        | 280        | 197  | 173
Sports  | 2,000        | 652        | 502  | 364
Tourism | 2,000        | 95         | 73   | 112
Music   | 2,000        | 55         | 30   | 29
Results of New Word Discovery
Finance         | Sports        | Tourism       | Music
名贵/特产       | 比赛/结束     | 东/夹道       | 植物/大战/僵尸
八渡/水文站     | 联赛杯/八强   | 史家/胡同     | 道德/绑架
无目的地/航班   | AC/米兰       | 爬/长城       | 网易/云/音乐
房地产/调控     | 主场/对阵     | 百花/草甸     | 网易/云
新冠/肺炎/疫情  | 血洗/林肯城   | 百花山/主峰   | 黑人/抬棺
北京/车展       | 海鸥/军团     | 老舍/纪念馆   | 中文/歌
生态/环保       | 英超/联赛     | 园博/园       | 火影/迷
合同/签署       | 佩里/西奇     | 深度/游       | 戳/爷
光线/传媒       | 鲁本/迪亚斯   | 鼓楼/东大街   | 螺旋/丸
Examples of New Word Discovery
Dataset | Method  | Precision | Recall | F1
DF      | WEBM    | 0.643     | 0.734  | 0.689
DF      | +sense  | 0.596     | 0.796  | 0.682
DF      | +stroke | 0.606     | 0.856  | 0.710
DF      | MWEC    | 0.655     | 0.773  | 0.709
DS      | WEBM    | 0.617     | 0.712  | 0.661
DS      | +sense  | 0.592     | 0.821  | 0.688
DS      | +stroke | 0.520     | 0.874  | 0.652
DS      | MWEC    | 0.596     | 0.821  | 0.691
DT      | WEBM    | 0.552     | 0.429  | 0.482
DT      | +sense  | 0.643     | 0.420  | 0.508
DT      | +stroke | 0.515     | 0.438  | 0.473
DT      | MWEC    | 0.644     | 0.420  | 0.508
DM      | WEBM    | 0.486     | 0.586  | 0.531
DM      | +sense  | 0.571     | 0.690  | 0.625
DM      | +stroke | 0.528     | 0.655  | 0.585
DM      | MWEC    | 0.633     | 0.655  | 0.644
Ablation Experiment Results
Method     | Precision | Recall | F1
BERT(0.85) | 0.560     | 0.728  | 0.633
BERT(0.80) | 0.546     | 0.850  | 0.665
+sense     | 0.596     | 0.796  | 0.682
+stroke    | 0.606     | 0.856  | 0.710
MWEC       | 0.655     | 0.773  | 0.709
Experimental Comparison Results
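The F1 scores reported in the tables above follow the standard harmonic mean of precision and recall, which can be verified directly against the MWEC row on the finance set (P = 0.655, R = 0.773):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 score)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce reported values from the comparison table:
# MWEC:       P=0.655, R=0.773 -> F1 = 0.709
# BERT(0.80): P=0.546, R=0.850 -> F1 = 0.665
mwec_f1 = round(f1(0.655, 0.773), 3)
bert_f1 = round(f1(0.546, 0.850), 3)
```

This also explains why +stroke's F1 (0.710) can edge out MWEC's (0.709) on DF despite lower precision: its much higher recall compensates under the harmonic mean.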