Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
[Objective] This paper proposes MWEC, a method for discovering Chinese new words based on multi-sense word embedding, with the aim of improving word segmentation of social media texts. [Methods] First, we trained MWEC on social media texts, together with data from Chinese HowNet and a Chinese character stroke database, to reduce semantic confusion. Then, we used n-gram frequent string mining to identify sets of highly relevant sub-words and built the candidate new-word set. Finally, we evaluated the candidates using the semantic similarity of the multi-sense word embeddings and identified the new words. [Results] We evaluated the model on datasets from the finance, sports, tourism and music domains. Compared with existing methods, MWEC improved the F1 value by 2.0, 3.0, 2.6 and 11.3 percentage points, respectively. [Limitations] Candidate words are generated based on the popularity of sub-words, which makes low-frequency words difficult to identify. [Conclusions] The multi-sense word embedding algorithm can effectively discover new words from Chinese social media texts.
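The two later stages described in the abstract (frequent n-gram mining over sub-words, then similarity-based filtering of candidates) can be illustrated with a minimal Python sketch. This is not the authors' MWEC implementation: the function names, the toy two-dimensional vectors, the `min_count` and `threshold` values, and the rule of scoring a candidate by the lowest cosine similarity between adjacent sub-words are all assumptions made for illustration. Ordinary single-sense vectors stand in for the multi-sense embeddings.

```python
from collections import Counter
from math import sqrt


def frequent_token_ngrams(sentences, n=2, min_count=2):
    """Count n-grams of adjacent sub-words and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_new_words(candidates, vectors, lexicon, threshold=0.8):
    """Keep candidates absent from the lexicon whose adjacent sub-words are all similar.

    Assumption: a candidate passes when the lowest adjacent-pair similarity
    reaches the threshold. The paper's actual scoring rule is not given here.
    """
    selected = []
    for gram in candidates:
        word = "".join(gram)
        if word in lexicon or any(tok not in vectors for tok in gram):
            continue
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            selected.append(word)
    return sorted(selected)


# Toy, already-segmented sentences and hypothetical embeddings.
sentences = [
    ["区块", "链", "技术"],
    ["区块", "链", "应用"],
    ["学习", "技术"],
]
vectors = {
    "区块": [1.0, 0.0],
    "链": [0.9, 0.1],
    "技术": [0.0, 1.0],
    "应用": [0.1, 0.9],
    "学习": [0.5, 0.5],
}

candidates = frequent_token_ngrams(sentences, n=2, min_count=2)
new_words = select_new_words(candidates, vectors, lexicon={"技术", "应用"})
print(new_words)
```

In this toy run only the pair ("区块", "链") occurs twice, so it is the sole candidate, and it passes the similarity filter because its two sub-word vectors point in nearly the same direction. The sketch also makes the stated limitation concrete: any sub-word sequence that occurs fewer than `min_count` times never becomes a candidate.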
张乐, 冷基栋, 吕学强, 袁梦龙, 游新冬. MWEC:一种基于多语义词向量的中文新词发现方法[J]. 数据分析与知识发现, 2022, 6(1): 113-121.
Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong. Discovering Chinese New Words Based on Multi-sense Word Embedding. Data Analysis and Knowledge Discovery, 2022, 6(1): 113-121.