Discovering Chinese New Words Based on Multi-sense Word Embedding
Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract [Objective] This paper proposes a method for discovering new Chinese words based on multi-sense word embeddings, aiming to improve word segmentation of social media texts. [Methods] First, we trained MWEC, a multi-sense word embedding model, on social media texts together with data from the Chinese HowNet and a Chinese character stroke database to reduce semantic confusion. Then, we used n-gram frequent string mining to identify sets of highly related sub-words and built the candidate new-word set. Finally, we used the semantic similarity of the multi-sense word embeddings to evaluate the candidates and identify new words. [Results] We evaluated the model on datasets from the finance, sports, tourism, and music domains. Compared with existing methods, MWEC improved the F1 value by 2.0, 3.0, 2.6, and 11.3 percentage points, respectively. [Limitations] Candidate words are generated based on the popularity of sub-words, which makes low-frequency words difficult to identify. [Conclusions] The multi-sense word embedding algorithm can effectively discover new words in Chinese social media texts.
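The candidate-generation and scoring steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy sub-word sequences, the hand-written sense vectors, and the choice to take the maximum cosine similarity over sense pairs are all assumptions made for illustration. The abstract does not specify how sense vectors are combined or what threshold is used.

```python
from collections import Counter
from math import sqrt


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Count n-grams (2 <= n <= max_n) of adjacent sub-words and keep the frequent ones."""
    counts = Counter()
    for tokens in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def multi_sense_similarity(senses_a, senses_b):
    """Assumed combination rule: best-matching pair of sense vectors."""
    return max(cosine(u, v) for u in senses_a for v in senses_b)


def score_candidate(gram, sense_vectors):
    """Average multi-sense similarity between adjacent sub-words of a candidate."""
    sims = [
        multi_sense_similarity(sense_vectors[a], sense_vectors[b])
        for a, b in zip(gram, gram[1:])
    ]
    return sum(sims) / len(sims)


# Toy, hand-made data (hypothetical): pre-segmented sub-word sequences.
sentences = [
    ["区块", "链", "技术"],
    ["区块", "链", "应用"],
    ["技术", "应用"],
]
# Each sub-word maps to one or more sense vectors (hypothetical values).
sense_vectors = {
    "区块": [[1.0, 0.0], [0.0, 1.0]],
    "链": [[0.9, 0.1]],
    "技术": [[0.2, 0.8]],
    "应用": [[0.1, 0.9]],
}

candidates = frequent_ngrams(sentences)
scores = {gram: score_candidate(gram, sense_vectors) for gram in candidates}
for gram, score in scores.items():
    print("".join(gram), round(score, 3))
```

In this sketch a candidate would be accepted as a new word when its score exceeds some threshold. The threshold value is a tuning parameter not given in the abstract.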
Received: 07 July 2021
Published: 22 February 2022
Fund: Natural Science Foundation of Beijing (4212020); Open Project Fund of the Provincial Key Laboratory of Tibetan Intelligent Information Processing/the MOE Key Laboratory of Tibetan Information Processing (2019Z002); National Natural Science Foundation of China (61671070)
Corresponding Author: Lv Xueqiang, ORCID: 0000-0002-1422-0560
E-mail: icddtxyx@163.com