Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (1): 113-121    DOI: 10.11925/infotech.2096-3467.2021.0684
Discovering Chinese New Words Based on Multi-sense Word Embedding
Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes a method for discovering Chinese new words based on multi-sense word embedding, aiming to improve word segmentation of social media texts. [Methods] First, we trained the MWEC model on social media texts together with data from Chinese HowNet and a Chinese character stroke database to reduce semantic confusion. Then, we used n-gram frequent string mining to identify highly related sub-word sets and build the candidate set of new words. Finally, we evaluated the candidates with the semantic similarity of multi-sense word embeddings and identified the new words. [Results] We examined the model on datasets from the finance, sports, tourism, and music domains. MWEC improved the F1 score by 2.0, 3.0, 2.6, and 11.3 percentage points respectively compared with existing methods. [Limitations] We generated candidate words based on the frequency of sub-words, which makes it difficult to identify low-frequency new words. [Conclusions] The multi-sense word embedding algorithm can effectively discover new words in Chinese social media texts.
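The candidate-generation step described above (n-gram frequent string mining over segmented sub-words) can be illustrated with a minimal Python sketch. The function name, the n-gram range, and the frequency threshold here are illustrative assumptions, not the paper's exact parameters.

```python
from collections import Counter

def frequent_ngrams(segmented_sentences, max_n=4, min_freq=5):
    """Count contiguous sub-word n-grams and keep the frequent ones.

    `segmented_sentences` is a list of token lists (the output of a
    word segmenter). Frequent n-grams become new-word candidates,
    to be pruned later by semantic similarity.
    """
    counts = Counter()
    for tokens in segmented_sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy corpus: "奢侈品消费" (luxury-goods consumption) co-occurs often,
# so it survives the frequency threshold as a candidate.
corpus = [["奢侈品", "消费"], ["奢侈品", "消费", "后劲"]] * 3
cands = frequent_ngrams(corpus, min_freq=5)
```

Because the threshold is frequency-based, sub-words that rarely co-occur never enter the candidate set, which matches the limitation noted in the abstract.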

Key words: Word Embedding; New Word; Word Segmentation; N-gram; Multi-sense Word Embedding; Semantic Similarity
Received: 07 July 2021      Published: 22 February 2022
CLC Number: TP391
Fund: Natural Science Foundation of Beijing (4212020); Open Project Fund of the Provincial Key Laboratory of Tibetan Intelligent Information Processing / the MOE Key Laboratory of Tibetan Information Processing (2019Z002); National Natural Science Foundation of China (61671070)
Corresponding Author: Lv Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: icddtxyx@163.com

Cite this article:

Zhang Le, Leng Jidong, Lv Xueqiang, Yuan Menglong, You Xindong. Discovering Chinese New Words Based on Multi-sense Word Embedding. Data Analysis and Knowledge Discovery, 2022, 6(1): 113-121.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0684     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I1/113

New Word Discovery Flow Chart
Dataset | Domain  | URL                                       | Size (MB) | Sentences | Sub-words
DF      | Finance | http://finance.sina.com.cn/chanjing/      | 20.1      | 155,168   | 3,475,459
DS      | Sports  | http://sports.sohu.com/guojizuqiu_a.shtml | 4.0       | 34,300    | 724,525
DT      | Tourism | http://www.mafengwo.cn                    | 62.8      | 553,958   | 11,285,076
DM      | Music   | http://music.163.com/                     | 35.0      | 662,640   | 6,818,760
Dataset Statistics
Finance-domain text: 奢侈品消费,将决战于“90后”一代。不可否认,目前中国消费者的奢侈品购买力,虽然仍集中于千万以上资产的人群,但奢侈品消费的“后劲”,则看千禧一代。
Sentence 1: 奢侈品消费,将决战于“90后”一代。
  Segmentation: 奢侈品 / 消费 / 将 / 决战 / 于 / 90 / 后 / 一代
  Annotated new words: 奢侈品消费
Sentence 2: 不可否认,目前中国消费者的奢侈品购买力,虽然仍集中于千万以上资产的人群,但奢侈品消费的“后劲”,则看千禧一代。
  Segmentation: 不可否认 / 目前 / 中国 / 消费者 / 的 / 奢侈品 / 购买力 / 虽然 / 仍 / 集中 / 于 / 千万 / 以上 / 资产 / 的 / 人群 / 但 / 奢侈品 / 消费 / 的 / 后劲 / 则 / 看 / 千禧 / 一代
  Annotated new words: 中国消费者, 奢侈品消费, 千禧一代
Examples of Data Annotation
Frequency Distribution of N-gram String
Domain  | Cosine similarity | Euclidean distance | Manhattan distance
Finance | 0.702             | 0.659              | 0.670
Sports  | 0.692             | 0.603              | 0.628
Tourism | 0.480             | 0.473              | 0.473
Music   | 0.531             | 0.441              | 0.476
F1 Scores of the Candidate-Pruning Experiment
Comparing the Performance of Different Similarity Metrics in Candidate Pruning
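The three metrics compared above can be sketched in plain Python. Converting the two distances into a (0, 1] similarity score via 1 / (1 + d) is an assumption for illustration; the paper does not specify its exact normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclidean_sim(u, v):
    """Euclidean distance mapped into (0, 1] via 1 / (1 + d) (assumed)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)

def manhattan_sim(u, v):
    """Manhattan distance mapped into (0, 1] via 1 / (1 + d) (assumed)."""
    d = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + d)
```

Cosine similarity depends only on the angle between vectors, which is one common reason it outperforms the distance-based scores when vector magnitudes vary with word frequency.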
Domain  | Dataset size | Candidates | MWEC | Annotated new words
Finance | 2,000        | 280        | 197  | 173
Sports  | 2,000        | 652        | 502  | 364
Tourism | 2,000        | 95         | 73   | 112
Music   | 2,000        | 55         | 30   | 29
Results of New Word Discovery
Finance         | Sports        | Tourism       | Music
名贵/特产       | 比赛/结束     | 东/夹道       | 植物/大战/僵尸
八渡/水文站     | 联赛杯/八强   | 史家/胡同     | 道德/绑架
无目的地/航班   | AC/米兰       | 爬/长城       | 网易/云/音乐
房地产/调控     | 主场/对阵     | 百花/草甸     | 网易/云
新冠/肺炎/疫情  | 血洗/林肯城   | 百花山/主峰   | 黑人/抬棺
北京/车展       | 海鸥/军团     | 老舍/纪念馆   | 中文/歌
生态/环保       | 英超/联赛     | 园博/园       | 火影/迷
合同/签署       | 佩里/西奇     | 深度/游       | 戳/爷
光线/传媒       | 鲁本/迪亚斯   | 鼓楼/东大街   | 螺旋/丸
Examples of New Word Discovery
Dataset | Method  | Precision | Recall | F1
DF      | WEBM    | 0.643     | 0.734  | 0.689
DF      | +sense  | 0.596     | 0.796  | 0.682
DF      | +stroke | 0.606     | 0.856  | 0.710
DF      | MWEC    | 0.655     | 0.773  | 0.709
DS      | WEBM    | 0.617     | 0.712  | 0.661
DS      | +sense  | 0.592     | 0.821  | 0.688
DS      | +stroke | 0.520     | 0.874  | 0.652
DS      | MWEC    | 0.596     | 0.821  | 0.691
DT      | WEBM    | 0.552     | 0.429  | 0.482
DT      | +sense  | 0.643     | 0.420  | 0.508
DT      | +stroke | 0.515     | 0.438  | 0.473
DT      | MWEC    | 0.644     | 0.420  | 0.508
DM      | WEBM    | 0.486     | 0.586  | 0.531
DM      | +sense  | 0.571     | 0.690  | 0.625
DM      | +stroke | 0.528     | 0.655  | 0.585
DM      | MWEC    | 0.633     | 0.655  | 0.644
Ablation Experiment Results
Method     | Precision | Recall | F1
BERT(0.85) | 0.560     | 0.728  | 0.633
BERT(0.80) | 0.546     | 0.850  | 0.665
+sense     | 0.596     | 0.796  | 0.682
+stroke    | 0.606     | 0.856  | 0.710
MWEC       | 0.655     | 0.773  | 0.709
Experimental Comparison Results
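The F1 scores reported in the tables above follow the standard harmonic mean of precision and recall, which can be verified directly against the MWEC row on the finance set (P = 0.655, R = 0.773):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 score)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce reported values from the comparison table:
# MWEC:       P=0.655, R=0.773 -> F1 = 0.709
# BERT(0.80): P=0.546, R=0.850 -> F1 = 0.665
mwec_f1 = round(f1(0.655, 0.773), 3)
bert_f1 = round(f1(0.546, 0.850), 3)
```

This also explains why +stroke's F1 (0.710) can edge out MWEC's (0.709) on DF despite lower precision: its much higher recall compensates under the harmonic mean.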