[Objective] This paper aims to improve the accuracy of word segmentation for literature containing many scientific terms. [Methods] First, we developed the DBLC model, which combines dictionary-based, statistical, and deep learning methods. Then, we retrieved articles from the Chinese Management Case Center to build the experimental corpus. Finally, we compared the performance of the new model with that of existing ones. [Results] The DBLC model outperformed the other models, with a word segmentation accuracy of up to 96.3%. [Limitations] We did not separate the words of the original dictionary from the new words. We also did not redesign the storage structure of the dictionary, which prolonged the model's computing time. [Conclusions] The proposed DBLC model improves the accuracy of word segmentation, and this accuracy is positively correlated with the dictionary size.
Feng Guoming, Zhang Xiaodong, Liu Suhui. DBLC Model for Word Segmentation Based on Autonomous Learning[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 40-47.