DBLC Model for Word Segmentation Based on Autonomous Learning
Feng Guoming, Zhang Xiaodong, Liu Suhui
School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Abstract [Objective] This paper aims to improve the accuracy of word segmentation for literature that contains many scientific terms. [Methods] First, we implemented the DBLC model, which combines dictionary-based, statistical, and deep learning methods. Then, we retrieved articles from the Chinese Management Case Center to build the experimental corpus. Finally, we compared the performance of the new model with that of existing ones. [Results] The DBLC model outperformed the other models, with a word segmentation accuracy of up to 96.3%. [Limitations] We did not separate the words of the original dictionary from the new words. We also did not redesign the storage structure of the dictionary, which prolonged the model's computing time. [Conclusions] The proposed DBLC model improves the accuracy of word segmentation, and this accuracy is positively correlated with dictionary size.
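To make the dictionary-based component more concrete, the following is a minimal sketch of forward maximum matching, a common dictionary-driven segmentation technique. It is an illustration only and not the DBLC model itself: the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions introduced here, and the statistical and deep learning components are not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation (hypothetical helper, not from the paper).

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, assumed purely for illustration.
toy_dictionary = {"管理", "案例", "中心", "案例中心"}
print(forward_max_match("管理案例中心", toy_dictionary))
# → ['管理', '案例中心']
```

In a sketch like this, adding a domain term to the dictionary directly changes the segmentation, which gives some intuition for why segmentation accuracy can depend on dictionary size.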
Received: 21 December 2017
Published: 20 June 2018