Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 40-47    DOI: 10.11925/infotech.2096-3467.2017.1302
DBLC Model for Word Segmentation Based on Autonomous Learning
Feng Guoming, Zhang Xiaodong, Liu Suhui
School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
[Objective] This paper aims to improve the accuracy of word segmentation for literature that contains many scientific terms. [Methods] First, we implemented the DBLC model, which combines dictionary-based, statistical, and deep learning methods. Then, we retrieved articles from the Chinese Management Case Center to build the experimental corpus. Finally, we compared the performance of the new model with that of existing ones. [Results] The DBLC model outperformed the other models, with a word segmentation accuracy of up to 96.3%. [Limitations] We did not separate the words of the original dictionary from the new words. We also did not redesign the storage structure of the dictionary, which prolonged the computing time of the model. [Conclusions] The proposed DBLC model improves the accuracy of word segmentation, and this accuracy is positively correlated with the dictionary size.

Keywords: Chinese Word Segmentation; Sequence Labeling; BI-LSTM-CRF; Autonomous Learning; Word Segmentation Based on Dictionary
Received: 21 December 2017      Published: 20 June 2018
ZTFLH:  G350  

Feng Guoming, Zhang Xiaodong, Liu Suhui. DBLC Model for Word Segmentation Based on Autonomous Learning. Data Analysis and Knowledge Discovery, 2018, 2(5): 40-47.

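Subject                          Training set   Test set      Total characters
Strategic management             33 articles    7 articles    610,000
Project management               33 articles    7 articles    710,000
Management information systems   33 articles    7 articles    580,000
Total characters                 1,430,000      470,000       1,900,000
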
Method                 Segmentation result
FMM                    百丽/鞋业/采取/纵向/一体化/敏捷供应链/的/业务/模式
JIEBA                  百丽/鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
CRF                    百丽鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
BI-LSTM-CRF            百丽鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
DBLC                   百丽鞋业/采取/纵向一体化(NW)/敏捷供应链/的/业务模式
Correct segmentation   百丽鞋业/采取/纵向一体化/敏捷供应链/的/业务模式

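The rows above show how much the dictionary-based methods depend on what the dictionary contains. As a rough illustration, the following is a minimal sketch of forward maximum matching (FMM), not the authors' implementation. The function name `fmm_segment` and both toy dictionaries are hypothetical and were chosen only for this example sentence.

```python
def fmm_segment(text, dictionary, max_len=None):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    if max_len is None:
        # Longest dictionary entry bounds the search window (1 if the dictionary is empty).
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # fall back to a single character when nothing matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "百丽鞋业采取纵向一体化敏捷供应链的业务模式"

# Hypothetical base dictionary without the domain terms.
base_dict = {"百丽", "鞋业", "采取", "纵向", "一体化", "敏捷供应链", "的", "业务", "模式"}
print("/".join(fmm_segment(sentence, base_dict)))

# Hypothetical dictionary after new words have been added.
learned_dict = base_dict | {"百丽鞋业", "纵向一体化", "业务模式"}
print("/".join(fmm_segment(sentence, learned_dict)))
```

With these toy dictionaries, the first call reproduces the FMM row and the second reproduces the correct segmentation. This illustrates only why adding newly learned words to the dictionary can help, and says nothing about how DBLC discovers those words.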

             Strategic management             Project management               Management information systems   Overall
FMM          0.837      -          -          0.820      -          -          0.874      -          -          0.849      -          -
JIEBA        0.892      -          -          0.867      -          -          0.894      -          -          0.868      -          -
CRF          0.917      0.909      0.913      0.889      0.882      0.885      0.925      0.919      0.922      0.891      0.885      0.888
BI-LSTM-CRF  0.903      0.886      0.894      0.934      0.930      0.932      0.907      0.896      0.901      0.946      0.939      0.942
DBLC         0.937      0.927      0.932      0.951      0.946      0.948      0.927      0.921      0.924      0.963      0.950      0.957
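
The labels for the three values in each group are not preserved in this table. One plausible reading, stated here as an assumption rather than as the authors' definition, is precision, recall, and F1. The short check below tests that reading on the overall columns by comparing the third value with the harmonic mean of the first two. The names `f1` and `max_gap` are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Overall columns copied from the table: (first, second, third) value per model.
overall = {
    "CRF": (0.891, 0.885, 0.888),
    "BI-LSTM-CRF": (0.946, 0.939, 0.942),
    "DBLC": (0.963, 0.950, 0.957),
}


def max_gap(rows):
    """Largest absolute difference between the reported third value and the computed F1."""
    return max(abs(f1(p, r) - reported) for p, r, reported in rows.values())


for name, (p, r, reported) in overall.items():
    print(f"{name}: computed {f1(p, r):.4f}, reported {reported:.3f}")
print(f"largest gap: {max_gap(overall):.4f}")
```

The computed values agree with the reported ones to within 0.001, which is consistent with the assumed reading once rounding of the inputs is taken into account. Under that reading, the 96.3% figure in the abstract corresponds to the first overall value for DBLC.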
