Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 40-47    DOI: 10.11925/infotech.2096-3467.2017.1302
Original Article
DBLC Model for Word Segmentation Based on Autonomous Learning
Feng Guoming, Zhang Xiaodong, Liu Suhui
School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Abstract  

[Objective] This paper aims to improve the accuracy of word segmentation for literature containing many scientific terms. [Methods] First, we built the DBLC model, which combines dictionary-based, statistical, and deep learning methods. Then, we retrieved articles from the Chinese Management Case Center to build the experimental corpus. Finally, we compared the performance of the new model with that of existing ones. [Results] The DBLC model outperformed the other models, with word segmentation accuracy reaching 96.3%. [Limitations] We did not separate the words of the original dictionary from the new words. We also did not redesign the storage structure of the dictionary, which prolonged the computing time of our model. [Conclusions] The proposed DBLC model improves the accuracy of word segmentation, and this accuracy is positively correlated with the dictionary size.

Key words: Chinese Word Segmentation; Sequence Labeling; BI-LSTM-CRF; Autonomous Learning; Word Segmentation Based on Dictionary
Received: 21 December 2017      Published: 20 June 2018
ZTFLH:  G350  

Cite this article:

Feng Guoming, Zhang Xiaodong, Liu Suhui. DBLC Model for Word Segmentation Based on Autonomous Learning. Data Analysis and Knowledge Discovery, 2018, 2(5): 40-47.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.1302     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I5/40

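Dataset by category

Category | Training set | Test set | Total characters
Strategic Management | 33 articles | 7 articles | 610,000
Project Management | 33 articles | 7 articles | 710,000
Management Information Systems | 33 articles | 7 articles | 580,000
Total characters | 1,430,000 | 470,000 | 1,900,000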
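
Segmentation results by method for one example sentence (roughly: "Belle Footwear adopts a vertically integrated, agile supply chain business model")

Method | Segmentation result
FMM | 百丽/鞋业/采取/纵向/一体化/敏捷供应链/的/业务/模式
JIEBA | 百丽/鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
CRF | 百丽鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
BI-LSTM-CRF | 百丽鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式
DBLC | 百丽鞋业/采取/纵向一体化(NW)/敏捷供应链/的/业务模式
Correct segmentation | 百丽鞋业/采取/纵向一体化/敏捷供应链/的/业务模式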
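
The FMM row in the comparison above can be reproduced with a few lines of code. The sketch below is a generic forward maximum matching routine, not the authors' implementation; the function name `fmm_segment`, the `max_len` parameter, and the toy dictionary are assumptions chosen only so that the output mirrors that row.

```python
def fmm_segment(text, dictionary, max_len=5):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary (an assumption, not the dictionary used in the paper).
toy_dictionary = {"百丽", "鞋业", "采取", "纵向", "一体化", "敏捷供应链", "的", "业务", "模式"}
sentence = "百丽鞋业采取纵向一体化敏捷供应链的业务模式"

print("/".join(fmm_segment(sentence, toy_dictionary)))
# prints: 百丽/鞋业/采取/纵向/一体化/敏捷供应链/的/业务/模式
```

Under the same assumptions, adding 百丽鞋业, 纵向一体化 and 业务模式 to the toy dictionary makes the routine return the correct segmentation. This is one way to see why a dictionary-based method depends on dictionary coverage.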
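
Precision (P), recall (R), and F-measure (F) by method and dataset

Method | Strategic Management (P / R / F) | Project Management (P / R / F) | Management Information Systems (P / R / F) | Overall (P / R / F)
FMM | 0.837 / - / - | 0.820 / - / - | 0.874 / - / - | 0.849 / - / -
JIEBA | 0.892 / - / - | 0.867 / - / - | 0.894 / - / - | 0.868 / - / -
CRF | 0.917 / 0.909 / 0.913 | 0.889 / 0.882 / 0.885 | 0.925 / 0.919 / 0.922 | 0.891 / 0.885 / 0.888
BI-LSTM-CRF | 0.903 / 0.886 / 0.894 | 0.934 / 0.930 / 0.932 | 0.907 / 0.896 / 0.901 | 0.946 / 0.939 / 0.942
DBLC | 0.937 / 0.927 / 0.932 | 0.951 / 0.946 / 0.948 | 0.927 / 0.921 / 0.924 | 0.963 / 0.950 / 0.957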
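
The P, R, and F values above can be read against the usual word-level definitions. The sketch below assumes those definitions: a predicted word counts as correct only if its character span matches a span in the reference segmentation, and F is the harmonic mean of P and R. The page does not state the exact evaluation procedure used in the paper, and the helper names `to_spans` and `precision_recall_f` are hypothetical. The example scores the CRF segmentation of the sample sentence against the correct segmentation.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(predicted, gold):
    """Word-level precision, recall and F-measure (assumed standard definitions)."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)  # words with identical boundaries
    p = correct / len(pred_spans)
    r = correct / len(gold_spans)
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


gold = "百丽鞋业/采取/纵向一体化/敏捷供应链/的/业务模式".split("/")
crf = "百丽鞋业/采取/纵向/一体化/敏捷供应链/的/业务模式".split("/")

p, r, f = precision_recall_f(crf, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
# prints: P=0.714 R=0.833 F=0.769
```

As a consistency check under the same assumption, the CRF values reported for Strategic Management (P = 0.917, R = 0.909) give 2PR/(P+R) ≈ 0.913, which matches the F value listed in that row.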