Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 1-17     https://doi.org/10.11925/infotech.2096-3467.2019.1059
Review of Chinese Word Segmentation Studies*
Tang Lin, Guo Chonghui, Chen Jingfeng
Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China
Abstract

[Objective] This paper reviews the key issues, algorithms, and models in Chinese word segmentation, aiming to provide a theoretical basis and practical guidance for researchers. [Coverage] We searched CNKI, the Wanfang Data Knowledge Service Platform, and the DBLP Computer Science Bibliography for literature on Chinese word segmentation and selected 109 representative papers for review. [Methods] We first outline the development and key issues of Chinese word segmentation, then classify and summarize its algorithms and models, and finally describe recent hot research topics in detail. [Results] Multi-criteria segmentation models trained on multiple annotated datasets are the main research difficulty, and multi-task joint models that solve Chinese word segmentation together with other natural language processing subtasks are the current research hotspot. [Limitations] Unsupervised learning approaches to Chinese word segmentation are not compared and analyzed in depth. [Conclusions] Although existing Chinese word segmentation methods can meet the needs of many applications to some extent, research on multi-perspective, multi-task, and multi-criteria joint models in big-data settings remains challenging.

Keywords: Chinese Word Segmentation; Word Segmentation Algorithm; Multi-Criteria Learning; Joint Model
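Received: 2019-09-23      Published: 2020-04-26
CLC number (ZTFLH): TP393
Funding: *This paper is one of the research results of the National Natural Science Foundation of China project "Research on Clustering Models and Algorithms in Electronic Medical Record Mining" (71771034) and the Jieyang Science and Technology Plan project "Big-Data-Driven Decision Support System for the Development of the Chinese Medicinal Materials Industry" (2017xm041).
Corresponding author: Guo Chonghui     E-mail: dlutguo@dlut.edu.cn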
Cite this article:
Tang Lin, Guo Chonghui, Chen Jingfeng. Review of Chinese Word Segmentation Studies[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 1-17.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1059      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I2/3/1
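Fig. 1  Distribution of the "Chinese word segmentation literature" by publication year
Fig. 2  Co-occurrence network of nouns in the titles of the "Chinese word segmentation literature"
Fig. 3  Topics and timeline of conferences and bakeoffs related to Chinese word segmentation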
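| Year | Authors | Method | Source | Closed PKU | Closed MSR | Closed CityU | Closed AS | Open PKU | Open MSR | Open CityU | Open AS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | Zhang et al.[12] | Deep learning incorporating dictionaries | AAAI | - | - | - | - | 96.5 | 97.8 | 96.3 | 95.9 |
| 2017 | Cai et al.[13] | Character- and word-based deep learning | ACL | 95.4 | 97.0 | 95.4 | 95.2 | 95.8 | 97.1 | 95.6 | 95.3 |
| 2015 | Chen et al.[14] | Deep-learning long short-term memory network | EMNLP | 94.3 | 95.0 | - | - | 96.5 | 97.4 | - | - |
| 2012 | Sun et al.[15] | Feature-rich joint model that learns Chinese word segmentation and new word detection simultaneously | ACL | 95.4 | 97.4 | 94.8 | - | - | - | - | - |
| 2010 | Zhao et al.[16] | Character-based 6-tag labeling | TALIP | - | - | - | - | - | 98.3 | 97.8 | 96.1 |
| 2008 | Zhao et al.[17] | Character-based conditional random field assisted by unsupervised segmentation | SIGHAN | 95.4 | 97.6 | 96.1 | 95.7 | - | - | - | - |
| 2007 | Zhang et al.[18] | Word-based discriminative perceptron | ACL | 94.5 | 97.2 | 94.6 | 96.5 | - | - | - | - |
| 2005 | Bakeoff | Bakeoff results | Bakeoff | 95.0 | 96.4 | 94.3 | 95.2 | 96.9 | 97.2 | 96.2 | 95.6 |

Table 1  F-scores (%) on the SIGHAN 2005 datasets ("Closed" = closed test, "Open" = open test)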
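The F-scores in Table 1 are word-level scores. As a minimal sketch of how such a score can be computed, the example below treats each word as a character span and counts a predicted word as correct only when both of its boundaries match the gold segmentation. The function names and the example sentence are illustrative assumptions, not the official bakeoff scoring script.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return word-level precision, recall, and F-score for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction merges the last two gold words.
gold = ["中文", "分词", "技术", "研究"]
pred = ["中文", "分词", "技术研究"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.500 F=0.571
```

In practice the counts would be accumulated over a whole test set before computing the ratios, rather than averaged sentence by sentence.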
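Fig. 4  Keyword word cloud
Fig. 5  Distribution of selected keywords in the "Chinese word segmentation literature" (number of papers)
Fig. 6  Distribution of keywords related to neural networks and deep learning
Fig. 7  Current state of Chinese word segmentation research
Fig. 8  Workflow of deep-learning-based Chinese word segmentation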
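| Year | Authors | Source | Research approach | Method | Datasets used in experiments |
|---|---|---|---|---|---|
| 2019 | Gong et al.[24] | AAAI | Method improvement | The model consists of multiple long short-term memory networks (LSTMs) and a switcher that switches automatically among these LSTMs. | SIGHAN2005[11] (MSR, AS); SIGHAN2008[84] (PKU, CTB, CKIP, CityU, NCC, SXU) |
| 2019 | Huang et al.[85] | arXiv | Method improvement | Based on BERT (Bidirectional Encoder Representations from Transformers), with model pruning, quantization, and compiler optimization. | CTB6[72]; SIGHAN2005[11] (CityU, PKU, MSR, AS); SIGHAN2008[84] (SXU); CoNLL2017[86] (UD) |
| 2019 | Qiu et al.[87] | arXiv | Method improvement | A Transformer-based architecture that uses a fully connected self-attention mechanism. | SIGHAN2005[11] (CityU, PKU, MSR, AS); SIGHAN2008[84] (CTB, CKIP, NCC, SXU) |
| 2019 | He et al.[88] | SCI | Corpus improvement | Artificial tokens are added at the beginning and end of every sentence to distinguish corpora of different granularities; LSTM and CRF are then used for multi-grained segmentation. | SIGHAN2005[11] (MSR, AS, PKU); SIGHAN2008[84] (CTB, CKIP, CityU, NCC, SXU) |
| 2019 | Zhang Wenjing et al.[82] | Journal of Chinese Information Processing | Corpus improvement; method improvement | With the help of a lattice structure, the model captures segmentation criteria of different granularities well and is not restricted to a single criterion. | MSR[89], PPD[90], CTB[72] |
| 2017 | Chen et al.[91] | ACL | Method improvement | Borrowing from multi-task learning, data from multiple corpora are combined to improve a shared character-embedding module. An adversarial network is then applied to strip criterion-specific information out of the shared module and into the private modules, which keeps the advantage of a large data volume while avoiding mutual interference between corpora. | SIGHAN2005[11] (MSR, AS); SIGHAN2008[84] (PKU, CTB, CKIP, CityU, NCC, SXU) |
| 2017 | Gong et al.[83] | EMNLP | Corpus improvement | Builds a multi-grained corpus. | MSR[89], PPD[90], CTB[72] |

Table 2  Comparative analysis of the literature on multi-grained and multi-criteria word segmentation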
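One corpus-level idea in Table 2 is to mark every sentence with artificial tokens at its beginning and end, so that a single model can tell which annotation criterion a sentence follows. The sketch below illustrates only that preprocessing step. The token format `<pku>` / `</pku>` and the function name are hypothetical choices made for this example and are not taken from the cited paper.

```python
def add_criterion_tokens(sentence, criterion):
    """Wrap a sentence's characters in tokens that name its annotation criterion.

    The "<name>" / "</name>" token format is an assumption for illustration.
    """
    return [f"<{criterion}>"] + list(sentence) + [f"</{criterion}>"]


# The same raw text becomes two different model inputs, one per criterion,
# so corpora annotated under different standards can share one model.
for criterion in ("pku", "msr"):
    print(add_criterion_tokens("北京大学", criterion))
```

A sequence labeler such as an LSTM with a CRF layer would then be trained on the wrapped character sequences from all corpora together.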
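| Task type | Year | Authors | Source | Method |
|---|---|---|---|---|
| Unified natural language processing framework | 2008 | Collobert et al.[93] | ICML | A deep-learning CNN model that first proposed a unified framework for natural language processing. The framework performs multi-task learning over part-of-speech tagging, shallow semantic analysis, named entity recognition, and semantic role labeling. |
| Chinese word segmentation and part-of-speech tagging | 2004 | Ng et al.[76] | EMNLP | Defines a cross-labeling scheme that annotates the results of both tasks at the same time. |
| Chinese word segmentation and part-of-speech tagging | 2010 | Zhang et al.[94] | EMNLP | A single linear model; beam search is used to improve decoding efficiency. |
| Chinese word segmentation and part-of-speech tagging | 2013 | Zeng et al.[95] | ACL | A semi-supervised method based on graph label propagation. |
| Chinese word segmentation and part-of-speech tagging | 2013 | Qiu et al.[96] | EMNLP | Builds loose, uncertain mappings between heterogeneous annotated corpora, improving segmentation and part-of-speech tagging accuracy on those corpora during training. |
| Chinese word segmentation and part-of-speech tagging | 2013 | Zheng et al.[97] | EMNLP | Uses deep learning to learn features automatically, avoiding manual feature selection, and combines it with a traditional CRF. |
| Chinese word segmentation and part-of-speech tagging | 2016 | Wang et al.[98] | ICIIP | Based on a hierarchical long short-term memory network; multiple tasks are trained jointly under one objective function, avoiding the error propagation of pipeline models. |
| Chinese word segmentation and part-of-speech tagging | 2016 | Chen et al.[99] | arXiv | Proposes a long-dependency-aware deep architecture; a joint model completes segmentation and part-of-speech tagging together. |
| Chinese word segmentation and part-of-speech tagging | 2017 | Chen et al.[100] | IJCAI | Proposes a feature-enriched deep learning framework for segmentation and part-of-speech tagging. It is also a joint model and can also handle long-distance dependencies. |
| Chinese word segmentation, part-of-speech tagging, and dependency parsing | 2012 | Hatori et al.[101] | ACL | Proposes an incremental multi-task joint model, the first joint model able to handle Chinese word segmentation, part-of-speech tagging, and dependency parsing together. |
| Chinese word segmentation, part-of-speech tagging, and dependency parsing | 2013 | Wang et al.[102] | ACL | Uses a lattice-based structure: the sentence is first divided into a word lattice, on which part-of-speech tagging and dependency parsing are performed. It is a joint model. |
| Chinese word segmentation, part-of-speech tagging, and dependency parsing | 2016 | Guo et al.[103] | IEICE Transactions | Proposes a character-level semi-supervised joint model that obtains N-gram features and dependency-subtree features from partially annotated corpora. |
| Chinese word segmentation, part-of-speech tagging, and dependency parsing | 2016 | Shen et al.[104] | COLING | Proposes a novel annotation method that overcomes two problems of traditional word-based annotation: inconsistency and sparsity. |
| Chinese word segmentation and dependency parsing | 2019 | Yan et al.[105] | arXiv | First proposes a unified model for Chinese word segmentation and dependency parsing; the model is a graph-based deep learning model. |
| Chinese word segmentation and out-of-vocabulary words | 2015 | Li et al.[106] | TALLIP | Proposes a character-based generative model that performs segmentation and out-of-vocabulary word detection together. Out-of-vocabulary words mainly include words not in the dictionary, named entities, and suffix-derived words. |
| Chinese word segmentation and informal word detection | 2017 | Zhang et al.[107] | IJCAI | Chinese microblog text contains informal words, which traditional segmentation models handle poorly. To address this, the paper proposes a deep-learning joint model for segmentation and informal word detection. |
| Chinese word segmentation and Chinese spelling correction | 2017 | Shi et al.[108] | SMP | Proposes a sequence-to-sequence labeling method built on an attention-based encoder-decoder architecture that addresses both Chinese word segmentation and Chinese spelling correction. |
| Chinese word segmentation and named entity recognition | 2019 | Wu et al.[109] | WWW | Proposes a new framework, CNER, that combines CNN, LSTM, and CRF. The framework recognizes named entities while segmenting. |

Table 3  Analysis of the literature on multi-task joint models related to Chinese word segmentation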
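Several joint segmentation and part-of-speech tagging models in Table 3 rely on labels that encode both tasks at once. A minimal sketch of such cross-labeling follows, assuming a B/M/E/S boundary tag set joined to the part-of-speech tag with a hyphen. The tag format, function names, and example sentence are illustrative assumptions rather than the exact scheme of any cited paper.

```python
def encode_joint_tags(tagged_words):
    """Turn (word, pos) pairs into per-character (char, "BOUNDARY-POS") labels."""
    labeled = []
    for word, pos in tagged_words:
        if len(word) == 1:
            bounds = ["S"]  # single-character word
        else:
            bounds = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        labeled.extend((ch, f"{b}-{pos}") for ch, b in zip(word, bounds))
    return labeled


def decode_joint_tags(labeled_chars):
    """Recover (word, pos) pairs from per-character joint labels."""
    words, buffer = [], ""
    for ch, tag in labeled_chars:
        bound, pos = tag.split("-", 1)
        buffer += ch
        if bound in ("E", "S"):  # a word ends at this character
            words.append((buffer, pos))
            buffer = ""
    return words


# Hypothetical example sentence with Penn-Treebank-style tags.
sentence = [("中文", "NN"), ("分词", "NN"), ("很", "AD"), ("难", "VA")]
labels = encode_joint_tags(sentence)
print(labels)
print(decode_joint_tags(labels) == sentence)  # True
```

Because every character carries one combined label, a single sequence labeler can predict word boundaries and part-of-speech tags in one pass, which is what lets a joint model avoid the error propagation of a pipeline.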
[1] GB/T 13715-1992, Contemporary Chinese Language Word Segmentation Specification for Information Processing[S]. Beijing: Standards Press of China, 1993.
[2] Liang Nanyuan. An Introduction to Automatic Distinguishing of Written Chinese Words[J]. Computer Applications and Software, 1987(3):44-50.
[3] Liu Kaiying. Research on Automatic Word Segmentation Assessment Technology in Modern Chinese[J]. Applied Linguistics, 1997(1):103-108.
[4] Sun Maosong. Some Recent Advances in the Study of Chinese Automatic Word Segmentation: A Brief Introduction to the Work of Tsinghua University[C]// Proceedings of the 20th Anniversary Academic Conference of Chinese Information Processing Society of China, Beijing. Beijing: Tsinghua University Press, 2001: 44-50.
[5] Huang Changning, Zhao Hai. Chinese Word Segmentation: A Decade Review[J]. Journal of Chinese Information Processing, 2007,21(3):8-19.
[6] He Zi, Wang Wanwu. Research and Application of Chinese Word Segmentation Technology Based on Natural Language Information Retrieval[J]. Information Science, 2008,26(5):787-791.
[7] Feng Guohe, Zheng Wei. Review of Chinese Automatic Word Segmentation[J]. Library and Information Service, 2011,55(2):41-45.
[8] Zhao Fangfang, Jiang Zhipeng, Guan Yi. The Review on the Joint Model of Chinese Word Segmentation and Part-of-speech Tagging[J]. Intelligent Computer and Applications, 2014,4(3):77-80.
[9] Liang Xitao, Gu Lei. Study on Word Segmentation and Part-of-speech Tagging[J]. Computer Technology and Development, 2015,25(2):175-180.
[10] Zhao Hai, Cai Deng, Huang Changning. Chinese Word Segmentation: Review (2007-2017)[A]// Jie Chunyu, Liu Meijun. Frontiers of Empirical and Corpus Linguistics[M]. Beijing: China Social Sciences Press, 2017.
[11] Emerson T. The Second International Chinese Word Segmentation Bakeoff[C]// Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea. New York, USA: ACL, 2005: 123-133.
[12] Zhang Q, Liu X, Fu J. Neural Networks Incorporating Dictionaries for Chinese Word Segmentation[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA. California, USA: AAAI, 2018.
[13] Cai D, Zhao H, Zhang Z, et al. Fast and Accurate Neural Word Segmentation for Chinese[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. USA: ACL, 2017: 608-615.
[14] Chen X, Qiu X, Zhu C, et al. Long Short-Term Memory Neural Networks for Chinese Word Segmentation[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal. New York, USA: ACL, 2015: 1197-1206.
[15] Sun X, Wang H, Li W. Fast Online Training with Frequency-adaptive Learning Rates for Chinese Word Segmentation and New Word Detection[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea. USA: ACL, 2012: 253-262.
[16] Zhao H, Huang C N, Li M, et al. A Unified Character-based Tagging Framework for Chinese Word Segmentation[J]. ACM Transactions on Asian Language Information Processing (TALIP), 2010, 9(2):Article No. 5.
[17] Zhao H, Kit C. Unsupervised Segmentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Entity Recognition[C]// Proceedings of the 6th SIGHAN Workshop on Chinese Language Processing, Hyderabad, India. New York, USA: ACL, 2008: 106-111.
[18] Zhang Y, Clark S. Chinese Segmentation with a Word-based Perceptron Algorithm[C]// Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic. USA: ACL, 2007: 840-847.
[19] Sproat R, Emerson T. The First International Chinese Word Segmentation Bakeoff[C]// Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. New York, USA: ACL, 2003: 133-143.
[20] Wang Huanhuan. Indication Similarity of Drugs Based on Chinese Word Segmentation Technology[D]. Huainan: Anhui University of Science & Technology, 2015.
[21] Zhao Haoxin, Yu Jingsong, Lin Jie. Design and Research on Chinese Word Embedding Model Based on Strokes[J]. Journal of Chinese Information Processing, 2019,33(5):17-23.
[22] Zhang Tao. Design and Implementation of Chinese Text Automatic Proofreading System[D]. Chengdu: Southwest Jiaotong University, 2017.
[23] Sproat R, Shih C, Gale W, et al. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese[C]// Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico, USA. New York, USA: ACL, 1994: 66-73.
[24] Gong J, Chen X, Gui T, et al. Switch-LSTMs for Multi-Criteria Chinese Word Segmentation[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA. California, USA: AAAI, 2019: 6457-6464.
[25] Liu Jian, Zhang Weiming. Fast Crossing Ambiguity Detection Method[J]. Application Research of Computers, 2008,25(11):3259-3261.
[26] Qin Ying, Wang Xiaojie, Zhang Suxiang. Research on Combinational Ambiguity in Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2007,21(1):3-8.
[27] Zheng Jiaheng, Zhang Jianfeng, Tan Hongye. Segmentation Strategies on Ambiguity String in Chinese Word Segmentation[J]. Journal of Shanxi University: Natural Science Edition, 2007,30(2):163-167.
[28] Humphreys K, Gaizauskas R, Azzam S, et al. University of Sheffield: Description of the LaSIE-II System as Used for MUC-7[C]// Proceedings of the 7th Message Understanding Conference, Virginia, USA. New York, USA: ACL, 1998.
[29] Sun Maosong, Zuo Zhengping, Huang Changning. An Experimental Study on Dictionary Mechanism for Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2000,14(1):1-6.
[30] Sproat R, Shih C. A Statistical Method for Finding Word Boundaries in Chinese Text[J]. Computer Processing of Chinese and Oriental Languages, 1990,4(4):336-351.
[31] Huang C N, Zhao H. Which is Essential for Chinese Word Segmentation: Character Versus Word[C]// Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, Wuhan, China. Beijing, China: Tsinghua University Press, 2006: 1-12.
[32] Xue N. Chinese Word Segmentation as Character Tagging[J]. Computational Linguistics & Chinese Language Processing, 2003,8(1):29-47.
[33] Xue N, Converse S P. Combining Classifiers for Chinese Word Segmentation[C]// Proceedings of the 1st SIGHAN Workshop on Chinese Language Processing, Taipei, China. New York, USA: ACL, 2002.
[34] Low J K, Ng H T, Guo W. A Maximum Entropy Approach to Chinese Word Segmentation[C]// Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea. New York, USA: ACL, 2005.
[35] Berger A L, Pietra V J D, Pietra S A D. A Maximum Entropy Approach to Natural Language Processing[J]. Computational Linguistics, 1996,22(1):39-71.
[36] Rabiner L R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition[J]. Proceedings of the IEEE, 1989,77(2):257-286.
[37] McCallum A, Freitag D, Pereira F C N. Maximum Entropy Markov Models for Information Extraction and Segmentation[C]// Proceedings of the 17th International Conference on Machine Learning, CA, USA. CA, USA: ICMS, 2000.
[38] Peng F, Feng F, McCallum A. Chinese Segmentation and New Word Detection Using Conditional Random Fields[C]// Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland. New York, USA: ACL, 2004.
[39] Tseng H, Chang P, Andrew G, et al. A Conditional Random Field Word Segmenter for SIGHAN Bakeoff 2005[C]// Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea. New York, USA: ACL, 2005.
[40] Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of the 18th International Conference on Machine Learning, MA, USA. CA, USA: ICMS, 2001: 282-289.
[41] Xiu Chi. The Research and Implementation of Method for Domain Chinese Word Segmentation[D]. Beijing: Beijing University of Technology, 2013.
[42] Lü X, Zhang L, Hu J. Statistical Substring Reduction in Linear Time[C]// Proceedings of the 2004 International Conference on Natural Language Processing, Hainan, China. 2004.
[43] Kit C, Wilks Y. Unsupervised Learning of Word Boundary with Description Length Gain[C]// Proceedings of the 3rd SIGNLL Conference on Computational Natural Language Learning, Bergen, Norway. New York, USA: SIGNLL, 1999.
[44] Feng H, Chen K, Deng X, et al. Accessor Variety Criteria for Chinese Word Extraction[J]. Computational Linguistics, 2004,30(1):75-93.
[45] Huang J H, Powers D. Chinese Word Segmentation Based on Contextual Entropy[C]// Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, Sentosa, Singapore. New York, USA: ACL, 2003: 152-158.
[46] Chang J S, Lin T. Unsupervised Word Segmentation Without Dictionary[C]// Proceedings of the 15th Annual Conference on Computational Linguistics and Speech Processing. 2003.
[47] Chen S, Xu Y, Chang H. A Simple and Effective Unsupervised Word Segmentation Approach[C]// Proceedings of the 25th AAAI Conference on Artificial Intelligence, San Francisco, USA. California, USA: AAAI, 2011.
[48] Magistry P, Sagot B. Unsupervized Word Segmentation: The Case for Mandarin Chinese[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea. New York, USA: ACL, 2012: 383-387.
[49] Magistry P, Sagot B. Can MDL Improve Unsupervised Chinese Word Segmentation?[C]// Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing, Nagoya, Japan. New York, USA: ACL, 2013: 1-10.
[50] Chen M, Chang B, Pei W. A Joint Model for Unsupervised Chinese Word Segmentation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar. New York, USA: ACL, 2014: 854-863.
[51] Goldwater S, Griffiths T L, Johnson M. A Bayesian Framework for Word Segmentation: Exploring the Effects of Context[J]. Cognition, 2009,112(1):21-54.
[52] Jiao F, Wang S, Lee C H, et al. Semi-supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling[C]// Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia. New York, USA: ACL, 2006: 209-216.
[53] Zhao H, Kit C. Integrating Unsupervised and Supervised Word Segmentation: The Role of Goodness Measures[J]. Information Sciences, 2011,181(1):163-183.
[54] Zeng X, Wong D F, Chao L S, et al. Co-regularizing Character-based and Word-based Models for Semi-supervised Chinese Word Segmentation[C]// Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria. New York, USA: ACL, 2013: 171-176.
[55] Yang T, Jiang T J, Kuo C, et al. Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation[C]// Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing. 2011.
[56] Collobert R, Weston J, Bottou L, et al. Natural Language Processing (Almost) from Scratch[J]. Journal of Machine Learning Research, 2011,12:2493-2537.
[57] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based Learning Applied to Document Recognition[J]. Proceedings of the IEEE, 1998,86(11):2278-2324.
[58] Vincent P, Larochelle H, Lajoie I, et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion[J]. Journal of Machine Learning Research, 2010,11:3371-3408.
[59] Chen X, Qiu X, Zhu C, et al. Gated Recursive Neural Network for Chinese Word Segmentation[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China. New York, USA: ACL, 2015: 1744-1753.
[60] Cai D, Zhao H. Neural Word Segmentation Learning for Chinese[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany. New York, USA: ACL, 2016.
[61] Graves A. Long Short-Term Memory[A]// Graves A. Supervised Sequence Labelling with Recurrent Neural Networks[M]. Berlin: Springer, 2012: 37-45.
[62] Schuster M, Paliwal K K. Bidirectional Recurrent Neural Networks[J]. IEEE Transactions on Signal Processing, 1997,45(11):2673-2681.
[63] Pei W, Ge T, Chang B. Max-margin Tensor Neural Network for Chinese Word Segmentation[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, USA. New York, USA: ACL, 2014: 293-303.
[64] Zhang Honggang, Li Huan. Chinese Word Segmentation Method on the Basis of Bidirectional Long-Short Term Memory Model[J]. Journal of South China University of Technology: Natural Science Edition, 2017,45(3):61-67.
[65] Ma J, Ganchev K, Weiss D. State-of-the-art Chinese Word Segmentation with BI-LSTMs[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. New York, USA: ACL, 2018: 4902-4908.
[66] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[C]// Proceedings of the 1st International Conference on Learning Representations, Arizona, USA. New York, USA: ACL, 2013.
[67] Pennington J, Socher R, Manning C. Glove: Global Vectors for Word Representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar. New York, USA: ACL, 2014: 1532-1543.
[68] Peters M E, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, New Orleans, USA. New York, USA: ACL, 2018: 2227-2237.
[69] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA. San Diego, CA: NIPS, 2017: 5998-6008.
[70] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding[OL]. arXiv Preprint, arXiv:1906.08237.
[71] Wang J, Zhou J, Zhou J, et al. Multiple Character Embeddings for Chinese Word Segmentation[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. New York, USA: ACL, 2019: 210-216.
[72] Xue N, Xia F, Chiou F D, et al. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus[J]. Natural Language Engineering, 2005,11(2):207-238.
[73] Liu J, Wu F, Wu C, et al. Neural Chinese Word Segmentation with Dictionary[J]. Neurocomputing, 2019,338:46-54.
[74] Zhao H, Liu Q. The CIPS-SIGHAN CLP2010 Chinese Word Segmentation Bakeoff[C]// Proceedings of the 2010 CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China. New York, USA: ACL, 2010.
[75] Zhang R, Kikui G, Sumita E. Subword-based Tagging by Conditional Random Fields for Chinese Word Segmentation[C]// Proceedings of the 2006 Human Language Technology Conference of the NAACL, New York, USA. New York, USA: ACL, 2006: 193-196.
[76] Ng H T, Low J K. Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?[C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain. New York, USA: ACL, 2004: 277-284.
[77] Zhang Meishan, Deng Zhilong, Che Wanxiang, et al. Combining Statistical Model and Dictionary for Domain Adaption of Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2012,26(2):8-12.
[78] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging[OL]. arXiv Preprint, arXiv:1508.01991.
[79] Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany. USA: ACL, 2016: 1064-1074.
[80] Yao Y, Huang Z. Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation[C]// Proceedings of the 23rd International Conference on Neural Information Processing, Kyoto, Japan. Illinois, USA: INNS, 2016: 345-353.
[81] Feng Guoming, Zhang Xiaodong, Liu Suhui. DBLC Model for Word Segmentation Based on Autonomous Learning[J]. Data Analysis and Knowledge Discovery, 2018,2(5):40-47.
[82] Zhang Wenjing, Zhang Huimeng, Yang Liner, et al. Multi-grained Chinese Word Segmentation with Lattice-LSTM[J]. Journal of Chinese Information Processing, 2019,33(1):18-24.
[83] Gong C, Li Z, Zhang M, et al. Multi-grained Chinese Word Segmentation[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. New York, USA: ACL, 2017: 692-703.
[84] Jin G, Chen X. The Fourth International Chinese Language Processing BakeOff: Chinese Word Segmentation, Named Entity Recognition and Chinese POS Tagging[C]// Proceedings of the 6th SIGHAN Workshop on Chinese Language Processing, Hyderabad, India. New York, USA: ACL, 2008: 69-81.
[85] Huang W, Cheng X, Chen K, et al. Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning[OL]. arXiv Preprint, arXiv:1903.04190.
[86] Zeman D, Popel M, Straka M, et al. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies[C]// Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada. New York, USA: ACL, 2017: 1-19.
[87] Qiu X, Pei H, Yan H, et al. Multi-Criteria Chinese Word Segmentation with Transformer[OL]. arXiv Preprint, arXiv:1906.12035.
[88] He H, Wu L, Yan H, et al. Effective Neural Solution for Multi-Criteria Word Segmentation[A]// Satapathy S C, Bhateja V, Das S. Smart Intelligent Computing and Applications[M]. Springer, 2019: 133-142.
[89] Huang Changning, Li Yumei, Zhu Xiaodan. Tokenization Guidelines of Chinese Text (V5.0)[Z]. Microsoft Research Asia, 2006.
[90] Yu S. Specification for Corpus Processing at Peking University: Word Segmentation, POS Tagging and Phonetic Notation[J]. Chinese Language and Computing, 2003,13:121-158.
[91] Chen X, Shi Z, Qiu X, et al. Adversarial Multi-Criteria Learning for Chinese Word Segmentation[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. New York, USA: ACL, 2017: 1193-1203.
[92] Kipf T N, Welling M. Semi-supervised Classification with Graph Convolutional Networks[C]// Proceedings of the 5th International Conference on Learning Representations, Toulon, France. New York, USA: ACL, 2017.
[93] Collobert R, Weston J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning[C]// Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland. New York, USA: ACM, 2008: 160-167.
[94] Zhang Y, Clark S. A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model[C]// Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Massachusetts, USA. New York, USA: ACL, 2010: 843-852.
[95] Zeng X, Wong D F, Chao L S, et al. Graph-based Semi-supervised Model for Joint Chinese Word Segmentation and Part-of-speech Tagging[C]// Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria. New York, USA: ACL, 2013: 770-779.
[96] Qiu X, Zhao J, Huang X. Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, USA. New York, USA: ACL, 2013: 658-668.
[97] Zheng X, Chen H, Xu T. Deep Learning for Chinese Word Segmentation and POS Tagging[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, USA. New York, USA: ACL, 2013: 647-657.
[98] Wang H J, Si N W, Chen C. An Effective Joint Model for Chinese Word Segmentation and POS Tagging[C]// Proceedings of the 2016 International Conference on Intelligent Information Processing, Wuhan, China. New York, USA: ACM, 2016.
[99] Chen X, Qiu X, Huang X. A Long Dependency Aware Deep Architecture for Joint Chinese Word Segmentation and POS Tagging[OL]. arXiv Preprint, arXiv:1611.05384.
[100] Chen X, Qiu X, Huang X. A Feature-enriched Neural Model for Joint Chinese Word Segmentation and Part-of-speech Tagging[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia. California, USA: IJCAI, 2017: 3960-3966.
[101] Hatori J, Matsuzaki T, Miyao Y, et al. Incremental Joint Approach to Word Segmentation, POS Tagging, and Dependency Parsing in Chinese[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea. New York, USA: ACL, 2012: 1045-1053.
[102] Wang Z, Zong C, Xue N. A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing[C]// Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria. New York, USA: ACL, 2013: 623-627.
[103] Guo Z, Zhang Y, Su C, et al. Character-level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese[J]. IEICE Transactions on Information and Systems, 2016,99(1):257-264.
[104] Shen M, Li W, Choe H J, et al. Consistent Word Segmentation, Part-of-speech Tagging and Dependency Labelling Annotation for Chinese Language[C]// Proceedings of the 26th International Conference on Computational Linguistics, Osaka, Japan. New York, USA: COLING, 2016: 298-308.
[105] Yan H, Qiu X, Huang X. A Unified Model for Joint Chinese Word Segmentation and Dependency Parsing[OL]. arXiv Preprint, arXiv:1904.04697.
[106] Li X, Zong C, Su K. A Unified Model for Solving the OOV Problem of Chinese Word Segmentation[J]. ACM Transactions on Asian and Low-Resource Language Information Processing, 2015,14(3):12-29.
[107] Zhang M, Fu G, Yu N. Segmenting Chinese Microtext: Joint Informal-Word Detection and Segmentation with Neural Networks[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia. California, USA: IJCAI, 2017: 4228-4234.
[108] Shi X, Huang H, Jian P, et al. Neural Chinese Word Segmentation as Sequence to Sequence Translation[C]// Proceedings of the Chinese National Conference on Social Media Processing, Beijing, China. Berlin, Germany: Springer, 2017: 91-103.
[109] Wu F, Liu J, Wu C, et al. Neural Chinese Named Entity Recognition via CNN-LSTM-CRF and Joint Training with Word Segmentation[C]// Proceedings of the 2019 World Wide Web Conference, CA, USA. New York, USA: ACM, 2019: 3342-3348.