Extracting New Words with Mutual Information and Logistic Regression
Xianlai Chen1,3, Chaopeng Han2, Ying An1,3, Li Liu1, Zhongmin Li1, Rong Yang4
1 Information Security and Big Data Research Institute, Central South University, Changsha 410083, China
2 School of Information Science and Engineering, Central South University, Changsha 410083, China
3 National Engineering Laboratory for Medical Big Data Application Technology, Central South University, Changsha 410083, China
4 Xiangya Hospital, Central South University, Changsha 410078, China
Abstract [Objective] This paper modifies an existing method for new word extraction in order to improve the performance of medical text segmentation models. [Methods] Using the traditional mutual information model, we obtained statistics for words and character strings. We then trained a logistic regression classifier on these statistics and built an algorithm for new word identification. [Results] We ran a series of experiments on electronic medical record texts from the Dermatology Department of Xiangya Hospital. Compared with PMI, PMI² and PMI³, our model with logistic regression achieved the highest accuracy for new word extraction (0.803). [Limitations] Training the logistic regression classifier requires manually judging whether each training string is a word. [Conclusions] The proposed model and algorithm can effectively identify new words in medical records.
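
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the PMI variants named above belong to the PMI^k family, defined as log2(p(xy)^k / (p(x) p(y))), that the three scores serve as classifier features, and that a plain stochastic-gradient logistic regression stands in for the paper's model. The function names (`pmi_k`, `features`, `train_logistic`, `predict`), the counts, and the labels are all hypothetical.

```python
import math

def pmi_k(count_xy, count_x, count_y, total, k=1):
    """PMI^k of a candidate string xy: log2(p(xy)^k / (p(x) * p(y)))."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy ** k / (p_x * p_y))

def features(count_xy, count_x, count_y, total):
    """Feature vector [PMI, PMI^2, PMI^3] for one candidate string."""
    return [pmi_k(count_xy, count_x, count_y, total, k) for k in (1, 2, 3)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 if the candidate is classified as a word, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical counts: (count of xy, count of x, count of y), corpus size 1024.
TOTAL = 1024
candidates = [(16, 16, 16), (32, 32, 64), (4, 128, 256), (2, 64, 128)]
labels = [1, 1, 0, 0]  # manual word / non-word judgments (invented for illustration)

X = [features(cxy, cx, cy, TOTAL) for cxy, cx, cy in candidates]
w, b = train_logistic(X, labels)
predictions = [predict(w, b, x) for x in X]
```

The manual labels in `labels` correspond to the limitation the abstract notes: each training string has to be judged by hand before the classifier can be fitted.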

Received: 24 December 2018
Published: 29 September 2019

Corresponding author: Rong Yang
E-mail: cxlyr0576@163.com