1. Information Security and Big Data Research Institute, Central South University, Changsha 410083, China
2. School of Information Science and Engineering, Central South University, Changsha 410083, China
3. National Engineering Laboratory for Medical Big Data Application Technology, Central South University, Changsha 410083, China
4. Xiangya Hospital, Central South University, Changsha 410078, China
[Objective] This paper modifies the method for new word extraction in order to improve the performance of medical text segmentation models. [Methods] Using the traditional mutual information model, we obtained statistics for words and strings. We then trained a logistic regression classification model on these statistics and built an algorithm for new word identification. [Results] We ran a series of experiments on electronic medical record texts from the Dermatology Department of Xiangya Hospital. Compared with PMI, PMI² and PMI³, our model with logistic regression achieved the highest accuracy of new word extraction (0.803). [Limitations] Building the logistic regression classifier requires manually judging whether each training string is a word. [Conclusions] The proposed model and algorithm can effectively identify new words in medical records.
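The pipeline summarized in the abstract (mutual information statistics fed to a logistic regression classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not list the exact statistics used, so the feature vector [PMI, PMI², PMI³] is an assumption, PMI^k is taken in its common form log2(p(xy)^k / (p(x) p(y))), and the three-sentence corpus, the hand-assigned labels, and all function names are hypothetical.

```python
import math
from collections import Counter


def pmi_k(corpus, k=1):
    """Map each adjacent character pair to PMI^k = log2(p(xy)**k / (p(x) * p(y)))."""
    chars, pairs = Counter(), Counter()
    for sent in corpus:
        chars.update(sent)  # single-character counts
        pairs.update(sent[i:i + 2] for i in range(len(sent) - 1))  # bigram counts
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    return {
        pair: math.log2(
            (count / n_pairs) ** k
            / ((chars[pair[0]] / n_chars) * (chars[pair[1]] / n_chars))
        )
        for pair, count in pairs.items()
    }


def pmi_features(corpus):
    """Assumed feature vector per candidate string: [PMI, PMI^2, PMI^3]."""
    tables = [pmi_k(corpus, k) for k in (1, 2, 3)]
    return {pair: [t[pair] for t in tables] for pair in tables[0]}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def word_probability(x, w, b):
    """Probability that the candidate with features x is a word."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = word_probability(xi, w, b) - yi  # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Toy data invented for illustration; labels are assigned by hand (1 = word),
# mirroring the manual judgement step named under [Limitations].
corpus = ["皮肤瘙痒", "皮肤红斑", "红斑瘙痒"]
labels = {"皮肤": 1, "瘙痒": 1, "红斑": 1, "肤瘙": 0, "肤红": 0, "斑瘙": 0}

feats = pmi_features(corpus)
w, b = train_logistic([feats[p] for p in labels], [labels[p] for p in labels])
for pair in labels:
    print(pair, round(word_probability(feats[pair], w, b), 3))
```

In this toy setting the classifier only has to separate within-word pairs from pairs that straddle a word boundary. On real records the candidate strings, the statistics, and the labeled training set would all need to come from the corpus and annotators.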
陈先来, 韩超鹏, 安莹, 刘莉, 李忠民, 杨荣. 基于互信息和逻辑回归的新词发现[J]. 数据分析与知识发现, 2019, 3(8): 105-113.
Xianlai Chen, Chaopeng Han, Ying An, Li Liu, Zhongmin Li, Rong Yang. Extracting New Words with Mutual Information and Logistic Regression[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 105-113.