Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 105-113    DOI: 10.11925/infotech.2096-3467.2018.1445
Extracting New Words with Mutual Information and Logistic Regression
Xianlai Chen1,3, Chaopeng Han2, Ying An1,3, Li Liu1, Zhongmin Li1, Rong Yang4
1Information Security and Big Data Research Institute, Central South University, Changsha 410083, China
2School of Information Science and Engineering, Central South University, Changsha 410083, China
3National Engineering Laboratory for Medical Big Data Application Technology, Central South University, Changsha 410083, China
4Xiangya Hospital, Central South University, Changsha 410078, China
Abstract  

[Objective] This paper improves a method for new word extraction in order to raise the performance of medical text segmentation models. [Methods] Using the traditional mutual information model, we collected statistics on words and character strings. We then trained a logistic regression classifier on these statistics and built an algorithm for new word identification. [Results] We ran a series of experiments on electronic medical record texts from the Dermatology Department of Xiangya Hospital. Compared with PMI, PMI2 and PMI3, our model with logistic regression achieved the highest accuracy in new word extraction (0.803). [Limitations] Building the logistic regression classifier requires manually labeling whether each training string is a word. [Conclusions] The proposed model and algorithm can effectively identify new words in medical records.
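
The mutual information statistics named in the abstract can be illustrated with a minimal sketch. It assumes that PMI2 and PMI3 refer to the usual PMI^k family, PMI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), and that candidates are adjacent character pairs. The function name `pmi_k` and the sample string are illustrative only and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_k(text, k=1):
    """Score every adjacent character pair in `text` with PMI^k.

    PMI^k(x, y) = log2(p(x, y)**k / (p(x) * p(y))); k=1 is plain PMI.
    Probabilities are simple relative frequencies (an assumption).
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy ** k / (p_x * p_y))
    return scores


# Illustrative string only; "皮肤" occurs twice, "疹全" once.
sample = "全身皮肤红斑丘疹全身皮肤红斑"
for k in (1, 2, 3):
    scores = pmi_k(sample, k)
    print(k, round(scores["皮肤"], 3), round(scores["疹全"], 3))
```

On this toy string, plain PMI gives the repeated pair "皮肤" and the one-off pair "疹全" the same score, while k=2 ranks the repeated pair higher. Raising k also lowers every score, so under a fixed threshold fewer candidates pass.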

Key words: Medical Text; Word Segmentation; New Word Discovery; Logistic Regression; Mutual Information Model
Received: 24 December 2018      Published: 29 September 2019
ZTFLH:  TP393 G35  
Corresponding Authors: Rong Yang     E-mail: cxlyr0576@163.com

Cite this article:

Xianlai Chen, Chaopeng Han, Ying An, Li Liu, Zhongmin Li, Rong Yang. Extracting New Words with Mutual Information and Logistic Regression. Data Analysis and Knowledge Discovery, 2019, 3(8): 105-113.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1445     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I8/105

Training set size   Validation set size   Accuracy   Precision   Recall   F1
100                 1,000                 0.692      0.712       0.658    0.684
200                 1,000                 0.711      0.724       0.701    0.712
300                 1,000                 0.755      0.744       0.801    0.771
500                 1,000                 0.801      0.821       0.802    0.811
1,000               1,000                 0.792      0.812       0.824    0.802
1,500               1,000                 0.806      0.831       0.829    0.830
2,000               1,000                 0.804      0.827       0.837    0.829

Model    Top 20 extracted new words
PMI      主诉 全身 身皮 皮肤 肤红 红斑 丘疹 疹年 加重 重伴 伴糜 糜烂 烂结 结痂 痂年 年余 余现 现病 病史 患者
PMI2     主诉 全身 皮肤 红斑 丘疹 加重 伴糜 结痂 现病 病史 患者 年前 明显 诱因 出现 皮损 对症 分布 疗后 以来
PMI3     <no words extracted>
PMI+LR   主诉 全身 皮肤 肤红 红斑 丘疹 加重 伴糜 结痂 现病 病史 患者 年前 明显 显诱 诱因 出现 皮损 对症 分布

Model    Number of words   Accuracy   Recall
PMI      43,531            19.7%      100%
PMI2     852               89.7%      8.9%
PMI3     0                 -          0
PMI+LR   8,605             80.3%      82.1%
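
The PMI+LR approach pairs mutual information statistics with a logistic regression classifier trained on manually labeled strings. The following is a minimal sketch of that idea, not the authors' implementation: the two features (a PMI score and a log frequency), the toy values, and the names `train_logistic` and `predict` are all assumptions made for illustration, and the classifier is a plain gradient-descent fit rather than any particular library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(z) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return True when the string is classified as a word."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return sigmoid(z) >= 0.5


# Toy values invented for illustration: [PMI score, log frequency].
# Labels stand in for the manual word / non-word judgments.
train_x = [[2.9, 1.6], [2.4, 1.1], [3.1, 0.7],
           [0.4, 1.4], [-0.6, 0.7], [0.1, 0.2]]
train_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(train_x, train_y)
print(predict(weights, bias, [2.7, 1.0]))
print(predict(weights, bias, [0.0, 1.0]))
```

A learned decision boundary of this kind replaces the single hand-set PMI threshold, at the cost of the manual labeling noted under Limitations.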

Word length             Domain-specific words
Two-character words     充盈 肌酐 管瘤 囊肿 盗汗 望城 甲亢 晕厥 祁阳 癫痫 汤剂 麝香 胬肉 吡嗪 东莞 挛缩 钡餐
Three-character words   汉寿县 过敏史 甲状腺 尿常规 甘石洗 踝关节 转氨酶 脱氢酶 银屑病 娄底市 磷霉素 岳阳市 东安县 肾移植 骨髓瘤 江华县 宜章县
Four-character words    头孢他啶 地塞米松 苯海拉明 活血化瘀 灰黄霉素 右旋糖酐 宣武医院 黔东南州 张家界市 呋喃唑酮 核糖核酸 高钾血症 重铬酸钾

Segmentation method   Accuracy   Recall   F1
jieba                 0.781      0.812    0.752
PMI+jieba             0.822      0.876    0.848
PMI2+jieba            0.834      0.869    0.851
PMI3+jieba            0.781      0.812    0.752
PMI+LR+jieba          0.908      0.956    0.929