Text Semantic Representation with Structure-Function and Entity Recognition: Case Study of Medical Records
Hu Jiming1,2, Qian Wei1,2, Wen Peng3, Lv Xiaoguang4
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Information Retrieval and Knowledge Mining Laboratory, Wuhan University, Wuhan 430072, China
3 School of Marxism, Wuhan University, Wuhan 430072, China
4 Renmin Hospital of Wuhan University, Wuhan 430060, China
Abstract [Objective] This paper uses structural and functional information from Chinese medical records to improve the accuracy of text representation and mining. [Methods] First, we propose a semantic representation strategy for Chinese medical record texts based on their structure-function features. Next, we use a BiLSTM-CRF model, with structure information introduced at the word-vector level, to recognize named entities. Finally, we use a TextCNN model to extract local context features, yielding a vector representation that carries richer text semantics. [Results] The precision, recall, and F-value of the new model reached 93.20%, 95.19%, and 94.19%, respectively, and the classification accuracy reached 92.12%. [Limitations] The model still needs to be evaluated on more texts, and the structure recognition process needs refinement. [Conclusions] The proposed method can effectively improve the accuracy of named entity recognition and enrich the semantic representation of the texts.
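The abstract does not say how structure information enters the word vectors. One simple possibility, assumed here purely for illustration, is to append a one-hot encoding of each token's record section to its word embedding before the sequence is passed to the tagger. The section labels, the toy embeddings, and the helper names below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical structure-function labels for sections of a medical record.
SECTION_LABELS = ["chief_complaint", "present_illness", "past_history", "diagnosis"]


def one_hot(label: str, labels: List[str]) -> List[float]:
    """Return a one-hot vector marking the position of `label` in `labels`."""
    if label not in labels:
        raise ValueError(f"unknown section label: {label}")
    return [1.0 if name == label else 0.0 for name in labels]


def add_structure_features(
    tokens: List[str],
    sections: List[str],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> List[List[float]]:
    """Concatenate each token's word vector with its section's one-hot vector.

    Tokens missing from `word_vectors` fall back to a zero vector of length `dim`.
    """
    if len(tokens) != len(sections):
        raise ValueError("tokens and sections must have the same length")
    features = []
    for token, section in zip(tokens, sections):
        vector = word_vectors.get(token, [0.0] * dim)
        features.append(list(vector) + one_hot(section, SECTION_LABELS))
    return features


# Toy 2-dimensional embeddings (assumed values); real ones would come from a trained model.
toy_vectors = {"头痛": [0.1, 0.3], "三天": [0.2, 0.5]}
toy_features = add_structure_features(
    ["头痛", "三天", "高血压"],
    ["chief_complaint", "chief_complaint", "past_history"],
    toy_vectors,
    dim=2,
)
```

Each row of `toy_features` has length 6 (2 embedding dimensions plus 4 section dimensions), so a sequence model consuming these rows would see the same word differently depending on the section in which it occurs.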
Received: 14 October 2021
Published: 23 September 2022
Fund: National Natural Science Foundation of China (71874125); Young Top-notch Talent Cultivation Program of Hubei Province
Corresponding Author:
Wen Peng, ORCID: 0000-0002-0278-7391
E-mail: wenpeng@whu.edu.cn