Text Semantic Representation with Structure-Function and Entity Recognition: Case Study of Medical Records
Hu Jiming1,2, Qian Wei1,2, Wen Peng3, Lv Xiaoguang4
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Information Retrieval and Knowledge Mining Laboratory, Wuhan University, Wuhan 430072, China
3 School of Marxism, Wuhan University, Wuhan 430072, China
4 Renmin Hospital of Wuhan University, Wuhan 430060, China
[Objective] This paper aims to improve the accuracy of text representation and mining by exploiting the structural and functional information in Chinese medical records. [Methods] First, we proposed a semantic representation strategy for Chinese medical record texts based on their structure-function features. Then, we introduced structure information at the word-vector level and used the BiLSTM-CRF model to recognize named entities. Finally, we used the TextCNN model to extract local context features, yielding a vector representation that carries richer text semantics. [Results] The precision, recall, and F value of the new model reached 93.20%, 95.19%, and 94.19%, respectively, and the classification accuracy reached 92.12%. [Limitations] The model still needs to be evaluated on more texts, and the structure recognition process needs refinement. [Conclusions] The proposed method effectively improves the accuracy of named entity recognition and enriches the semantic representation of the texts.
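The two ideas in the Methods summary, adding structure information at the word-vector level and extracting local context features with a convolutional filter, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The section labels, the one-hot structure encoding, and the function names `one_hot`, `add_structure`, and `conv_max_pool` are all assumptions made for illustration, and the BiLSTM-CRF tagging step is omitted.

```python
from typing import List

# Hypothetical section labels for a medical record; the abstract does not
# list the paper's actual structure-function categories.
SECTIONS = ["chief_complaint", "history", "examination", "diagnosis"]


def one_hot(section: str) -> List[float]:
    """Encode a section label as a one-hot structure vector (an assumption;
    a learned structure embedding would be used the same way)."""
    return [1.0 if name == section else 0.0 for name in SECTIONS]


def add_structure(word_vecs: List[List[float]], section: str) -> List[List[float]]:
    """Append the section's structure vector to every word vector."""
    tag = one_hot(section)
    return [vec + tag for vec in word_vecs]


def conv_max_pool(seq: List[List[float]], kernel: List[List[float]]) -> float:
    """Slide one filter over the token sequence, apply ReLU to each window
    score, and keep the maximum (max-over-time pooling, as in TextCNN)."""
    width = len(kernel)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        total = sum(
            w * k
            for row, kernel_row in zip(window, kernel)
            for w, k in zip(row, kernel_row)
        )
        scores.append(max(0.0, total))
    # A sequence shorter than the filter yields no windows.
    return max(scores, default=0.0)


# Three toy 2-dimensional word vectors from a "diagnosis" section.
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
enriched = add_structure(words, "diagnosis")  # each vector now has 2 + 4 dims

# One filter spanning two tokens; all-ones weights are placeholder values.
kernel = [[1.0] * 6, [1.0] * 6]
feature = conv_max_pool(enriched, kernel)
```

In a full TextCNN, many such filters of several widths would be learned, and their pooled outputs concatenated into the text vector passed to the classifier.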
Hu Jiming, Qian Wei, Wen Peng, Lv Xiaoguang. Text Semantic Representation with Structure-Function and Entity Recognition: Case Study of Medical Records[J]. Data Analysis and Knowledge Discovery, 2022, 6(8): 110-121.