Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 70-80    DOI: 10.11925/infotech.2096-3467.2020.1139
Automatically Extracting Structural Elements of Sci-Tech Literature Abstracts Based on Deep Learning
Zhao Danning1, Mu Dongmei1,2, Bai Sen2
1School of Public Health, Jilin University, Changchun 130021, China
2Division of Clinical Research, The First Hospital of Jilin University, Changchun 130021, China
Abstract  

[Objective] This paper proposes a deep learning-based method to automatically extract key structural elements from unstructured abstracts of sci-tech literature. [Methods] We used structured abstracts as the training corpus and applied deep learning methods (an LSTM with an attention mechanism) to extract the “objective”, “method”, and “results” elements from sci-tech literature, and then generated new structured abstracts. [Results] The method’s F-scores for the three structural elements “objective”, “method”, and “results” were 0.951, 0.916, and 0.960, respectively. [Limitations] The deep learning model used in this paper offers limited interpretability. [Conclusions] The proposed method can effectively extract structural elements from unstructured abstracts.
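As a rough illustration of the kind of model the abstract describes, the sketch below builds an Attention-LSTM sentence classifier in TensorFlow 2.x Keras (the environment reported later). The vocabulary size, sentence length, and embedding dimension are placeholders, the attention layer is a generic additive-attention pooling, and none of this should be read as the authors' exact architecture.

```python
# Minimal sketch (not the authors' exact model): an Attention-LSTM sentence
# classifier that assigns each abstract sentence to "objective", "method", or "results".
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 50         # placeholder maximum sentence length in tokens
EMBED_DIM = 100      # placeholder embedding dimension
LSTM_UNITS = 40      # LSTM output vector dimension from the parameter table below
NUM_CLASSES = 3      # objective / method / results

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
h = layers.LSTM(LSTM_UNITS, return_sequences=True)(x)       # per-token hidden states

# Additive attention pooling: score each time step, normalize, take a weighted sum.
scores = layers.Dense(1, activation="tanh")(h)               # (batch, MAX_LEN, 1)
weights = layers.Softmax(axis=1)(scores)                     # attention weights over time
context = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])  # (batch, LSTM_UNITS)

outputs = layers.Dense(NUM_CLASSES, activation="softmax")(context)
model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Each sentence is encoded token by token; the attention weights decide which hidden states dominate the sentence representation before the softmax layer assigns it to one of the three elements.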

Key words: Deep Learning; Attention-LSTM; Structural Elements Extraction
Received: 18 November 2020      Published: 11 August 2021
CLC number (ZTFLH): TP391
Fund: National Natural Science Foundation of China (71974074); Scientific and Technological Developing Scheme of Jilin Province (20200301004RQ)
Corresponding Authors: Mu Dongmei     E-mail: moudm@jlu.edu.cn

Cite this article:

Zhao Danning,Mu Dongmei,Bai Sen. Automatically Extracting Structural Elements of Sci-Tech Literature Abstracts Based on Deep Learning. Data Analysis and Knowledge Discovery, 2021, 5(7): 70-80.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1139     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I7/70

Training Flow of the Model
Structure Diagram of Structural Element Extraction Model
Structure of Attention Layer Network
Structural Element | Prompt Words
Objective | Introduction; Aim; Purpose; Objective; Background
Method | Method; Design; Material and Method
Results | Finding; Result; Conclusion; Result and Conclusion
Mapping Relationship Between Structural Elements and Prompt Words
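This mapping is what turns structured abstracts into labeled training sentences. A minimal sketch of that lookup is shown below, assuming a simple heading-to-label dictionary; the dictionary and the `label_section` helper are illustrative, not taken from the paper.

```python
# Illustrative sketch (assumed approach, not the authors' exact code): derive a
# training label for each structured-abstract section from its heading, using
# the prompt-word mapping in the table above.
PROMPT_TO_ELEMENT = {
    "introduction": "objective", "aim": "objective", "purpose": "objective",
    "objective": "objective", "background": "objective",
    "method": "method", "design": "method", "material and method": "method",
    "finding": "results", "result": "results", "conclusion": "results",
    "result and conclusion": "results",
}

def label_section(heading):
    """Map a section heading such as 'METHODS' to a structural-element label."""
    key = heading.strip().lower()
    if key.endswith("s"):                 # normalize plurals: "results" -> "result"
        key = key[:-1]
    return PROMPT_TO_ELEMENT.get(key)     # None for headings outside the mapping

# e.g. label_section("Background") -> "objective"; label_section("FINDINGS") -> "results"
```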
Component | Specification / Model
CPU | Intel i7-4700MQ @ 2.40 GHz
RAM | 32 GB
GPU | GeForce GTX 765M
GPU memory | 2 GB
TensorFlow version | 2.0.0
Experimental Environment Parameters
Statistical Results of Abstract Length
Statistical Results of Abstract Sentence Length
Parameter | Value
Epoch | 20
Batch Size | 64
Optimizer | Adam
LSTM output vector dimension | 40
Parameter Settings
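With these settings, a training run on the illustrative classifier sketched after the abstract might look like the following; `model`, `x_train`, and `y_train` are hypothetical, and the validation split is an assumption, not a value reported in the paper.

```python
# Hedged sketch of a training call using the hyperparameters listed above;
# model, x_train, y_train are hypothetical (see the classifier sketch after the abstract).
history = model.fit(
    x_train, y_train,
    epochs=20,             # Epoch = 20
    batch_size=64,         # Batch Size = 64 (optimizer Adam is set at compile time)
    validation_split=0.1,  # assumed hold-out fraction, not reported in the paper
)
```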
Correlation Between the Number of LSTM Layers and Model Accuracy
Network Structure of the Attention Layer with No Shared Weights
Correlation Between the Attention Mechanism and Model Accuracy
An Example of Unstructured Abstract
An Example of Machine Annotation Results for Unstructured Abstract
PMID | "Method" first sentence (ground truth) | "Results" first sentence (ground truth) | "Method" first sentence (predicted) | "Results" first sentence (predicted) | Total sentences in abstract
31586506 | 2 | 5 | 2 | 6 | 6
31823650 | 6 | 9 | 6 | 9 | 11
31106574 | 6 | 7 | 6 | 7 | 11
31257264 | 3 | 6 | 3 | 6 | 9
32105001 | 5 | 10 | 5 | 10 | 13
31166383 | 2 | 4 | 2 | 4 | 8
Examples of Manual and Machine Labeling Results of Unstructured Abstracts
Evaluation Metric | Result
MSE_M | 0.148
MSE_R | 0.556
MSE_MR | 0.352
Mean Square Error of Structured Abstracts
Evaluation Metric | Result
MSE_M | 0.348
MSE_R | 0.418
MSE_MR | 0.383
Mean Square Error of Unstructured Abstracts
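Reading MSE_M, MSE_R, and MSE_MR as the mean squared difference between predicted and true starting-sentence positions of the "method" and "results" elements (and their combination), the computation can be sketched as below. The six example abstracts listed above serve only as illustrative input, not as the evaluation set behind the reported values, and treating MSE_MR as the average of the two element-level errors is an assumption.

```python
# Hedged sketch: MSE between predicted and true first-sentence positions of the
# "method" (M) and "results" (R) elements, assuming MSE_MR averages both elements.
true_m, pred_m = [2, 6, 6, 3, 5, 2], [2, 6, 6, 3, 5, 2]    # example rows above
true_r, pred_r = [5, 9, 7, 6, 10, 4], [6, 9, 7, 6, 10, 4]

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse_m = mse(true_m, pred_m)      # 0.0 on these six examples
mse_r = mse(true_r, pred_r)      # (5 - 6)^2 / 6 ≈ 0.167 on these six examples
mse_mr = (mse_m + mse_r) / 2     # assumed definition of MSE_MR
```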
Structural Element | Precision | Recall | F-value
Objective | 0.953 | 0.988 | 0.970
Method | 0.937 | 0.931 | 0.934
Results | 0.973 | 0.965 | 0.969
Average | 0.954 | 0.961 | 0.958
Effect Evaluation of Structured Abstracts
Structural Element | Precision | Recall | F-value
Objective | 0.937 | 0.966 | 0.951
Method | 0.922 | 0.910 | 0.916
Results | 0.965 | 0.956 | 0.960
Average | 0.941 | 0.944 | 0.942
Effect Evaluation of Unstructured Abstracts
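The F-values in these tables appear to be the usual harmonic mean of precision and recall; a quick check against the "Objective" row of the unstructured-abstract table, assuming that formula:

```python
# Assumed formula: F-value = harmonic mean of precision and recall.
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.937, 0.966), 3))   # 0.951, matching the "Objective" row above
```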
Structural Element | Precision (this paper / Ref. [7]) | Recall (this paper / Ref. [7]) | F-value (this paper / Ref. [7])
Objective | 0.953 / 0.957 | 0.988 / 0.960 | 0.970 / 0.958
Method | 0.937 / 0.900 | 0.931 / 0.900 | 0.934 / 0.904
Results | 0.973 / 0.907 | 0.965 / 0.912 | 0.969 / 0.910
Accuracy Comparison of Structured Abstracts
Structural Element | Precision (this paper / Ref. [7]) | Recall (this paper / Ref. [7]) | F-value (this paper / Ref. [7])
Objective | 0.937 / 0.955 | 0.966 / 0.841 | 0.951 / 0.895
Method | 0.922 / 0.787 | 0.910 / 0.856 | 0.916 / 0.820
Results | 0.965 / 0.819 | 0.956 / 0.825 | 0.960 / 0.822
Accuracy Comparison of Unstructured Abstracts
Field | Objective (Precision / Recall / F-value) | Method (Precision / Recall / F-value) | Results (Precision / Recall / F-value)
1 | 0.968 / 0.966 / 0.967 | 0.933 / 0.923 / 0.928 | 0.971 / 0.977 / 0.974
2 | 0.956 / 0.962 / 0.959 | 0.930 / 0.916 / 0.923 | 0.975 / 0.980 / 0.977
3 | 0.967 / 0.960 / 0.963 | 0.905 / 0.919 / 0.912 | 0.972 / 0.968 / 0.970
4 | 0.969 / 0.968 / 0.968 | 0.920 / 0.928 / 0.924 | 0.974 / 0.971 / 0.973
5 | 0.967 / 0.956 / 0.961 | 0.932 / 0.923 / 0.928 | 0.971 / 0.979 / 0.975
6 | 0.971 / 0.964 / 0.968 | 0.947 / 0.916 / 0.932 | 0.952 / 0.978 / 0.965
7 | 0.981 / 0.967 / 0.974 | 0.958 / 0.954 / 0.956 | 0.970 / 0.977 / 0.973
8 | 0.982 / 0.968 / 0.975 | 0.919 / 0.947 / 0.933 | 0.978 / 0.967 / 0.972
9 | 0.960 / 0.980 / 0.970 | 0.944 / 0.905 / 0.924 | 0.969 / 0.981 / 0.975
10 | 0.969 / 0.984 / 0.976 | 0.966 / 0.936 / 0.951 | 0.969 / 0.984 / 0.977
Model Accuracy in Different Fields of Medline
No. | Field | Journal
1 | Environmental Science | Agriculture, Ecosystems & Environment
2 | Chemistry | Carbohydrate Polymers
3 | Computer Science | Expert Systems with Applications
4 | Food Science & Technology | Journal of Food Engineering
Partial Fields and Journals List of Web of Science
Field | Objective (Precision / Recall / F-value) | Method (Precision / Recall / F-value) | Results (Precision / Recall / F-value)
1 | 0.865 / 0.951 / 0.906 | 0.720 / 0.699 / 0.709 | 0.966 / 0.932 / 0.949
2 | 0.672 / 1.000 / 0.804 | 0.910 / 0.628 / 0.743 | 0.969 / 0.947 / 0.957
3 | 0.746 / 0.935 / 0.830 | 0.817 / 0.378 / 0.517 | 0.621 / 0.956 / 0.752
4 | 0.741 / 0.948 / 0.832 | 0.747 / 0.541 / 0.628 | 0.832 / 0.879 / 0.855
Accuracy in Different Fields of Web of Science
[1] 赵丽莹, 苗秀芝, 国荣. 中文科技期刊采用结构式长摘要的建议[J]. 编辑学报, 2017, 29(S1):59-61.
[1] (Zhao Liying, Miao Xiuzhi, Guo Rong. Suggestions on Extended Structured Abstract of Chinese Language Sci-Tech Journal[J]. Acta Editologica, 2017, 29(S1):59-61.)
[2] Zhang C F, Liu X L. Review of James Hartley’s Research on Structured Abstracts[J]. Journal of Information Science, 2011, 37(6):570-576.
doi: 10.1177/0165551511420217
[3] Budgen D, Burn A J, Kitchenham B. Reporting Computing Projects Through Structured Abstracts: A Quasi-experiment[J]. Empirical Software Engineering, 2011, 16(2):244-277.
doi: 10.1007/s10664-010-9139-3
[4] 李清. 基于机器学习的文本摘要技术的研究与实现[D]. 成都: 电子科技大学, 2020.
[4] (Li Qing. Research and Implementation of Text Summarization Technology Based on Machine Learning[D]. Chengdu: University of Electronic Science and Technology of China, 2020.)
[5] 周青宇. 基于深度神经网络的文本自动摘要研究[D]. 哈尔滨: 哈尔滨工业大学, 2020.
[5] (Zhou Qingyu. Research on Deep Neural Networks Based Automatic Text Summarization[D]. Harbin: Harbin Institute of Technology, 2020.)
[6] Almugbel Z, Elhaggar N, Bugshan N. Automatic Structured Abstract for Research Papers Supported by Tabular Format Using NLP[J]. International Journal of Advanced Computer Science and Applications, 2019, 10(2):233-240.
[7] Nam S, Jeong S, Kim S K, et al. Structuralizing Biomedical Abstracts with Discriminative Linguistic Features[J]. Computers in Biology and Medicine, 2016, 79:276-285.
doi: 10.1016/j.compbiomed.2016.10.026
[8] Ripple A M, Mork J G, Knecht L S, et al. A Retrospective Cohort Study of Structured Abstracts in Medline, 1992-2006[J]. Journal of the Medical Library Association, 2011, 99(2):160-163.
doi: 10.3163/1536-5050.99.2.009 pmid: 21464855
[9] Harbourt A M, Knecht L S, Humphreys B L. Structured Abstracts in Medline, 1989-1991[J]. Bulletin of the Medical Library Association, 1995, 83(2):190-195.
pmid: 7599584
[10] Ripple A M, Mork J G, Rozier J M, et al. Structured Abstracts in Medline: Twenty-Five Years Later[R]. National Library of Medicine, 2012: 1-3.
[11] 曾志红. 科技期刊结构式摘要的探索与实践——以数学学术性论文为例[J]. 湖北第二师范学院学报, 2019, 36(12):104-108.
[11] (Zeng Zhihong. Exploration and Practice of Structured Abstracts in Scientific Journals: A Case Study of Mathematical Academic Papers[J]. Journal of Hubei University of Education, 2019, 36(12):104-108.)
[12] 宋东桓, 李晨英, 刘子瑜, 等. 英文科技论文摘要的语义特征词典构建[J]. 图书情报工作, 2020, 64(6):108-119.
[12] (Song Donghuan, Li Chenying, Liu Ziyu, et al. Semantic Feature Dictionary Construction of Abstract in English Scientific Journals[J]. Library and Information Service, 2020, 64(6):108-119.)
[13] Graetz N. Teaching EFL Students to Extract Structural Information from Abstracts[A]// Ulijn J M, Pugh A K. Reading for Professional Purposes: Methods and Materials in Teaching Languages[M]. Leuven, Belgium: Acco Press, 1985: 123-135.
[14] Nilsen D L F, Nilsen A P. Semantic Theory: A Linguistic Perspective[J]. Teaching German, 1975, 11(2):1-20.
[15] 郑梦悦, 秦春秀, 马续补. 面向中文科技文献非结构化摘要的知识元表示与抽取研究——基于知识元本体理论[J]. 情报理论与实践, 2020, 43(2):157-163.
[15] (Zheng Mengyue, Qin Chunxiu, Ma Xubu. Research on Knowledge Unit Representation and Extraction for Unstructured Abstracts of Chinese Scientific and Technical Literature: Ontology Theory Based on Knowledge Unit[J]. Information Studies: Theory and Application, 2020, 43(2):157-163.)
[16] 邹箭, 钟茂生, 孟荔. 中文文本分割模式获取及其优化方法[J]. 南昌大学学报(理科版), 2011, 49(6):597-601.
[16] (Zou Jian, Zhong Maosheng, Meng Li. Method of Chinese Text Segmentation Model Acquisition and its Optimization[J]. Journal of Nanchang University(Natural Science), 2011, 49(6):597-601.)
[17] Ribeiro S, Yao J T, Rezende D A. Discovering IMRaD Structure with Different Classifiers[C]// Proceedings of IEEE International Conference on Big Knowledge (ICBK), Singapore. Los Alamitos, CA: IEEE Computer Society, 2018: 200-204.
[18] 丁良萍, 张智雄, 刘欢. 影响支持向量机模型语步自动识别效果的因素研究[J]. 数据分析与知识发现, 2019, 3(11):16-23.
[18] (Ding Liangping, Zhang Zhixiong, Liu Huan. Factors Affecting Rhetorical Move Recognition with SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(11):16-23.)
[19] 赵丹宁, 牟冬梅, 斯琴. 研究型科技文献的实验数据自动抽取研究——以药物代谢动力学文献为例[J]. 图书馆建设, 2017, 40(12):33-38.
[19] (Zhao Danning, Mu Dongmei, Si Qin. Research on Experimental Data Automatic Extraction of Scientific and Technological Literature——A Case Study of Pharmacokinetic Literature[J]. Library Development, 2017, 40(12):33-38.)
[20] 陈果, 许天祥. 基于主动学习的科技论文句子功能识别研究[J]. 数据分析与知识发现, 2019, 3(8):53-61.
[20] (Chen Guo, Xu Tianxiang. Sentence Function Recognition Based on Active Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(8):53-61.)
[21] Yang M, Tu W T, Wang J X, et al. Attention-Based LSTM for Target-Dependent Sentiment Classification[C]// Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017: 5013-5014.
[22] Gers F A, Schmidhuber J, Cummins F, et al. Learning to Forget: Continual Prediction with LSTM[J]. Neural Computation, 2000, 12(10):2451-2471.
pmid: 11032042
[23] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8):1735-1780.
pmid: 9377276
[24] 赵华茗, 余丽, 周强. 基于均值漂移算法的文本聚类数目优化研究[J]. 数据分析与知识发现, 2019, 3(9):27-35.
[24] (Zhao Huaming, Yu Li, Zhou Qiang. Determining Best Text Clustering Number with Mean Shift Algorithm[J]. Data Analysis and Knowledge Discovery, 2019, 3(9):27-35.)