Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (2): 61-71    DOI: 10.11925/infotech.2096-3467.2022.0933
Designing and Implementing Automatic Title Generation System for Sci-Tech Papers
Wang Yufei1,2, Zhang Zhixiong1,2, Zhao Yang1,2, Zhang Mengting1,2, Li Xuesi1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper designs an automatic title generation system that works from the abstracts of Chinese sci-tech papers, aiming to help researchers compose better titles. [Methods] First, we constructed a large-scale training dataset from the CSCD database. Then, we built a title generation model based on BERT-UniLM. Finally, we exposed the system through an HTTP interface so that it can be called openly. [Results] The implemented system could generate appropriate titles for articles. [Limitations] Because the BERT model limits the maximum token length, the system automatically truncates abstracts that exceed the limit, which might affect title generation. [Conclusions] The system provides a convenient tool for researchers and literature services, and may also benefit automatic title generation for other scientific and technological documents.

Key words: Automatic Title Generation System; Abstracts of Chinese Scientific and Technical Papers; Text Generation Task; BERT-UniLM
Received: 05 September 2022      Published: 28 March 2023
ZTFLH:  G254  
Fund: Project of Literature and Information Capacity Building, Chinese Academy of Sciences (E0290906)
Corresponding Author: Zhang Zhixiong, ORCID: 0000-0003-1596-7487, E-mail: zhangzhx@mail.las.ac.cn

Cite this article:

Wang Yufei, Zhang Zhixiong, Zhao Yang, Zhang Mengting, Li Xuesi. Designing and Implementing Automatic Title Generation System for Sci-Tech Papers. Data Analysis and Knowledge Discovery, 2023, 7(2): 61-71.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0933     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I2/61

The Structure of Automatic Title Generation System for Scientific and Technical Paper
Architecture of Automatic Title Generation Model for Chinese Scientific and Technical Paper
Self-attention Mask Matrix Used for the Seq2Seq Language Model Objective
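The mask named in the caption above can be illustrated with a small sketch. It follows the sequence-to-sequence objective as described for UniLM: source (abstract) tokens attend to each other in both directions, while target (title) tokens attend to the whole source and only to target tokens up to and including themselves. The function name and the 1/0 encoding are assumptions made for illustration, and special tokens such as [CLS] and [SEP] are ignored here; this is not the authors' implementation.

```python
from typing import List


def seq2seq_attention_mask(src_len: int, tgt_len: int) -> List[List[int]]:
    """Build a square mask of size src_len + tgt_len.

    mask[i][j] == 1 means position i may attend to position j; 0 means blocked.
    """
    total = src_len + tgt_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < src_len:
                # Every position, source or target, sees the whole source segment.
                mask[i][j] = 1
            elif i >= src_len and j <= i:
                # Target positions see target tokens up to and including themselves.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Three source tokens followed by two target tokens.
    for row in seq2seq_attention_mask(3, 2):
        print(row)
```

In a real model the zeros would typically be turned into large negative values added to the attention scores before the softmax, so blocked positions receive near-zero weight.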
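Field | Minimum length (characters) | Maximum length (characters) | Average length (characters)
Title | 5 | 50 | 20.04
Abstract | 15 | 1,489 | 261.98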
Statistical Information of the Data Set
Statistical Chart of Word Count Interval
Discipline | Number of papers
Social sciences | 13,773
Natural sciences | 65,104
Medicine | 120,282
Agriculture | 37,312
Engineering | 167,530
The Subject Distribution of the Data Set
Basic Structure of the Data Set
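Parameter | Value
Embedding layer dimension | 512
Number of hidden layers | 12
Hidden layer dimension | 768
Number of attention heads | 12
Network parameters | 110MB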
Parameter Values of the BERT-Base-Chinese Model
Parameter | Value
epoch | 10
batch_size | 5
num_beam | 1
max_input_seq_length | 450
max_output_seq_length | 30
Parameter Values of Model Training
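As a rough illustration of how these settings relate to the truncation mentioned in the abstract, the sketch below stores the table as a configuration dictionary and cuts an over-long abstract to max_input_seq_length. The helper name, the one-character-per-token approximation, and the choice to keep the beginning of the abstract are all assumptions; the page does not say which part of the abstract the system keeps or how its tokenizer counts length.

```python
# Values copied from the training-parameter table above.
TRAINING_CONFIG = {
    "epoch": 10,
    "batch_size": 5,
    "num_beam": 1,  # a beam width of 1 corresponds to greedy decoding
    "max_input_seq_length": 450,
    "max_output_seq_length": 30,
}


def truncate_abstract(
    abstract: str,
    max_len: int = TRAINING_CONFIG["max_input_seq_length"],
) -> str:
    """Keep at most max_len characters of the abstract (hypothetical helper).

    Assumption: one Chinese character is treated as one token. A real
    tokenizer may split Latin words, digits, and symbols differently.
    """
    return abstract[:max_len]


if __name__ == "__main__":
    long_abstract = "摘" * 1489  # same length as the longest abstract in the data set
    print(len(truncate_abstract(long_abstract)))
```

Since the average abstract in the data set is about 262 characters, a limit of 450 would leave most abstracts untouched under this approximation and affect only the longer ones.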
Model | Text processing | ROUGE-1 F1 (%) | ROUGE-2 F1 (%) | ROUGE-L (%) | BLEU (%)
TextRank | Before | 28.39 | 18.89 | 27.90 | 10.77
TextRank | After | 37.26 | 25.73 | 33.76 | 18.00
LSA | Before | 30.65 | 21.16 | 31.06 | 12.55
LSA | After | 37.57 | 26.44 | 34.61 | 18.35
Effects Before and After Text Processing
Model | ROUGE-1 F1 (%) | ROUGE-2 F1 (%) | ROUGE-L (%) | BLEU (%)
TextRank | 37.26 | 25.73 | 33.76 | 18.00
LSA | 37.57 | 26.44 | 34.61 | 18.35
BiLSTM+Attention | 46.26 | 34.31 | 45.92 | 24.45
BERT-UniLM | 68.39 | 55.54 | 64.46 | 44.80
Experimental Results
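To make the ROUGE columns concrete, here is a minimal sketch of ROUGE-N F1 and an LCS-based ROUGE-L score. It assumes character-level units, which is a common choice for Chinese, and reports F1 for ROUGE-L as well; the page does not state the tokenization or the exact ROUGE-L variant the authors used, so scores from this sketch should not be expected to reproduce the table.

```python
from collections import Counter
from typing import Sequence


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision (overlap/cand) and recall (overlap/ref)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1 over character n-grams, with clipped overlap counts."""
    cand, ref = _ngrams(list(candidate), n), _ngrams(list(reference), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of characters."""
    return _f1(_lcs_length(candidate, reference), len(candidate), len(reference))


if __name__ == "__main__":
    generated = "基于nsct和区域能量判断的红外与可见光融合方法"
    original = "基于NSCT的红外与可见光图像融合方法研究"
    print(rouge_n_f1(generated, original, 1), rouge_l_f1(generated, original))
```

Note that this comparison is case-sensitive, so "nsct" and "NSCT" share no characters; whether the authors normalized case before scoring is not stated.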
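Example 1
Abstract: Targeting the characteristics of infrared and visible image fusion, an image fusion method based on the non-subsampled Contourlet transform (NSCT) and regional energy judgment is proposed. The two source images are decomposed with NSCT into one low-frequency sub-image and multiple high-frequency sub-images in different directions. The low-frequency sub-bands are fused by taking the maximum value, while for the high-frequency sub-bands the regional energy matching degree of each coefficient is computed first and a decision threshold is then calculated. Where the matching degree of a point in the high-frequency coefficients is greater than the threshold, regional-energy-weighted fusion is used; where the matching degree of the corresponding point is less than the threshold, the regional energy maximum is used. The fused image is obtained through the inverse NSCT. The method is algorithmically simple and its threshold selection is adaptive. Experimental results show that the method achieves good visual effects and quantitative data: compared with other NSCT-based fusion methods, entropy increases by 0.5%~6.8%, spatial frequency by 1%~13%, and standard deviation by 0~24.1%, making it a simple and effective fusion method.
Original title: 基于NSCT的红外与可见光图像融合方法研究 (Research on an Infrared and Visible Image Fusion Method Based on NSCT)
TextRank: 提出一种基于非下采样Contourlet变换和区域能量判断的图像融合方法 (An image fusion method based on the non-subsampled Contourlet transform and regional energy judgment is proposed)
LSA: 采用区域能量最大值的方法进行融合 (Fusion is performed using the regional energy maximum method)
BiLSTM+Attention: 低频子图像和区域能量判断的图像融合方法研究与应用研究简单方法 (Image fusion method of low-frequency sub-images and regional energy judgment, research and application research, simple method)
BERT-UniLM: 基于nsct和区域能量判断的红外与可见光融合方法 (Infrared and visible fusion method based on nsct and regional energy judgment)

Example 2
Abstract: Controlling the quantity of land resources and strengthening the quality management and ecological protection of cultivated land are very important tasks at present. The southeastern area of Da'an City, Jilin Province was selected as the study area for a geochemical assessment of land quality, and agricultural land grading was studied with pollutant elements integrated. The results show that soil rated grade three or above accounts for 72.61% of the total area; land quality in the study area is good overall, with large areas of high-quality and good land, mainly chernozem, while poor land is mainly saline-alkali soil or salinized meadow soil. As an exploratory step, the productivity evaluation from the agricultural land grading results was combined with the element content evaluation from the land quality geochemical assessment to carry out a green productivity evaluation.
Original title: 土地质量地球化学评估与绿色产能评价研究:以吉林大安市为例 (Geochemical Assessment of Land Quality and Green Productivity Evaluation: A Case Study of Da'an City, Jilin)
TextRank: 尝试性地将农用地分等成果中的产能评价和土地质量地球化学评估中的元素含量评价结合 (As an exploratory step, the productivity evaluation from the agricultural land grading results is combined with the element content evaluation from the land quality geochemical assessment)
LSA: 研究区土地质量总体较好 (Land quality in the study area is good overall)
BiLSTM+Attention: 黑钙土地资源数量管理及其农用地分等研究——以吉林省大安市东南 (Chernozem land resource quantity management and its agricultural land grading: taking the southeast of Da'an City, Jilin Province)
BERT-UniLM: 大安市东南区域土地质量地球化学评估及农用地分等研究 (Geochemical assessment of land quality and agricultural land grading in the southeastern area of Da'an City)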
Examples of Different Methods
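Parameter | Type | Description | Example
Request parameter | list | List of abstracts of Chinese sci-tech papers | { "data": ["智能制造是制造技术与信息技术的结合,并朝着自动化、集成化、信息化、绿色化的趋势发展…", "金属有机框架 (Metal-Organic Frameworks,MOFs)是由有机配体与金属离子通过配位键形成的多孔结晶性聚合物,具有可调控的周期性孔道结构、…"] }
Response parameter | dict | Dictionary of title generation results | { 0: "智能制造中的状态监测技术", 1: "金属有机框架在生物医药领域中的应用" }
English gloss of the example: the two abstracts begin "Intelligent manufacturing combines manufacturing technology with information technology and is moving toward automation, integration, informatization, and greening…" and "Metal-organic frameworks (MOFs) are porous crystalline polymers formed by coordination bonds between organic ligands and metal ions, with tunable periodic pore structures, …"; the returned titles read "Condition Monitoring Technology in Intelligent Manufacturing" and "Applications of Metal-Organic Frameworks in Biomedicine".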
POST API Parameters Information
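A client for the interface in the table above might look like the following sketch. Only the request shape (a "data" list of abstracts) and the response shape (a dictionary from index to title) come from the table; the endpoint URL is a placeholder, and the JSON content type, the JSON-encoded response, and all function names are assumptions rather than the documented API.

```python
import json
import urllib.request
from typing import Dict, List

# Placeholder endpoint: the real service address is not given on this page.
API_URL = "http://example.org/title-generation"


def build_request(abstracts: List[str], url: str = API_URL) -> urllib.request.Request:
    """Wrap the abstracts in the {"data": [...]} body shown in the table."""
    body = json.dumps({"data": abstracts}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},  # assumed
        method="POST",
    )


def parse_titles(raw: bytes) -> Dict[int, str]:
    """Decode an assumed JSON response into {index: title}.

    JSON object keys are always strings, so they are converted back to the
    integer indices shown in the table; titles are stripped defensively.
    """
    payload = json.loads(raw.decode("utf-8"))
    return {int(key): title.strip() for key, title in payload.items()}


def generate_titles(abstracts: List[str], url: str = API_URL,
                    timeout: float = 30.0) -> Dict[int, str]:
    """Send the abstracts and return the generated titles (needs a live service)."""
    with urllib.request.urlopen(build_request(abstracts, url), timeout=timeout) as resp:
        return parse_titles(resp.read())


if __name__ == "__main__":
    request = build_request(["智能制造是制造技术与信息技术的结合…"])
    print(request.get_method(), request.full_url)
```

Keeping request construction and response parsing separate from the network call makes both testable without access to the service.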
A Demonstration of the Automatic Title Generation System’s Effect