Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (7): 107-117    DOI: 10.11925/infotech.2096-3467.2021.1307
STNLTP: Generating Chinese Patent Abstracts Based on Integrated Strategy
Zhang Le, Du Yifan, Lü Xueqiang, Dong Zhian
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes STNLTP, an abstract generation model for Chinese patents based on an integrated strategy, to reduce the repetition and long-document dependency problems of existing automatic summarization techniques. [Methods] First, we introduced a patent term dictionary and represented traditional Chinese medicine patents with sememe vectors from the SAT model. Second, following the integrated strategy, we used the TextRank, Lead4 and NMF models to extract key sentences from the patents. Third, we selected the optimal key sentences through clustering and redundancy removal. Finally, we fed these sentences to a Transformer-based pointer-generator network with character vectors to generate the abstracts. [Results] The model combines extractive and generative methods. Compared with the existing RLCPAR model, it improved ROUGE-1, ROUGE-2 and ROUGE-L by 2.00, 9.73 and 2.35 percentage points, respectively. [Limitations] The generated abstracts still contain some errors. [Conclusions] The STNLTP model can effectively generate Chinese patent abstracts.
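
The extraction stage described above can be pictured with a small sketch. It is not the authors' implementation: character-bigram overlap stands in for the sememe-based sentence vectors, NMF is omitted, and a greedy similarity filter stands in for the clustering and redundancy-removal step. All function names and the 0.6 threshold are assumptions made for illustration.

```python
from itertools import combinations


def char_bigrams(sentence):
    """Character bigrams; a stand-in for the paper's sememe-based vectors."""
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}


def jaccard(a, b):
    """Jaccard overlap between two bigram sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lead_k(sentences, k=4):
    """Lead-k baseline: indices of the first k sentences (Lead4 when k=4)."""
    return list(range(min(k, len(sentences))))


def textrank(sentences, k=4, damping=0.85, iterations=50):
    """Indices of the k top sentences under a plain TextRank power iteration."""
    n = len(sentences)
    grams = [char_bigrams(s) for s in sentences]
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = jaccard(grams[i], grams[j])
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def integrate(sentences, extractors, max_similarity=0.6):
    """Union the extractors' picks in document order, dropping near-duplicates."""
    candidates = sorted({i for extract in extractors for i in extract(sentences)})
    kept = []
    for i in candidates:
        grams = char_bigrams(sentences[i])
        if all(jaccard(grams, char_bigrams(sentences[j])) < max_similarity
               for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]
```

Calling `integrate(sentences, [lead_k, textrank])` returns the merged key sentences in document order. A further extractor, such as an NMF-based one, could be added to the list as another function that maps sentences to indices.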

Keywords: Patent Abstract; Sememe; Word Vector; Character Vector; Pointer-Generator Network
Received: 17 November 2021      Published: 01 March 2022
CLC Number (ZTFLH): TP391
Fund: National Natural Science Foundation of China (62171043)
Corresponding Author: Lü Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: lxq@bistu.edu.cn

Cite this article:

Zhang Le, Du Yifan, Lü Xueqiang, Dong Zhian. STNLTP: Generating Chinese Patent Abstracts Based on Integrated Strategy. Data Analysis and Knowledge Discovery, 2022, 6(7): 107-117.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1307     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I7/107

Framework for Chinese Patent Abstract Generation
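Original sentence: 本发明涉及一种秸秆的加工方法,尤其涉及一种玉米杆饮料及其加工方法。 (The invention relates to a method for processing straw, and in particular to a corn stalk beverage and its processing method.)
Segmentation result: 本发明 涉及 秸秆 加工 方法 涉及 玉米杆 饮料 及其 加工 方法 (the invention / relates to / straw / processing / method / relates to / corn stalk / beverage / and its / processing / method)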
Examples of Sentence Preprocessing
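
The preprocessing example can be reproduced with a minimal sketch. This page does not say which segmenter or stop-word list the authors used, so the code assumes forward maximum matching against a toy term dictionary, followed by stop-word removal. The dictionary, the stop-word list and the function names are all hypothetical.

```python
# Toy dictionary standing in for a general lexicon plus patent terms (assumption).
DICTIONARY = {"本发明", "涉及", "一种", "秸秆", "加工", "方法",
              "尤其", "玉米杆", "饮料", "及其"}
# Hypothetical stop-word list: function words and punctuation.
STOP_WORDS = {"一种", "的", "尤其", ",", ",", "。"}


def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, dictionary))
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                pos += size
                break
    return tokens


def preprocess(text):
    """Segment the sentence, then drop stop words and punctuation."""
    return [t for t in forward_max_match(text, DICTIONARY) if t not in STOP_WORDS]


sentence = "本发明涉及一种秸秆的加工方法,尤其涉及一种玉米杆饮料及其加工方法。"
print(" ".join(preprocess(sentence)))
# prints: 本发明 涉及 秸秆 加工 方法 涉及 玉米杆 饮料 及其 加工 方法
```

Adding domain terms such as 玉米杆 to the dictionary keeps them as single tokens, which is the role a patent term dictionary plays in this kind of pipeline.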
The Sememe Attention Model Based on Target Words
Transformer-based Pointer Generation Network
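
For orientation, the core of a standard pointer-generator decoder step is a mixture of a vocabulary distribution and a copy distribution over source tokens. The sketch below shows only that mixture, with plain dictionaries and lists. How the paper's Transformer variant computes the generation probability and the attention weights is not given on this page, so `p_gen`, `vocab_dist` and `attention` are treated as given inputs, and the function name is an assumption.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copying: p_gen * P_vocab + (1 - p_gen) * copy mass.

    vocab_dist maps vocabulary tokens to probabilities; attention holds one
    weight per source token. Source tokens missing from the vocabulary can
    still receive probability through the copy term.
    """
    final = {token: p_gen * p for token, p in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1 - p_gen) * weight
    return final


# Illustrative numbers only; "枸杞" is out of vocabulary and is reachable by copying.
dist = final_distribution(
    p_gen=0.6,
    vocab_dist={"茶": 0.5, "方法": 0.5},
    attention=[0.5, 0.25, 0.25],
    source_tokens=["枸杞", "茶", "枸杞"],
)
```

Because attention weights of repeated source tokens are summed, a term that appears several times in the input accumulates copy probability.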
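Field: Example
Original title: Wolfberry tea and its preparation method (枸杞茶及其制备方法)
Original abstract: The invention discloses a wolfberry tea and a preparation method thereof. The tea is made entirely from tender wolfberry leaves and takes the form of green strips, with moisture content <7.0%, carotene content >0.004.0%, ash content <9.0%, water extract >25%, and powder <4.0%. The tender wolfberry leaves are made into wolfberry tea through seven processes: spreading and cooling, steam fixation, spreading and cooling of the fixed leaves, initial pan-firing, rolling, second pan-firing with strip shaping, and hot-air drying. The tea has the distinctive effects of tonifying the kidney and replenishing essence, clearing heat and quenching thirst, dispelling wind and improving eyesight, and nourishing and moisturizing the skin, and is a new type of health tea.
Original specification (excerpt): Wolfberry tea and its preparation method. The invention relates to a tea and a preparation method thereof. Existing teas are all made from tender tea-tree leaves, and the preparation generally uses six steps: pan-fire fixation, initial baking, rolling, initial pan-firing, re-rolling, and pan-drying. The first step, pan-fire fixation, tends to heat the raw material unevenly, so that some leaves or leaf edges turn dark brown, which lowers tea quality and in severe cases makes rolling impossible and the product unusable. The object of the invention is to provide a tea made from tender wolfberry leaves and a method for preparing it. The wolfberry tea of the invention is made entirely from tender wolfberry leaves and takes the form of green strips, with moisture content <7.0%, carotene content ……
Human-written abstract: A method for preparing wolfberry tea. Fresh, tender wolfberry leaves are spread out to cool so that the leaves soften. The leaves are then spread to a thickness of 2-3 cm and fixed in a steam fixation pot for 3-5 minutes, spread on a bamboo mat to cool, and, once cool, pan-fired for the first time in a clay pot at 120-130℃ for 3-5 minutes. They are taken out and left to cool, rolled in a rolling machine for 25-30 minutes, then pan-fired again and shaped into strips at a pot temperature of 100-90℃ for 10-12 minutes, and finally dried with hot air until the moisture content is below 7%, which yields the wolfberry tea. The tea has the distinctive effects of tonifying the kidney and replenishing essence, clearing heat and quenching thirst, dispelling wind and improving eyesight, and nourishing and moisturizing the skin, and is a health tea.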
Examples of Patent Data
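Parameter Value
Encoder maximum sentence length 512
Decoder maximum sentence length 256
Training batch size 16
Validation batch size 128
Learning rate 0.001
Hidden layer dimension 256
Maximum gradient norm 1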
Parameter Setting
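
The settings above could be collected in a small configuration object, sketched below. The field names are hypothetical, and the training framework is not stated on this page. The sketch also assumes that "maximum gradient norm" refers to clipping gradients by their global L2 norm, which is shown here on a flat list of values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values from the parameter table; field names are illustrative."""
    encoder_max_len: int = 512
    decoder_max_len: int = 256
    train_batch_size: int = 16
    valid_batch_size: int = 128
    learning_rate: float = 0.001
    hidden_dim: int = 256
    max_grad_norm: float = 1.0


def clip_gradients(grads, max_norm):
    """Rescale a flat gradient list so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


config = TrainingConfig()
clipped = clip_gradients([3.0, 4.0], config.max_grad_norm)  # original norm is 5.0
```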
Model ROUGE-1/% ROUGE-2/% ROUGE-L/%
Baseline 55.84 42.52 47.75
PGN+RL 44.85 26.00 36.15
FASRS 54.84 36.48 48.21
RLCPAR 55.89 36.96 49.73
STNLTP-Text 55.48 41.05 45.51
STNLTP 57.89 46.69 52.08
Experimental Results
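
As a quick check on the figures in this table, the gains of STNLTP over RLCPAR reported in the abstract are absolute differences in ROUGE score, that is, percentage points rather than relative improvements. The snippet below recomputes both kinds of gain from the table rows; the function names are illustrative.

```python
# ROUGE-1, ROUGE-2, ROUGE-L (in %) copied from the results table.
ROUGE = {
    "RLCPAR": (55.89, 36.96, 49.73),
    "STNLTP": (57.89, 46.69, 52.08),
}


def point_gains(baseline, model):
    """Absolute differences, in percentage points."""
    return [round(m - b, 2) for b, m in zip(baseline, model)]


def relative_gains(baseline, model):
    """Differences relative to the baseline score, in percent."""
    return [round(100 * (m - b) / b, 2) for b, m in zip(baseline, model)]


print(point_gains(ROUGE["RLCPAR"], ROUGE["STNLTP"]))
# prints: [2.0, 9.73, 2.35]
print(relative_gains(ROUGE["RLCPAR"], ROUGE["STNLTP"]))
```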
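Item: Content
Original abstract: The invention relates to a traditional Chinese medicine prescription for treating summer-heat colds, in the field of traditional Chinese medicine preparations. The technical solution comprises the following raw materials in parts by weight: Xiangru (香薷) 10–15, Huoxiang (藿香) 10–15, Peilan (佩兰) 10–15, Suye (苏叶) 10–15, Yinhua (银花) 15–20, Lianqiao (连翘) 10–15, Banlangen (板蓝根) 30–35, Daqingye (大青叶) 30–35, Qinghao (青蒿) 15–20, Chuanpo (川朴) 5–10, Jisu San (鸡苏散) 10–15, Zhizi (栀子) 6–9. The invention has the effects of clearing summer heat and draining dampness, releasing the exterior with pungent-warm herbs, and resolving dampness with aromatics. In treating summer-heat colds, the effective rate reaches 96% and the cure rate 85%.
Generated abstract: A traditional Chinese medicine method for treating summer-heat colds, comprising the following raw materials: Xiangru, Huoxiang, Peilan, Suye, Yinhua, Lianqiao, Banlangen, Daqingye, Qinghao, Chuanpo, Jisu San and Zhizi. The preparation method is as follows: the above raw materials are decocted in water and filtered, and the filtrate is concentrated into a decoction. The medicine has the effects of clearing summer heat and draining dampness, releasing the exterior with pungent-warm herbs, and resolving dampness with aromatics.
Human-written abstract: A traditional Chinese medicine prescription: Xiangru, Huoxiang, Peilan, Suye, Yinhua, Lianqiao, Banlangen (板蓝根, also written 板兰根), Daqingye, Qinghao, Chuanpo, Jisu San and Zhizi. The prescription has the effects of clearing summer heat and draining dampness, releasing the exterior with pungent-warm herbs, and resolving dampness with aromatics, and is used to treat summer-heat colds. For severe aversion to cold with headache, add Chuanxiong (川芎) and Manjingzi (蔓荆子). For aching joints throughout the body, add Qinjiao (秦艽) and Dadoujuan (大豆卷). For nausea and vomiting, add Chenpi (陈皮) and Fabanxia (法半夏). For epigastric fullness and fatigue, add Cangzhu (苍术) and Yiyiren (薏苡仁). For vexation and chest tightness, add Chuanlian (川连) and Guangyujin (广郁金). For loose stools, add Cangbaizhu (苍白术), Shanzha (山楂) and Shenqu (神曲).
Analysis: The generated abstract is more concise than the original abstract, and its information is more complete.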
……
Examples of Patent Summary Results
Model ROUGE-1/% ROUGE-2/% ROUGE-L/%
TextRank 32.80 16.86 20.73
TextRank_sememe 36.08 20.38 22.86
Lead4 35.19 18.09 22.72
NMF 35.72 21.81 26.12
Clustering + deduplication 46.66 30.68 31.28
Comparison of Extraction Model Effects
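
The ROUGE columns in these tables can be made concrete with a simplified sketch. It computes character-level ROUGE-N and ROUGE-L as F1 scores between a candidate and a single reference. It is not the official ROUGE package, and the tokenization and score variant (recall or F1) used in the paper are not stated on this page, so both choices here are assumptions.

```python
from collections import Counter


def ngrams(text, n):
    """Multiset of character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, 1):
            current.append(previous[j - 1] + 1 if x == y
                           else max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))
```

ROUGE-2 rewards matching adjacent pairs and ROUGE-L rewards in-order matches, so an extractive method that picks the right sentences in the wrong order can score well on ROUGE-1 and noticeably lower on the other two.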
Model ROUGE-1/% ROUGE-2/% ROUGE-L/%
STNLTP-Text 55.48 41.05 45.51
-Transformer 41.00 25.10 32.19
-Clustering 20.55 7.12 14.79
Results of Ablation Experiments