An Overview of Research on Multi-Document Summarization

Bao Ritong 1, Sun Haichun 1,2

1 School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
2 Key Laboratory of Security Technology & Risk Assessment, People's Public Security University of China, Beijing 100026, China
Abstract [Objective] This paper reviews the literature on multi-document summarization, examining the field's research frameworks and mainstream models. [Coverage] We searched the AI Open Index, Papers with Code, and CNKI databases with the queries "multi-document summarization" and "多文档摘要", retrieving a total of 76 representative articles. [Methods] We summarize the mainstream research frameworks, the latest models, and the algorithms of multi-document summarization, and present prospects for future studies. [Results] This paper compares the strengths and weaknesses of the latest multi-document summarization models against traditional methods. We also summarize high-quality multi-document summarization datasets and current evaluation metrics. [Limitations] We only discuss the evaluation results of some popular models on the Multi-News dataset, lacking a comparison of all models on a single dataset. [Conclusions] Many challenges remain in multi-document summarization, including the low factual accuracy of generated summaries and the poor generality of current models.
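The evaluation metrics the review surveys (ROUGE [70], BLEU [72], SUPERT [73]) mostly score a system summary by n-gram overlap with human references. As a minimal illustration only (the function name, whitespace tokenization, and recall-only formulation are our own simplifications, not taken from any of the cited papers), ROUGE-N recall can be sketched as:

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of the reference's n-grams that also
    appear in the candidate summary (clipped by candidate counts)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:  # empty reference: overlap is undefined, report 0
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values())

# Example: 5 of the reference's 6 unigrams appear in the candidate.
score = rouge_n_recall("the cat is on the mat", "the cat sat on the mat", n=1)
```

Production evaluations use the official ROUGE toolkit (with stemming, stopword options, and F-measure variants) rather than a hand-rolled counter like this.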
|
Received: 22 November 2022
Published: 11 April 2023
|
|
Fund: Ministry of Public Security Technology Research Program (2020JSYJC22); Beijing Municipal Natural Science Foundation Program (4184099)

Corresponding Author: Sun Haichun, E-mail: sunhaichun@ppsuc.edu.cn
|
[1] Luhn H P. The Automatic Creation of Literature Abstracts[J]. IBM Journal of Research and Development, 1958, 2(2): 159-165. doi: 10.1147/rd.22.0159
[2] Nallapati R, Zhou B W, dos Santos C, et al. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond[C]// Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Stroudsburg, PA, USA: Association for Computational Linguistics, 2016: 280-290.
[3] Zopf M. Auto-HMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus[C]// Proceedings of the 11th International Conference on Language Resources and Evaluation. European Language Resources Association, 2018.
[4] Liu P J, Saleh M, Pot E, et al. Generating Wikipedia by Summarizing Long Sequences[OL]. arXiv Preprint, arXiv: 1801.10198.
[5] Fabbri A, Li I, She T W, et al. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 1074-1084.
[6] Gu J T, Lu Z D, Li H, et al. Incorporating Copying Mechanism in Sequence-to-Sequence Learning[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 1631-1640.
[7] Tu Z P, Lu Z D, Liu Y, et al. Modeling Coverage for Neural Machine Translation[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 76-85.
[8] Yasunaga M, Zhang R, Meelu K, et al. Graph-Based Neural Multi-Document Summarization[C]// Proceedings of the 21st Conference on Computational Natural Language Learning. 2017: 452-462.
[9] Paulus R, Xiong C M, Socher R. A Deep Reinforced Model for Abstractive Summarization[OL]. arXiv Preprint, arXiv: 1705.04304.
[10] Cho S, Lebanoff L, Foroosh H, et al. Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 1027-1038.
[11] Huang Wenbin, Ni Shaokang. Study of the Development of Multi-Document Automatic Summarization[J]. Information Science, 2017, 35(4): 160-165. (in Chinese)
[12] Li Jinpeng, Zhang Chuang, Chen Xiaojun, et al. Survey on Automatic Text Summarization[J]. Journal of Computer Research and Development, 2021, 58(1): 1-21. (in Chinese)
[13] Jalil Z, Nasir J A, Nasir M. Extractive Multi-Document Summarization: A Review of Progress in the Last Decade[J]. IEEE Access, 2021, 9: 130928-130946. doi: 10.1109/ACCESS.2021.3112496
[14] Hu B T, Chen Q C, Zhu F Z. LCSTS: A Large Scale Chinese Short Text Summarization Dataset[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2015: 1967-1972.
[15] Grusky M, Naaman M, Artzi Y. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 708-719.
[16] Narayan S, Cohen S B, Lapata M. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 1797-1807.
[17] Xu R X, Cao J, Wang M X, et al. Xiaomingbot: A Multilingual Robot News Reporter[OL]. arXiv Preprint, arXiv: 2007.08005.
[18] Yasunaga M, Kasai J, Zhang R, et al. ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7386-7393. doi: 10.1609/aaai.v33i01.33017386
[19] Katsimpras G, Paliouras G. Predicting Intervention Approval in Clinical Trials Through Multi-Document Summarization[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022: 1947-1957.
[20] Zeng Zhaolin, Yan Xin, Xu Guangyi, et al. Query-Oriented News Multi-Document Extractive Summarization Method Based on Hierarchical BiGRU+Attention[J]. Journal of Chinese Computer Systems, 2023, 44(1): 185-192. (in Chinese)
[21] Zhao C, Huang T H, Chowdhury S B R, et al. Read Top News First: A Document Reordering Approach for Multi-Document News Summarization[OL]. arXiv Preprint, arXiv: 2203.10254.
[22] Jin H Q, Wang T M, Wan X J. Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 6244-6254.
[23] Antognini D, Faltings B. Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization[OL]. arXiv Preprint, arXiv: 1909.12231.
[24] Zhao J M, Liu M, Gao L X, et al. SummPip: Unsupervised Multi-Document Summarization with Sentence Graph Compression[C]// Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020: 1949-1952.
[25] Wang D Q, Liu P F, Zheng Y N, et al. Heterogeneous Graph Neural Networks for Extractive Document Summarization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 6209-6219.
[26] Zhou H, Ren W D, Liu G S, et al. Entity-Aware Abstractive Multi-Document Summarization[C]// Proceedings of the 2021 International Joint Conference on Natural Language Processing. 2021: 351-362.
[27] Cui P, Hu L. Topic-Guided Abstractive Multi-Document Summarization[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021: 1463-1472.
[28] Li W, Xiao X Y, Liu J C, et al. Leveraging Graph to Improve Abstractive Multi-Document Summarization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 6232-6243.
[29] Hickmann M L, Wurzberger F, Hoxhalli M, et al. Analysis of GraphSum's Attention Weights to Improve the Explainability of Multi-Document Summarization[C]// Proceedings of the 23rd International Conference on Information Integration and Web Intelligence. ACM, 2021: 359-366.
[30] Chen M Y, Li W, Liu J C, et al. SgSum: Transforming Multi-Document Summarization into Sub-Graph Selection[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2021: 4063-4074.
[31] Nayeem M T, Fuad T A, Chali Y. Abstractive Unsupervised Multi-Document Summarization Using Paraphrastic Sentence Fusion[C]// Proceedings of the 27th International Conference on Computational Linguistics. 2018: 1191-1204.
[32] Alambo A, Lohstroh C, Madaus E, et al. Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles[C]// Proceedings of the 2020 IEEE International Conference on Big Data. IEEE, 2020: 591-596.
[33] Saeed M Y, Awais M, Talib R, et al. Unstructured Text Documents Summarization with Multi-Stage Clustering[J]. IEEE Access, 2020, 8: 212838-212854.
[34] Ernst O, Caciularu A, Shapira O, et al. A Proposition-Level Clustering Approach for Multi-Document Summarization[C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022: 1765-1779.
[35] Alqaisi R, Ghanem W, Qaroush A. Extractive Multi-Document Arabic Text Summarization Using Evolutionary Multi-Objective Optimization with K-Medoid Clustering[J]. IEEE Access, 2020, 8: 228206-228224. doi: 10.1109/ACCESS.2020.3046494
[36] Coavoux M, Elsahar H, Gallé M. Unsupervised Aspect-Based Multi-Document Abstractive Summarization[C]// Proceedings of the 2nd Workshop on New Frontiers in Summarization. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019: 42-47.
[37] Brazinskas A, Lapata M, Titov I. Unsupervised Multi-Document Opinion Summarization as Copycat-Review Generation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 5151-5169.
[38] Pasunuru R, Liu M W, Bansal M, et al. Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021: 4768-4779.
[39] Liu Y, Lapata M. Hierarchical Transformers for Multi-Document Summarization[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 5050-5081.
[40] Perez-Beltrachini L, Lapata M. Multi-Document Summarization with Determinantal Point Process Attention[J]. Journal of Artificial Intelligence Research, 2021, 71: 371-399. doi: 10.1613/jair.1.12522
[41] Liu S Q, Cao J N, Yang R S, et al. Highlight-Transformer: Leveraging Key Phrase Aware Attention to Improve Abstractive Multi-Document Summarization[C]// Proceedings of the 2021 International Joint Conference on Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2021: 5021-5027.
[42] Ma C B, Zhang W E, Wang H, et al. Incorporating Linguistic Knowledge for Abstractive Multi-Document Summarization[C]// Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation. 2022: 147-156.
[43] Kim S. Using Pre-Trained Transformer for Better Lay Summarization[C]// Proceedings of the 1st Workshop on Scholarly Document Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020: 328-335.
[44] Zou Y Y, Zhang X X, Lu W, et al. Pre-Training for Abstractive Document Summarization by Reinstating Source Text[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020: 3646-3660.
[45] Aghajanyan A, Okhonko D, Lewis M, et al. HTLM: Hyper-Text Pre-Training and Prompting of Language Models[OL]. arXiv Preprint, arXiv: 2107.06955.
[46] Goodwin T, Savery M, Demner-Fushman D. Flight of the PEGASUS? Comparing Transformers on Few-Shot and Zero-Shot Multi-Document Abstractive Summarization[C]// Proceedings of the 28th International Conference on Computational Linguistics. 2020: 5640-5646.
[47] Lewis M, Liu Y H, Goyal N, et al. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7871-7880.
[48] Raffel C, Shazeer N M, Roberts A, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[J]. Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
[49] Zhang J Q, Zhao Y, Saleh M, et al. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization[C]// Proceedings of the 37th International Conference on Machine Learning. 2020: 11328-11339.
[50] Beltagy I, Peters M E, Cohan A. Longformer: The Long-Document Transformer[OL]. arXiv Preprint, arXiv: 2004.05150.
[51] Moro G, Ragazzi L, Valgimigli L, et al. Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022: 180-189.
[52] Zaheer M, Guruganesh G, Dubey A, et al. Big Bird: Transformers for Longer Sequences[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. ACM, 2020: 17283-17297.
[53] Guo M, Ainslie J, Uthus D, et al. LongT5: Efficient Text-to-Text Transformer for Long Sequences[C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics. 2022: 724-736.
[54] Xiao W, Beltagy I, Carenini G, et al. PRIMERA: Pyramid-Based Masked Sentence Pre-Training for Multi-Document Summarization[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022: 5245-5263.
[55] Puduppully R, Steedman M. Multi-Document Summarization with Centroid-Based Pretraining[OL]. arXiv Preprint, arXiv: 2208.01006.
[56] Goldstein J, Carbonell J. Summarization: (1) Using MMR for Diversity-Based Reranking and (2) Evaluating Summaries[C]// TIPSTER'98: Proceedings of a Workshop Held at Baltimore, Maryland. 1998.
[57] Akhtar N, Beg M M S, Hussain M M, et al. Extractive Multi-Document Summarization Using Relative Redundancy and Coherence Scores[J]. Journal of Intelligent & Fuzzy Systems, 2020, 38(5): 6201-6210.
[58] See A, Liu P J, Manning C D. Get to the Point: Summarization with Pointer-Generator Networks[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 1073-1083.
[59] Lebanoff L, Song K Q, Liu F. Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 4131-4141.
[60] Chen Z J, Xu J, Liao M, et al. Two-Phase Multi-Document Event Summarization on Core Event Graphs[J]. Journal of Artificial Intelligence Research, 2022, 74: 1037-1057. doi: 10.1613/jair.1.13267
[61] Su A, Su D F, Mulvey J M, et al. PoBRL: Optimizing Multi-Document Summarization by Blending Reinforcement Learning Policies[OL]. arXiv Preprint, arXiv: 2105.08244.
[62] Song Y Z, Chen Y S, Shuai H H. Improving Multi-Document Summarization Through Referenced Flexible Extraction with Credit-Awareness[C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022: 1667-1681.
[63] Parnell J, Unanue I J, Piccardi M. A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022: 5112-5128.
[64] Lu Y, Dong Y, Charlin L. Multi-XScience: A Large-Scale Dataset for Extreme Multi-Document Summarization of Scientific Articles[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020: 8068-8074.
[65] Ghalandari D G, Hokamp C, Pham N T, et al. A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 1302-1308.
[66] Boni O, Feigenblat G, Lev G, et al. HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles[OL]. arXiv Preprint, arXiv: 2110.03179.
[67] Xu Y M, Lapata M. Coarse-to-Fine Query Focused Multi-Document Summarization[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020: 3632-3645.
[68] Xu Y M, Lapata M. Query Focused Multi-Document Summarization with Distant Supervision[OL]. arXiv Preprint, arXiv: 2004.03027.
[69] Nenkova A, Passonneau R, McKeown K. The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation[J]. ACM Transactions on Speech and Language Processing, 2007, 4(2): Article No. 4.
[70] Lin C Y, Hovy E. Automatic Evaluation of Summaries Using N-Gram Co-Occurrence Statistics[C]// Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. ACM, 2003: 71-78.
[71] Lin C Y, Hovy E. The Automated Acquisition of Topic Signatures for Text Summarization[C]// Proceedings of the 18th Conference on Computational Linguistics. ACM, 2000: 495-501.
[72] Papineni K, Roukos S, Ward T, et al. BLEU: A Method for Automatic Evaluation of Machine Translation[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. ACM, 2002: 311-318.
[73] Gao Y, Zhao W, Eger S. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 1347-1354.
[74] Wolhandler R, Cattan A, Ernst O, et al. How "Multi" is Multi-Document Summarization?[C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022: 5761-5769.
[75] Gehrmann S, Deng Y T, Rush A. Bottom-Up Abstractive Summarization[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 4098-4109.
[76] Liang Mengying, Li Deyu, Wang Suge, et al. Senti-PG-MMR: Research on Generation Method of Sentimental Summary of Multi-Document Travel Notes[J]. Journal of Chinese Information Processing, 2022, 36(3): 128-135. (in Chinese)