Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (1): 113-127    DOI: 10.11925/infotech.2096-3467.2022.0402
A Modified Hybrid Method to Identify Cited Spans
Nie Weimin, Ou Shiyan
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a new method for identifying cited spans. It aims to address the shortcomings of existing unsupervised models and to extend the granularity of identification from a single sentence to several adjacent sentences. [Methods] First, we built a modified hybrid method that uses supervised ranking to select candidate sentences from all sentences of the cited paper. Second, we applied a regression technique to determine which candidate sentences contain the cited span. Third, we grouped adjacent sentences of the cited paper into units called n-sent and used them as inputs to the modified hybrid method. Finally, we applied intraclass normalization to identify the cited spans. [Results] The modified hybrid method achieved a sentence-overlap F1 of 0.167 on the test set of CL-SciSumm 2019 and 2020. With 3-sent as input, intraclass Z-score normalization raised the sentence-overlap F1 of the modified hybrid method from 0.083 to 0.158. [Limitations] The modified hybrid method does not use the positions of sentences within the cited paper. In addition, the prospects for applying the proposed method to downstream tasks remain unclear. [Conclusions] The proposed method can effectively identify cited spans whose granularity ranges from a single sentence to multiple adjacent sentences.

Key words: Scientific Literature; Cited Spans; Supervised Ranking; Regression; Intraclass Normalization
Received: 26 April 2022      Published: 16 February 2023
ZTFLH:  G353 TP391  
Fund: National Social Science Fund of China (17ATQ001)
Corresponding Author: Ou Shiyan, ORCID: 0000-0001-8617-6987, E-mail: oushiyan@nju.edu.cn

Cite this article:

Nie Weimin, Ou Shiyan. A Modified Hybrid Method to Identify Cited Spans. Data Analysis and Knowledge Discovery, 2023, 7(1): 113-127.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0402     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I1/113

An Example of Citation Relation
Process of Identifying Cited Spans
Supervised Ranking Process Based on SBERT
Determination Process of Cited Spans Based on RBERT
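Sentence composition of cited span    Count    Share
Non-consecutive single sentences      594      78.88%
Two consecutive sentences             123      16.34%
Three consecutive sentences           29       3.85%
Four consecutive sentences            5        0.66%
Five consecutive sentences            2        0.27%
Total                                 753      100.00%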
Statistics of the Composition of Cited Spans
Input Mode of n-Sent
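The abstract describes n-sent as grouped adjacent sentences of the cited paper. A minimal sketch of one way to build such groups follows. It assumes a sliding window with a stride of one sentence and 0-based sentence indices; the function name `make_n_sents` is hypothetical, and the source does not state whether the windows overlap.

```python
from typing import List, Tuple


def make_n_sents(sentences: List[str], n: int) -> List[Tuple[Tuple[int, ...], str]]:
    """Group adjacent sentences into windows of n sentences (assumed stride: 1).

    Returns a list of (sentence_indices, joined_text) pairs, so that a score
    given to a window can later be mapped back to its member sentences.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    windows = []
    # The last valid window starts at len(sentences) - n.
    for start in range(len(sentences) - n + 1):
        indices = tuple(range(start, start + n))
        text = " ".join(sentences[start:start + n])
        windows.append((indices, text))
    return windows


if __name__ == "__main__":
    doc = ["S0.", "S1.", "S2.", "S3."]
    for indices, text in make_n_sents(doc, 3):
        print(indices, text)
```

With n = 1 the function returns every sentence on its own, which corresponds to the single-sentence setting.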
Pretrained language model     Top-N   SO-P    SO-R    SO-F    ROUGE-P   ROUGE-R   ROUGE-F
all-MPNet-base-v2             Top1    0.048   0.180   0.075   0.242     0.069     0.101
all-MPNet-base-v2             Top2    0.046   0.215   0.075   0.244     0.086     0.116
all-MPNet-base-v2             Top3    0.046   0.130   0.068   0.248     0.081     0.113
multi-qa-MPNet-base-dot-v1    Top1    0.122   0.115   0.118   0.148     0.165     0.146
multi-qa-MPNet-base-dot-v1    Top2    0.108   0.204   0.142   0.254     0.080     0.113
multi-qa-MPNet-base-dot-v1    Top3    0.088   0.249   0.130   0.311     0.047     0.076
all-MiniLM-L6-v2              Top1    0.127   0.120   0.124   0.242     0.084     0.114
all-MiniLM-L6-v2              Top2    0.097   0.182   0.126   0.244     0.069     0.101
all-MiniLM-L6-v2              Top3    0.122   0.115   0.118   0.224     0.078     0.106
Performance of the SBERT Employing Various Pretrained Language Models
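The SO-P, SO-R and SO-F columns report sentence-overlap precision, recall and F1. A minimal sketch of how such scores can be computed for a single citance follows, assuming predictions and gold annotations are given as sentence IDs of the cited paper. The function name `sentence_overlap_prf` is hypothetical, and the official CL-SciSumm scorer may aggregate over citances differently.

```python
from typing import Iterable, Tuple


def sentence_overlap_prf(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Sentence-overlap precision, recall and F1 for one citance.

    predicted: sentence IDs returned by the system.
    gold: sentence IDs annotated as the cited span.
    """
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # One of three predicted sentences is correct; one of two gold sentences is found.
    p, r, f = sentence_overlap_prf([3, 4, 7], [4, 5])
    print(round(p, 3), round(r, 3), round(f, 3))  # prints 0.333 0.5 0.4
```

Returning more sentences per citance can only keep or raise recall, while precision tends to fall as the extra sentences are less likely to be correct.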
Pretrained language model     Top-N   SO-P    SO-R    SO-F    ROUGE-P   ROUGE-R   ROUGE-F
BERT-base-uncased             Top1    0.117   0.221   0.153   0.298     0.067     0.104
BERT-base-uncased             Top2    0.122   0.231   0.160   0.307     0.072     0.111
BERT-base-uncased             Top3    0.104   0.296   0.154   0.386     0.040     0.069
SciBERT                       Top1    0.105   0.199   0.138   0.277     0.068     0.102
SciBERT                       Top2    0.116   0.220   0.152   0.299     0.072     0.108
SciBERT                       Top3    0.114   0.215   0.149   0.290     0.064     0.099
ALBERT-base-v2                Top1    0.051   0.095   0.066   0.156     0.034     0.052
ALBERT-base-v2                Top2    0.083   0.157   0.109   0.220     0.049     0.075
ALBERT-base-v2                Top3    0.073   0.138   0.096   0.187     0.053     0.077
RoBERTa-base                  Top1    0.025   0.047   0.033   0.096     0.014     0.023
RoBERTa-base                  Top2    0.111   0.209   0.144   0.289     0.064     0.099
RoBERTa-base                  Top3    0.096   0.093   0.125   0.255     0.052     0.082
Performance of the RBERT Employing Various Pretrained Language Models
Performance of the Modified Hybrid Method Under Various Combinations of m and n
System                                  SO-P    SO-R    SO-F    ROUGE-P   ROUGE-R   ROUGE-F
PINGAN TECH                             0.132   0.246   0.172   0.298     0.113     0.147
Modified hybrid method (this study)     0.128   0.242   0.167   0.312     0.075     0.115
uniHD                                   0.116   0.260   0.161   0.317     0.085     0.113
RBERT model (this study)                0.122   0.231   0.160   0.307     0.072     0.111
SBERT model (this study)                0.108   0.204   0.142   0.254     0.080     0.113
CMU                                     0.087   0.246   0.128   0.307     0.049     0.075
NaCTeM-UoM                              /       /       0.126   /         /         0.075
NJU                                     /       /       0.124   /         /         0.090
BUPT                                    /       /       0.106   /         /         0.034
Comparison Between Distinct Systems
n-sent    SO-P    SO-R    SO-F    ROUGE-P   ROUGE-R   ROUGE-F
1-sent    0.128   0.242   0.167   0.312     0.075     0.115
2-sent    0.064   0.206   0.098   0.295     0.038     0.061
3-sent    0.056   0.156   0.083   0.240     0.037     0.080
Performance of the Modified Hybrid Method with n-Sent as Input
Normalization method      SO-P    SO-R    SO-F    ROUGE-P   ROUGE-R   ROUGE-F
Mean normalization        0.104   0.218   0.141   0.293     0.071     0.108
Min-max normalization     0.061   0.185   0.092   0.279     0.041     0.065
Z-score normalization     0.119   0.233   0.158   0.311     0.074     0.113
Performance of the Modified Hybrid Method After Distinct Normalization Methods
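The three normalization methods compared above can be sketched as follows, using their standard textbook definitions. This is an illustration under assumptions, not the authors' implementation: the page does not spell out what a "class" is, so the sketch normalizes scores within arbitrary groups given by a label per score (for example, the n of the n-sent that produced it). The function name `normalize_within_groups` and the method keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Hashable, List, Sequence


def normalize_within_groups(scores: Sequence[float],
                            groups: Sequence[Hashable],
                            method: str = "z") -> List[float]:
    """Normalize each score using statistics of its own group only.

    method: "z"      -> (x - mean) / population standard deviation
            "minmax" -> (x - min) / (max - min)
            "mean"   -> (x - mean) / (max - min)
    A group with zero spread maps to 0.0 to avoid division by zero.
    """
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)

    # Per-group statistics: mean, standard deviation, minimum, maximum.
    stats = {g: (mean(v), pstdev(v), min(v), max(v)) for g, v in by_group.items()}

    normalized = []
    for score, group in zip(scores, groups):
        mu, sigma, lo, hi = stats[group]
        if method == "z":
            num, denom = score - mu, sigma
        elif method == "minmax":
            num, denom = score - lo, hi - lo
        elif method == "mean":
            num, denom = score - mu, hi - lo
        else:
            raise ValueError(f"unknown method: {method}")
        normalized.append(num / denom if denom else 0.0)
    return normalized


if __name__ == "__main__":
    raw = [1.0, 3.0, 10.0, 20.0]
    labels = [1, 1, 3, 3]  # hypothetical group labels, e.g. 1-sent vs 3-sent
    print(normalize_within_groups(raw, labels, "z"))
```

Normalizing within each group puts scores from groups with different ranges on a comparable scale before they are ranked together.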