Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 25-33    DOI: 10.11925/infotech.2096-3467.2021.0226
Extracting Citation Contents with Coreference Resolution
Tan Ying1, Tang Yifei2
1School of Public Administration, Hubei University, Wuhan 430062, China
2School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract  

[Objective] This paper aims to extract citations and their context from scientific articles accurately, in order to improve the results of citation analysis. [Methods] We divide the citation extraction task into three subtasks: citation sentence extraction, citation context identification, and citation metadata extraction. We then propose a coreference resolution-based method to identify and extract the citation context. [Results] We tested the method on Chinese journals that use the sequential (numeric) citation system. Citation sentences and references were extracted correctly, and the F1 value for citation context identification ranged from 0.780 to 0.849. [Limitations] Because Chinese scientific citation corpora are limited and the experimental data set is small, the proposed method may not work as well in other fields. [Conclusions] The study streamlines the steps of citation content analysis and broadens the scope of its data, which provides support for researchers working on citation content analysis.

Keywords: Information Extraction; Coreference Resolution; Citation Content; Citation Context
Received: 08 March 2021      Published: 15 September 2021
ZTFLH:  G250  
Fund: National Social Science Fund of China (19ZDA345)
Corresponding Authors: Tan Ying ORCID:0000-0002-7987-4696     E-mail: tanying1219@qq.com

Cite this article:

Tan Ying, Tang Yifei. Extracting Citation Contents with Coreference Resolution. Data Analysis and Knowledge Discovery, 2021, 5(8): 25-33.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0226     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I8/25

Framework of Citation Content Extraction
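| Category | Feature | Meaning |
| --- | --- | --- |
| Position features | Sentence position | Position and distance relationship between the citation sentence and the candidate context |
| Position features | Heading position | Whether the citation sentence and the candidate context fall under the same heading |
| Position features | Paragraph position | Whether the citation sentence and the candidate context are in the same paragraph |
| Position features | Position within paragraph | Relative position of the candidate sentence within its paragraph |
| Anaphora features | Third-person pronoun | Whether the sentence contains a third-person pronoun |
| Anaphora features | Demonstrative pronoun | Whether the sentence contains a demonstrative pronoun |
| Semantic features | Person name | Whether the sentence contains the name of a cited author |
| Semantic features | Document title | Whether the sentence contains the title of a cited document |
| Semantic features | Proper noun | Whether the sentences contain, respectively, the full name and the abbreviation of a domain term |
| Semantic features | Conjunction | Whether the sentence contains a conjunction |
| Citation features | Citation in candidate sentence | Whether the candidate context sentence contains a citation |
| Citation features | Number of citation markers | Number of citation markers in the target citation sentence |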
Features Used for Citation Context Identification
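The feature table above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the `Sentence` fields, the function names, and the plain substring matching are hypothetical, and the word lists are taken from the high-frequency words reported in the mention detection table. The heading-level, within-paragraph, document-title, and proper-noun features are omitted for brevity.

```python
import re
from dataclasses import dataclass

# Word lists taken from the high-frequency words reported in the source tables.
THIRD_PERSON = ("他们", "他", "她")
DEMONSTRATIVES = ("该", "其", "这", "此", "另")
CONJUNCTIONS = ("然而", "但", "此外", "总体而言")

# Assumed shape of a numeric citation marker: [3], [1,2], [4-6].
CITATION_MARKER = re.compile(r"\[\s*\d+(?:\s*[-,，–]\s*\d+)*\s*\]")


@dataclass
class Sentence:
    """Hypothetical container for one sentence of the citing article."""
    text: str
    index: int      # position of the sentence in the document
    paragraph: int  # id of the enclosing paragraph
    heading: int    # id of the enclosing section heading


def contains_any(text, words):
    """True if any of the given words occurs in the text (naive substring test)."""
    return any(word in text for word in words)


def context_features(citation, candidate, cited_authors=()):
    """Build a feature dictionary for one (citation sentence, candidate) pair."""
    return {
        "distance": candidate.index - citation.index,
        "same_heading": candidate.heading == citation.heading,
        "same_paragraph": candidate.paragraph == citation.paragraph,
        "has_third_person": contains_any(candidate.text, THIRD_PERSON),
        "has_demonstrative": contains_any(candidate.text, DEMONSTRATIVES),
        "has_conjunction": contains_any(candidate.text, CONJUNCTIONS),
        "has_author_name": contains_any(candidate.text, cited_authors),
        "candidate_has_citation": bool(CITATION_MARKER.search(candidate.text)),
        "citation_marker_count": len(CITATION_MARKER.findall(citation.text)),
    }


citation = Sentence(text="Blei等提出了LDA模型[3]。", index=4, paragraph=2, heading=1)
candidate = Sentence(text="该模型假设文档由多个主题混合而成。", index=5, paragraph=2, heading=1)
features = context_features(citation, candidate, cited_authors=("Blei",))
```

Substring matching over-generates (for example, 其 also occurs inside 其他), which is presumably why the source reports a separate filtering step for detected mentions.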
Example Annotation of a Citation Context
Relative Position of Citation Context
Result of Citation Sentence Extraction
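For journals that use the sequential (numeric) citation system, citation sentences can in principle be located by their bracketed markers. The sketch below is a minimal illustration of that idea, not the procedure used in the paper: the marker pattern, the sentence splitter, and all function names are assumptions. Sentences are split only on Chinese and Western terminal punctuation other than the full stop ".", to avoid breaking decimals.

```python
import re

# Assumed shape of a numeric citation marker: [3], [1,2], [4-6].
CITATION_MARKER = re.compile(r"\[\s*\d+(?:\s*[-,，–]\s*\d+)*\s*\]")

# Zero-width split point right after sentence-ending punctuation (Python 3.7+).
SENTENCE_END = re.compile(r"(?<=[。！？!?])")


def split_sentences(text):
    """Split text into sentences, keeping the terminal punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def expand_marker(marker):
    """Turn a marker such as '[1,3-5]' into the list [1, 3, 4, 5]."""
    numbers = []
    for part in re.split(r"[,，]", marker.strip("[] ")):
        bounds = [int(n) for n in re.split(r"[-–]", part)]
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers


def extract_citation_sentences(text):
    """Return (sentence index, sentence, reference numbers) for citing sentences."""
    results = []
    for index, sentence in enumerate(split_sentences(text)):
        refs = []
        for match in CITATION_MARKER.finditer(sentence):
            refs.extend(expand_marker(match.group()))
        if refs:
            results.append((index, sentence, refs))
    return results


sample = "LDA模型被广泛用于主题抽取[1,3-5]。该方法效果较好。也有研究采用深度学习[7]。"
citing = extract_citation_sentences(sample)
```

The reference numbers returned for each sentence can then be matched against the numbered reference list to recover the citation metadata.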
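| Type | Mentions detected | After filtering | High-frequency words |
| --- | --- | --- | --- |
| Third-person pronoun | 149 | 15 | 他 (he), 他们 (they), 她 (she) |
| Demonstrative pronoun | 449 | 384 | 该 (the said), 其 (its), 这 (this), 此 (this), 另 (another) |
| Person name | 952 | 17 | |
| Document title | 68 | 15 | |
| Proper noun | 36 | 36 | LSA, LDA, NPLM |
| Conjunction | 1,777 | 549 | 然而 (however), 但 (but), 此外 (in addition), 总体而言 (overall) |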
Result of Mentions Detection and Filter
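The two-stage process in the table above (detect mentions, then filter them) can be sketched as follows. The page does not state the filtering criteria, so the filter here is purely hypothetical: it drops a hit when the matched character is part of a common compound that is not anaphoric, such as 其他 (other) or 因此 (therefore). The lexicons reuse the high-frequency words from the table; all names are assumptions.

```python
import re

# High-frequency words reported in the table; longest alternatives go first
# so that 他们 is matched as one mention rather than as 他.
LEXICONS = {
    "third_person": ("他们", "他", "她"),
    "demonstrative": ("该", "其", "这", "此", "另"),
    "conjunction": ("然而", "但", "此外", "总体而言"),
}
PATTERNS = {
    kind: re.compile("|".join(sorted(words, key=len, reverse=True)))
    for kind, words in LEXICONS.items()
}

# Hypothetical list of compounds whose characters should not count as mentions.
NON_ANAPHORIC = ("其他", "其中", "因此", "如此")


def detect_mentions(sentence):
    """Return (type, word, offset) for every lexicon hit in the sentence."""
    return [
        (kind, match.group(), match.start())
        for kind, pattern in PATTERNS.items()
        for match in pattern.finditer(sentence)
    ]


def keep_mention(sentence, mention):
    """Hypothetical filter: reject hits that sit inside a non-anaphoric compound."""
    _, word, start = mention
    for compound in NON_ANAPHORIC:
        offset = compound.find(word)
        if offset == -1:
            continue
        begin = start - offset
        if begin >= 0 and sentence[begin:begin + len(compound)] == compound:
            return False
    return True


sentence = "该方法优于其他模型，因此他们采用了它。"
detected = detect_mentions(sentence)
kept = [m for m in detected if keep_mention(sentence, m)]
```

In this example five hits are detected and two survive (他们 and 该), which mirrors, on a toy scale, the drop from detected to filtered mentions in the table.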
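| No. | Feature | Type | Information gain |
| --- | --- | --- | --- |
| 1 | Positional distance from the target citation | Nominal | 0.32805 |
| 2 | Whether the candidate context sentence contains a citation | Nominal | 0.24031 |
| 3 | Number of citations in the target citation sentence | Numeric | 0.24031 |
| 4 | Whether in the same paragraph | Nominal | 0.14075 |
| 5 | Whether under the same heading | Nominal | 0.09991 |
| 6 | Whether it contains a valid demonstrative pronoun | Nominal | 0.04826 |
| 7 | Paragraph position of the candidate sentence | Nominal | 0.03905 |
| 8 | Whether it contains a valid third-person pronoun | Nominal | 0.03107 |
| 9 | Whether it contains a cited author's name | Nominal | 0.03064 |
| 10 | Whether it contains a document title | Nominal | 0.00584 |
| 11 | Whether it contains a valid conjunction | Nominal | 0.00258 |
| 12 | Whether it contains a valid proper noun | Nominal | 0.00198 |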
Features and Information Gain for Citation Context Identification
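Information gain, the ranking criterion in the table above, measures how much knowing a feature's value reduces the entropy of the class label (here, whether a candidate sentence belongs to the citation context). The sketch below shows the standard computation for a nominal feature. The function names and toy data are illustrative assumptions; a numeric feature would first need to be discretized, and the page does not say how the authors handled that.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy given the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Toy data: 1 = sentence is part of the citation context, 0 = it is not.
labels = [1, 1, 0, 0]
same_paragraph = [True, True, False, False]    # separates the classes perfectly
has_conjunction = [True, False, True, False]   # carries no information here

gain_informative = information_gain(same_paragraph, labels)
gain_uninformative = information_gain(has_conjunction, labels)
```

A feature that splits the labels perfectly reaches the full label entropy (1 bit in this toy case), while an uninformative one scores 0, which is the sense in which the low-ranked features in the table contribute little.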
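| Random sample set | Initial features: Precision | Initial features: Recall | Initial features: F1 | Filtered features: Precision | Filtered features: Recall | Filtered features: F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.787 | 0.819 | 0.803 | 0.833 | 0.833 | 0.833 |
| 2 | 0.852 | 0.485 | 0.611 | 0.829 | 0.853 | 0.841 |
| 3 | 0.821 | 0.697 | 0.754 | 0.842 | 0.727 | 0.780 |
| 4 | 0.809 | 0.833 | 0.821 | 0.826 | 0.864 | 0.844 |
| 5 | 0.841 | 0.841 | 0.841 | 0.792 | 0.826 | 0.809 |
| 6 | 0.844 | 0.806 | 0.824 | 0.824 | 0.836 | 0.830 |
| 7 | 0.792 | 0.884 | 0.836 | 0.805 | 0.899 | 0.849 |
| 8 | 0.817 | 0.853 | 0.835 | 0.787 | 0.868 | 0.825 |
| 9 | 0.828 | 0.779 | 0.803 | 0.862 | 0.824 | 0.842 |
| 10 | 0.762 | 0.716 | 0.738 | 0.783 | 0.806 | 0.794 |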
Comparison of Random Sample Performance of Filter Features with Baselines
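The F1 values in the table above are the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 for two rows and reads off the range of the filtered-feature F1 column, which matches the 0.780 to 0.849 range quoted in the abstract. The function name is an assumption; the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sample 1, initial feature set: precision 0.787, recall 0.819 (reported F1 0.803).
f1_sample1_initial = round(f1_score(0.787, 0.819), 3)

# Sample 7, filtered feature set: precision 0.805, recall 0.899 (reported F1 0.849).
f1_sample7_filtered = round(f1_score(0.805, 0.899), 3)

# Reported F1 values for the filtered feature set, samples 1 to 10.
filtered_f1 = [0.833, 0.841, 0.780, 0.844, 0.809, 0.830, 0.849, 0.825, 0.842, 0.794]
f1_range = (min(filtered_f1), max(filtered_f1))
```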