Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (11): 37-45    DOI: 10.11925/infotech.2096-3467.2017.0606
Original Article
Recognizing and Analyzing Cited Spans in Literature
Xu Jian1, Li Gang1, Mao Jin1, Ye Guanghui2
1Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract  

[Objective] This paper analyzes the features of cited spans in cited documents and compares the effectiveness of several recognition techniques. [Methods] First, we analyzed the annotated cited spans from CL-SciSumm 2016 for their length and position features, as well as their correlations with citation contexts. Then, we compared bag-of-words, topic model, and semantic dictionary (WordNet) methods by their performance in recognizing cited spans. [Results] We found that 96% of the annotated cited spans were shorter than three sentences, and most cited spans occurred in the front part of the whole paper or of each section. The average TextRank weight of the cited spans was significantly higher than that of regular spans. The length of the cited spans was correlated with the length of their corresponding sections, but showed no obvious relationship with the position features. The bag-of-words method was the most effective, followed by the methods based on semantic similarity and topic models. [Limitations] Our discussion of the concept and characteristics of cited spans is theoretical. All data analysis was done on the annotated dataset of CL-SciSumm 2016. [Conclusions] Word choice in scientific literature is formal and rigorous, which makes lexical features play an important role in recognizing cited spans.
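The bag-of-words comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function names (`tokenize`, `jaccard`, `cosine`, `rank_candidates`), and the choice of raw term frequencies for the cosine measure are all assumptions made for illustration. The idea is to score every sentence of the cited paper against the citation context and keep the three best candidates.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, minimal tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between the two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a_tokens, b_tokens):
    """Cosine similarity between raw term-frequency vectors (assumed weighting)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(citance, sentences, similarity=jaccard, top_k=3):
    """Return indices of the top_k cited-paper sentences most similar to the citance."""
    query = tokenize(citance)
    scored = [(similarity(query, tokenize(s)), i) for i, s in enumerate(sentences)]
    # Sort by descending score; ties are broken by original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]
```

Calling `rank_candidates(citance, sentences)` uses Jaccard by default; passing `similarity=cosine` switches the measure without changing the ranking logic, which makes the two bag-of-words variants directly comparable.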

Keywords: Cited Spans; Recognition Method; Citation Context; Citation Object
Received: 21 June 2017      Published: 27 November 2017
ZTFLH:  G35  

Cite this article:

Xu Jian, Li Gang, Mao Jin, Ye Guanghui. Recognizing and Analyzing Cited Spans in Literature. Data Analysis and Knowledge Discovery, 2017, 1(11): 37-45.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0606     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I11/37

Cited paper | Citing papers (number of citations)
C00-2123 | C02-1050 (4), C04-1091 (1), E06-1004 (1), H01-1062 (1), J03-1005 (1), J04-2003 (2), J04-4002 (1), N03-1010 (1), P01-1027 (1), P03-1039 (2), W01-0505 (1), W01-1404 (1), W01-1407 (1), W01-1408 (1), W02-1020 (1)
C02-1025 | C10-2104 (1), C10-2167 (1), I05-3013 (1), I05-3030 (1), I08-2080 (1), P02-1061 (1), P03-1028 (5), P05-1045 (1), P05-1051 (1), P06-1141 (1), W03-0423 (5), W03-0432 (1), W04-0705 (2), W06-0119 (1)
C04-1089 | C10-1070 (1), C10-2164 (1), D10-1042 (1), D12-1003 (2), N09-1048 (2), P06-1011 (1), P13-1059 (1), P13-1062 (1), P13-1107 (1), P13-2036 (2), W11-1215 (1), W11-2206 (1), W13-2501 (1), W13-2512 (1)
...
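Cited span recognition method | Setting | MRR° | Sentence-level metrics (Overlap): P@3 | R@3 | F_1@3 | Word-level metrics (ROUGE_1): P°@3 | R°@3 | F_1°@3
Bag-of-words | Cosine | 23.7 | 4.6% | 10.4% | 6.4% | 22.3% | 24.4% | 23.3%
Bag-of-words | Jaccard | 38.6 | 8.6% | 19.3% | 11.9% | 27.2% | 33.2% | 29.9%
Topic similarity | K=40 | 5.3 | 0.7% | 1.2% | 0.86% | 15.1% | 11.7% | 13.2%
Topic similarity | K=80 | 6.7 | 1.3% | 2.5% | 1.7% | 16.7% | 14.6% | 15.6%
Topic similarity | K=120 | 6.9 | 0.8% | 1.7% | 1.1% | 16.9% | 16.2% | 16.5%
Topic similarity | K=160 | 6.4 | 0.9% | 1.8% | 1.2% | 17.2% | 17.5% | 17.3%
Topic similarity | K=200 | 6.8 | 1.2% | 2.3% | 1.6% | 17.4% | 17.9% | 17.7%
Semantic similarity | WordNet(N) | 14.0 | 1.2% | 4.0% | 2.6% | 20.5% | 19.3% | 19.9%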
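The sentence-level metrics in the table above can be read through a short sketch. It assumes the conventional definitions: precision and recall are computed on the overlap between the top three predicted sentence IDs and the annotated cited-span sentence IDs, F_1 is their harmonic mean, and MRR averages the reciprocal rank of the first correct sentence over all citation contexts. The function names are hypothetical, and how the paper aggregates scores across citation contexts is not stated on this page.

```python
def precision_recall_f1_at_k(ranked_ids, gold_ids, k=3):
    """Overlap-based P@k, R@k and F1@k for one citation context."""
    top = ranked_ids[:k]
    hits = len(set(top) & set(gold_ids))
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold_ids) if gold_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first correctly predicted sentence, or 0 if none is found."""
    gold = set(gold_ids)
    for rank, sentence_id in enumerate(ranked_ids, start=1):
        if sentence_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, gold_ids) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)
```

As a consistency check on the harmonic-mean reading, the Jaccard row's P@3 of 8.6% and R@3 of 19.3% give 2 × 8.6 × 19.3 / (8.6 + 19.3) ≈ 11.9%, which matches the reported F_1@3.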