%A Xu Jian, Li Gang, Mao Jin, Ye Guanghui
%T Recognizing and Analyzing Cited Spans in Literature
%0 Journal Article
%D 2017
%J Data Analysis and Knowledge Discovery
%R 10.11925/infotech.2096-3467.2017.0606
%P 37-45
%V 1
%N 11
%U {https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/article_4441.shtml}
%8 2017-11-25
%X

[Objective] This paper analyzes the features of cited spans in documents and compares the effectiveness of several recognition techniques. [Methods] First, we analyzed the annotated cited spans in the CL-SciSumm 2016 data for their length and position features, as well as their correlations with citation contexts. Then, we compared bag-of-words, topic model, and semantic dictionary (WordNet) methods by their performance in recognizing cited spans. [Results] We found that 96% of the annotated cited spans were shorter than three sentences, and most cited spans occurred near the beginning of the paper or of a section. The average TextRank weight of the cited spans was significantly higher than that of regular spans. The length of cited spans was correlated with the length of their corresponding sections, but showed no obvious relationship with the position features. The bag-of-words method was the most effective, followed by the methods based on semantic similarity and topic models. [Limitations] Our discussion of the concept and characteristics of cited spans is theoretical. All data analysis was based on the CL-SciSumm 2016 annotation dataset. [Conclusions] Word choice in scientific literature is formal and rigorous, so lexical features play an important role in recognizing cited spans.
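
[Illustration] The bag-of-words approach mentioned above can be sketched as a lexical similarity ranking: each sentence of the cited paper is scored against the citation context, and the highest-scoring sentences are proposed as the cited span. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The function names (tokenize, cosine, rank_cited_spans), the raw term-frequency weighting, the cosine measure, and the default top_k=2 are all hypothetical choices. The cutoff is loosely motivated by the finding that most annotated spans were shorter than three sentences.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no lexical overlap with anything
    return dot / (norm_a * norm_b)


def rank_cited_spans(citation_context, sentences, top_k=2):
    """Return the top_k (sentence_index, score) pairs, best match first.

    top_k=2 is an assumed default, not a value taken from the paper.
    Ties are broken in favor of the earlier sentence.
    """
    query = Counter(tokenize(citation_context))
    scored = [
        (i, cosine(query, Counter(tokenize(s))))
        for i, s in enumerate(sentences)
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    cited_paper = [
        "We propose a graph-based ranking model for sentence extraction.",
        "The corpus contains ten annotated papers.",
        "Results are reported in Table 2.",
    ]
    context = "Their graph-based ranking model extracts sentences."
    for index, score in rank_cited_spans(context, cited_paper):
        print(index, round(score, 3))
```

A ranking like this relies purely on shared vocabulary, which matches the conclusion that lexical features carry much of the signal. A topic-model or WordNet variant would replace only the cosine step with a different similarity function.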