Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (11): 37-45     https://doi.org/10.11925/infotech.2096-3467.2017.0606
Research Paper
Recognizing and Analyzing Cited Spans in Literature
Xu Jian1, Li Gang1, Mao Jin1, Ye Guanghui2
1 Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2 School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract

[Objective] This paper analyzes the features of cited spans in scientific literature and compares the effectiveness of several methods for recognizing them. [Methods] Using the cited-span annotation data from the CL-SciSumm 2016 shared task, we examined the length, position, and importance of cited spans, and analyzed how their length and position correlate with those of the corresponding citation contexts. We then compared similarity algorithms based on the bag-of-words model, topic models, and the WordNet semantic dictionary by their performance in recognizing cited spans. [Results] Of the annotated cited spans, 96% were shorter than three sentences, and they occurred more often in the front part of the paper and in the front part of each section. The mean TextRank weight of cited spans was significantly higher than that of other spans. The length of a cited span was significantly correlated with the length of its citation context, but no clear correlation was found between their positions. Whether measured by MRR° or by sentence-level and word-level matching, the bag-of-words methods outperformed the semantic-dictionary method, which in turn clearly outperformed the topic-model methods. [Limitations] The analysis of the concept and characteristics of cited spans remains at the theoretical level, and both the feature analysis and the method comparison were carried out only on the CL-SciSumm 2016 annotation data. [Conclusions] Wording in scientific literature is comparatively standardized and rigorous, so lexical features play a key role in recognizing cited spans.

Keywords: Cited Spans; Recognition Method; Citation Context; Citation Object
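
The bag-of-words comparison described in the abstract can be illustrated with a minimal sketch. It assumes that a citation context and the candidate sentences of the cited paper are available as plain strings, tokenizes them with a naive regular expression, and ranks the candidates by cosine or Jaccard similarity. The function names, the tokenizer, the raw term-frequency weighting, and the sample sentences are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the two vocabularies (token sets)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def rank_candidates(citation_context, sentences, similarity, top_k=3):
    """Return the indices of the top_k sentences most similar to the citation context."""
    context_tokens = tokenize(citation_context)
    scored = [
        (similarity(context_tokens, tokenize(sentence)), index)
        for index, sentence in enumerate(sentences)
    ]
    # Highest score first; ties are broken by position in the cited paper.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


# Hypothetical data, invented for illustration only.
cited_sentences = [
    "We present a statistical machine translation system.",
    "The decoder uses dynamic programming with beam search.",
    "Experiments were run on the Verbmobil corpus.",
]
citation_context = "Their decoder relies on beam search and dynamic programming."

top_by_jaccard = rank_candidates(citation_context, cited_sentences, jaccard_similarity)
top_by_cosine = rank_candidates(citation_context, cited_sentences, cosine_similarity)
```

With `top_k=3`, the ranked indices can be compared directly against the annotated cited spans, which matches the cut-off of three used by the @3 measures in the results table.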
Received: 2017-06-21      Published: 2017-11-27
Chinese Library Classification (ZTFLH): G35
Cite this article:
Xu Jian, Li Gang, Mao Jin, Ye Guanghui. Recognizing and Analyzing Cited Spans in Literature[J]. Data Analysis and Knowledge Discovery, 2017, 1(11): 37-45.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0606      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I11/37
Figure: Illustrative example of a cited span
Cited paper | Citing papers (number of citations)
C00-2123 | C02-1050 (4), C04-1091 (1), E06-1004 (1), H01-1062 (1), J03-1005 (1), J04-2003 (2), J04-4002 (1), N03-1010 (1), P01-1027 (1), P03-1039 (2), W01-0505 (1), W01-1404 (1), W01-1407 (1), W01-1408 (1), W02-1020 (1)
C02-1025 | C10-2104 (1), C10-2167 (1), I05-3013 (1), I05-3030 (1), I08-2080 (1), P02-1061 (1), P03-1028 (5), P05-1045 (1), P05-1051 (1), P06-1141 (1), W03-0423 (5), W03-0432 (1), W04-0705 (2), W06-0119 (1)
C04-1089 | C10-1070 (1), C10-2164 (1), D10-1042 (1), D12-1003 (2), N09-1048 (2), P06-1011 (1), P13-1059 (1), P13-1062 (1), P13-1107 (1), P13-2036 (2), W11-1215 (1), W11-2206 (1), W13-2501 (1), W13-2512 (1)
...
Table: Sources of the CL-SciSumm annotation data
Figure: Results of the position feature analysis of cited spans
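Sentence-level measures are based on Overlap; word-level measures are based on ROUGE_1.

Method | Variant | MRR° | Sentence-level P@3 | Sentence-level R@3 | Sentence-level F_1@3 | Word-level P°@3 | Word-level R°@3 | Word-level F_1°@3
Bag-of-words | Cosine | 23.7 | 4.6% | 10.4% | 6.4% | 22.3% | 24.4% | 23.3%
Bag-of-words | Jaccard | 38.6 | 8.6% | 19.3% | 11.9% | 27.2% | 33.2% | 29.9%
Topic similarity | K=40 | 5.3 | 0.7% | 1.2% | 0.86% | 15.1% | 11.7% | 13.2%
Topic similarity | K=80 | 6.7 | 1.3% | 2.5% | 1.7% | 16.7% | 14.6% | 15.6%
Topic similarity | K=120 | 6.9 | 0.8% | 1.7% | 1.1% | 16.9% | 16.2% | 16.5%
Topic similarity | K=160 | 6.4 | 0.9% | 1.8% | 1.2% | 17.2% | 17.5% | 17.3%
Topic similarity | K=200 | 6.8 | 1.2% | 2.3% | 1.6% | 17.4% | 17.9% | 17.7%
Semantic similarity | WordNet(N) | 14.0 | 1.2% | 4.0% | 2.6% | 20.5% | 19.3% | 19.9%
Table: Comparison of the effectiveness of different cited-span recognition methods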
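
The kinds of measures reported in this comparison can be sketched as follows. The sketch assumes that each method returns a ranked list of sentence IDs for a citation context and that the annotated cited span is a set of sentence IDs. It uses the standard definitions of reciprocal rank, precision/recall/F_1 at a cut-off, and clipped unigram overlap in the style of ROUGE-1. The page does not state how the scores are averaged or scaled, or what the ° marker denotes, so the function names, the averaging, and the sample data are assumptions rather than the authors' evaluation code.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first correct sentence, or 0.0 if none is retrieved."""
    for rank, sentence_id in enumerate(ranked_ids, start=1):
        if sentence_id in gold_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, gold_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def sentence_prf_at_k(ranked_ids, gold_ids, k=3):
    """Sentence-level precision, recall, and F_1 over the top-k ranked sentences."""
    top = ranked_ids[:k]
    hits = sum(1 for sentence_id in top if sentence_id in gold_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold_ids) if gold_ids else 0.0
    return precision, recall, f1_score(precision, recall)


def unigram_prf(candidate_tokens, reference_tokens):
    """Word-level precision, recall, and F_1 from clipped unigram overlap."""
    candidate, reference = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((candidate & reference).values())  # per-token minimum counts
    precision = overlap / sum(candidate.values()) if candidate else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical rankings and gold annotations, invented for illustration only.
runs = [
    ([4, 7, 2], {7}),     # first correct sentence at rank 2
    ([1, 3, 5], {9}),     # no correct sentence retrieved
    ([8, 6, 0], {8, 0}),  # first correct sentence at rank 1
]
mrr = mean_reciprocal_rank(runs)
sentence_scores = sentence_prf_at_k([8, 6, 0], {8, 0}, k=3)
word_scores = unigram_prf(
    "the decoder uses beam search".split(),
    "the decoder relies on beam search".split(),
)
```

A method can rank the right sentence a few places too low and still retrieve text that shares most of its vocabulary with the annotated span, which is one way the word-level scores can sit well above the sentence-level ones.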