Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (1): 66-77    DOI: 10.11925/infotech.2096-3467.2020.0548
Identifying Citation Texts with Unsupervised Method
Hyonil Kim, Ou Shiyan
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a method to automatically identify citation texts and compares the contents of citation sentences. [Methods] We developed an unsupervised method to find implicit citation sentences and compared the similarity between these sentences and the citing/cited papers. We combined the vector space model and the word embedding model to calculate the similarity precisely. [Results] We identified the implicit citation sentences of two highly-cited papers from 200 citing articles, and the proposed method’s F-value was above 92%. Comparing the contents of the explicit and implicit citation sentences, we found significant differences in citation functions and sentiments. Implicit citation sentences were used for research background and technical basis more often than explicit ones, and for research basis and comparison less often. 45.3% of the explicit citation sentences were positive references, while 78.8% of the implicit citation sentences were neutral. [Limitations] We only investigated citation texts at the sentence level. More research is needed on clause-level and phrase-level identification. [Conclusions] The proposed method could effectively identify implicit citation sentences.
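
The similarity comparison described in the abstract can be illustrated with a minimal sketch. It assumes a document vector built as the TF-IDF-weighted average of word vectors (in the spirit of the TFIDF-AWV label used in the tables) and a decision rule that treats a sentence as an implicit citation candidate when it is closer to the cited paper than to the citing paper. The function names, the toy two-dimensional embeddings, and the default IDF of 1.0 are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_average_vector(tokens, embeddings, idf, dim):
    """Average the word vectors of `tokens`, weighting each word by TF * IDF."""
    tf = Counter(t for t in tokens if t in embeddings)
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        weight = count * idf.get(word, 1.0)  # assumption: unknown IDF defaults to 1.0
        weight_sum += weight
        for i, x in enumerate(embeddings[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def closer_to_cited(sentence, citing_doc, cited_doc, embeddings, idf, dim):
    """True if the sentence is more similar to the cited paper than to the citing paper."""
    s = weighted_average_vector(sentence, embeddings, idf, dim)
    citing = weighted_average_vector(citing_doc, embeddings, idf, dim)
    cited = weighted_average_vector(cited_doc, embeddings, idf, dim)
    return cosine(s, cited) > cosine(s, citing)


# Toy data (hypothetical): two-dimensional embeddings for four words.
embeddings = {
    "topic": [1.0, 0.0],
    "model": [0.9, 0.1],
    "network": [0.0, 1.0],
    "training": [0.1, 0.9],
}
idf = {}
sentence = ["topic", "model"]
cited_doc = ["topic", "model", "topic"]
citing_doc = ["network", "training"]
print(closer_to_cited(sentence, citing_doc, cited_doc, embeddings, idf, dim=2))
```

In practice the embeddings and IDF values would come from a trained word embedding model and a real corpus; the toy values above only make the sketch runnable.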

Key words: Citation Text Identification; Implicit Citation Sentence; Citation Context Analysis
Received: 11 June 2020      Published: 02 September 2020
ZTFLH:  TP393  
Fund: The work is supported by the National Social Science Fund of China (Grant No. 17ATQ001)
Corresponding Authors: Ou Shiyan     E-mail: oushiyan@nju.edu.cn

Cite this article:

Hyonil Kim, Ou Shiyan. Identifying Citation Texts with Unsupervised Method. Data Analysis and Knowledge Discovery, 2021, 5(1): 66-77.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0548     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I1/66

A Citation Text Sample that Contains the Explicit and Implicit Citation Sentences
An Identification Framework for Implicit Citation Sentences
| Sense | Original sentence | Sentence after replacement |
|---|---|---|
| cord | He hung on to his line and landed the fish. | He hung on to his line_cord and landed the fish. |
| division | Further, blur the legal line separating commercial and investment banking. | Further, blur the legal line_division separating commercial and investment banking. |
| formation | Correspondent said in the passport line at Moscow’s Sheremetyevo airport. | Correspondent said in the passport line_formation at Moscow’s Sheremetyevo airport. |
| phone | He made another call and came back on the line with the news that … | He made another call and came back on the line_phone with the news that … |
| product | In addition, Mr. Frashier will push for development of a line of protein-based adhesive and coating products. | In addition, Mr. Frashier will push for development of a line_product of protein-based adhesive and coating products. |
| text | Clients reportedly get a one-page bill on which is written a single line. | Clients reportedly get a one-page bill on which is written a single line_text. |
Training Corpus for Multi-Sense Word Vectors (Taking “line” as an Example)
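| Multi-sense word | Sense word or combination model | Cosine similarity |
|---|---|---|
| line | line_cord | 0.45 |
| line | line_division | 0.54 |
| line | line_formation | 0.48 |
| line | line_phone | 0.57 |
| line | line_product | 0.92 |
| line | line_text | 0.46 |
| line | AWV | 0.74 |
| line | TF-AWV | 0.96 |
| interest | interest1 | 0.59 |
| interest | interest2 | 0.62 |
| interest | interest3 | 0.53 |
| interest | interest4 | 0.49 |
| interest | interest5 | 0.57 |
| interest | interest6 | 0.86 |
| interest | AWV | 0.78 |
| interest | TF-AWV | 0.92 |
| server | server2 | 0.79 |
| server | server6 | 0.62 |
| server | server10 | 0.79 |
| server | server12 | 0.78 |
| server | AWV | 0.91 |
| server | TF-AWV | 0.93 |
| hard | hard1 | 0.98 |
| hard | hard2 | 0.81 |
| hard | hard3 | 0.61 |
| hard | AWV | 0.94 |
| hard | TF-AWV | 0.99 |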
The Cosine Similarity Between the Real Word Vector of a Multi-Sense Word and its Predicted Word Vectors Based on the Two Linear Combination Models
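
A rough sketch of how two linear combinations of sense vectors might be compared against a word's real vector follows. It assumes that AWV is an unweighted average of the sense vectors and that TF-AWV weights each sense by its frequency in the corpus; this page does not define either model, so both readings are assumptions. The sense vectors, counts, and the "real" vector are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def combine(sense_vectors, weights=None):
    """Linear combination of sense vectors; uniform weights when none are given."""
    names = list(sense_vectors)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    dim = len(sense_vectors[names[0]])
    return [
        sum(weights[name] * sense_vectors[name][i] for name in names) / total_weight
        for i in range(dim)
    ]


# Toy values (hypothetical): one dominant sense and one rare sense of "line".
sense_vectors = {"line_product": [1.0, 0.0], "line_phone": [0.0, 1.0]}
sense_counts = {"line_product": 9, "line_phone": 1}
real_line = [0.9, 0.1]  # stand-in for the vector trained on the unmodified corpus

awv = combine(sense_vectors)                   # assumed AWV: plain average
tf_awv = combine(sense_vectors, sense_counts)  # assumed TF-AWV: frequency-weighted average
print(cosine(real_line, awv), cosine(real_line, tf_awv))
```

With a dominant sense, the frequency-weighted combination lands closer to the real vector than the plain average, which is the same direction as the AWV and TF-AWV rows in the table.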
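Proportion of implicit citation sentences that are more similar to the cited reference:

| Document vector representation model | Abbreviation | Papers represented by abstract | Papers represented by full text |
|---|---|---|---|
| Traditional vector space model | TFIDF-VSM | 69.33% | 57.11% |
| Document vector model based on TF or TF-IDF weights and word vectors | TF-AWV | 79.37% | 59.20% |
| Document vector model based on TF or TF-IDF weights and word vectors | TFIDF-AWV | 80.32% | 70.82% |
| Vector space model based on TF-IDF weights and word vectors | PTFIDF-VSM | 73.62% | 70.65% |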
Similarity Between Implicit Citation Sentences and Citing/Cited Papers Based on Various Document Vector Models
The Performance of the Implicit Sentence Identification with Different Left Boundaries While the Right Boundary is Fixed to 10
The Performance of the Implicit Sentence Identification with Different Right Boundaries While the Left Boundary is Fixed to 2
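
The two captions above refer to left and right boundaries without defining them on this page. One possible reading, sketched below purely as an assumption, is that the boundaries limit how many sentences before and after an explicit citation sentence are examined as implicit citation candidates. The function name and the defaults (left 2, right 10, taken from the fixed values in the captions) are illustrative only.

```python
def candidate_indices(num_sentences, explicit_index, left=2, right=10):
    """Indices of sentences within `left` before and `right` after an explicit citation.

    Assumption: boundaries are counted in sentences and clipped to the document;
    the explicit citation sentence itself is excluded from the candidates.
    """
    start = max(0, explicit_index - left)
    end = min(num_sentences - 1, explicit_index + right)
    return [i for i in range(start, end + 1) if i != explicit_index]


# A 20-sentence document with an explicit citation at sentence index 5.
print(candidate_indices(20, 5))
```

Each candidate index would then be passed to whatever similarity test decides whether the sentence is an implicit citation.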
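| Document vector representation model | Abbreviation | Abstract R/% | Abstract P/% | Abstract F1/% | Full text R/% | Full text P/% | Full text F1/% |
|---|---|---|---|---|---|---|---|
| Traditional vector space model | TFIDF-VSM | 69.33 | 97.25 | 80.95 | 57.11 | 100.00 | 72.70 |
| Doc2Vec model | PV-DBOW | 63.06 | 84.96 | 72.39 | 54.28 | 97.66 | 69.77 |
| Document vector model based on TF or TF-IDF weights and word vectors | TF-AWV | 79.78 | 96.52 | 87.36 | 59.20 | 99.25 | 74.16 |
| Document vector model based on TF or TF-IDF weights and word vectors | TFIDF-AWV | 80.32 | 99.43 | 88.86 | 70.82 | 98.57 | 82.42 |
| Vector space model based on TF-IDF weights and word vectors | PTFIDF-VSM | 73.62 | 98.76 | 84.36 | 70.65 | 100.00 | 82.80 |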
The Identification Performance of the Implicit Citation Sentences Based on Various Document Vector Models
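| Combination | Abstract P/% | Abstract R/% | Abstract F1/% | Full text P/% | Full text R/% | Full text F1/% |
|---|---|---|---|---|---|---|
| TFIDF-AWV + PTFIDF-VSM | 99.45 | 90.51 | 94.77 | 99.05 | 87.63 | 92.99 |
| PTFIDF-VSM + TFIDF-AWV | 98.90 | 90.10 | 94.30 | 99.52 | 88.04 | 93.43 |
| TFIDF-AWV + PV-DBOW | 96.67 | 89.54 | 92.97 | 97.92 | 80.73 | 88.49 |
| TFIDF-AWV + TFIDF-VSM | 98.47 | 87.78 | 92.82 | 98.76 | 80.49 | 88.69 |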
The Identification Performance of the Implicit Citation Sentences Based on Various Hybrid Models
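
The F1 values in the hybrid-model table are consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes two of the reported values from their P and R columns; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R (in %) for papers represented by their abstracts.
print(round(f1_score(99.45, 90.51), 2))  # TFIDF-AWV + PTFIDF-VSM
print(round(f1_score(98.90, 90.10), 2))  # PTFIDF-VSM + TFIDF-AWV
```

Rounded to two decimals, these reproduce the 94.77 and 94.30 reported for the two combinations.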
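| Field of citing papers | Citing papers | Explicit citation sentences | Implicit citation sentences | P/% | R/% | F1/% |
|---|---|---|---|---|---|---|
| Computer science | 89 | 118 | 214 | 89.5 | 97.6 | 93.4 |
| Engineering | 65 | 86 | 136 | 91.0 | 95.9 | 93.4 |
| Physics | 25 | 40 | 53 | 89.2 | 97.8 | 93.3 |
| Medicine | 24 | 31 | 54 | 89.8 | 92.9 | 91.3 |
| Other | 22 | 32 | 48 | 80.9 | 95.0 | 87.4 |
| Total | 225 | 307 | 505 | - | - | - |
| Average | - | - | - | 89.0 | 96.4 | 92.6 |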
The Identification Performance of the Implicit Citation Sentences of the Highly-Cited Paper on Deep Neural Network
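| Field of citing papers | Citing papers | Explicit citation sentences | Implicit citation sentences | P/% | R/% | F1/% |
|---|---|---|---|---|---|---|
| Computer science | 92 | 146 | 253 | 97.3 | 88.7 | 92.8 |
| Engineering | 39 | 58 | 89 | 96.3 | 87.8 | 91.8 |
| Management | 28 | 41 | 82 | 95.5 | 87.0 | 91.1 |
| Medicine | 13 | 24 | 45 | 95.3 | 88.4 | 91.7 |
| Business | 10 | 11 | 22 | 100.0 | 90.9 | 95.2 |
| Other | 25 | 47 | 90 | 96.3 | 88.1 | 92.0 |
| Total | 207 | 327 | 581 | - | - | - |
| Average | - | - | - | 96.6 | 88.3 | 92.3 |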
The Identification Performance of the Implicit Citation Sentences of the Highly-Cited Paper on LDA Topic Model
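| Citation sentence type | “Background” count | “Background” share/% | “Use” count | “Use” share/% | “Based-on” count | “Based-on” share/% | “Comparison” count | “Comparison” share/% |
|---|---|---|---|---|---|---|---|---|
| Explicit citation sentences | 1,223 | 75.4 | 306 | 18.9 | 60 | 3.6 | 33 | 2.3 |
| Implicit citation sentences | 2,762 | 77.5 | 755 | 21.2 | 12 | 0.3 | 34 | 0.9 |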
The Citation Function Distribution in Explicit Citation Sentences and Implicit Citation Sentences
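| Citation sentence type | Positive count | Positive share/% | Negative count | Negative share/% | Neutral count | Neutral share/% |
|---|---|---|---|---|---|---|
| Explicit citation sentences | 734 | 45.3 | 83 | 5.1 | 805 | 49.6 |
| Implicit citation sentences | 546 | 15.3 | 208 | 5.8 | 2,809 | 78.8 |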
The Citation Sentiment Distribution in Explicit Citation Sentences and Implicit Citation Sentences
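
The percentage columns in the sentiment table can be recomputed from the raw counts. The sketch below assumes each share is simply the count divided by the row total, rounded to one decimal place; the function name is illustrative.

```python
def shares(counts):
    """Percentage share of each count in the row total, rounded to one decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


explicit = [734, 83, 805]    # positive, negative, neutral
implicit = [546, 208, 2809]  # positive, negative, neutral
print(shares(explicit))
print(shares(implicit))
```

Under that assumption, the computed shares match the table: 45.3, 5.1, and 49.6 for explicit citation sentences, and 15.3, 5.8, and 78.8 for implicit ones.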