Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (1): 66-77
Identifying Citation Texts with Unsupervised Method
Hyonil Kim, Ou Shiyan
School of Information Management, Nanjing University, Nanjing 210023, China
[Objective] This paper explores a method for automatically identifying citation texts in citing papers and compares the content of different types of citation sentences. [Methods] We propose an unsupervised method that identifies implicit citation sentences by comparing each candidate sentence's text similarity to the citing paper with its similarity to the cited paper. To calculate text similarity precisely, we propose two document vector models that combine the vector space model with word embeddings. [Results] We identified the implicit citation sentences in about 200 citing articles for each of two highly cited papers, and the F-value of the proposed method was above 92% in both cases. Comparing the contents of explicit and implicit citation sentences revealed clear differences in citation function and sentiment. The share of implicit citation sentences expressing research background and technical basis was higher than that of explicit ones, while the share expressing research basis and research comparison was lower. 45.3% of the explicit citation sentences were positive citations, while 78.8% of the implicit citation sentences were neutral. [Limitations] We only identified citation texts at the sentence level. Identification at the clause and phrase levels needs further research. [Conclusions] Implicit citation sentences need to be identified when identifying citation texts, and the proposed method identifies them effectively.
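The decision rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: raw term-frequency vectors and cosine similarity stand in for the paper's document vector models, and the names `tokenize`, `cosine`, and `is_implicit_citation`, as well as the strict "greater than" comparison, are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:()[]\"'"

def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_implicit_citation(sentence, citing_text, cited_text):
    """Assumed rule: a candidate sentence counts as an implicit citation
    when it is more similar to the cited paper than to the citing paper."""
    s = Counter(tokenize(sentence))
    sim_cited = cosine(s, Counter(tokenize(cited_text)))
    sim_citing = cosine(s, Counter(tokenize(citing_text)))
    return sim_cited > sim_citing

# Invented toy texts, for illustration only.
cited = "latent dirichlet allocation is a generative probabilistic topic model for collections of documents"
citing = "we classify customer reviews by sentiment using a support vector machine"
print(is_implicit_citation("The topic model treats documents as mixtures over latent topics.", citing, cited))
```

Swapping the `Counter` vectors for richer document vectors would leave the comparison itself unchanged.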

Key words: Citation Text Identification; Implicit Citation Sentence; Citation Context Analysis
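Received: 2020-06-11      Published: 2020-09-02
Hyonil Kim, Ou Shiyan. Unsupervised Automatic Identification and Analysis of Citation Texts[J]. Data Analysis and Knowledge Discovery, 2021, 5(1): 66-77. (in Chinese)
Hyonil Kim, Ou Shiyan. Identifying Citation Texts with Unsupervised Method. Data Analysis and Knowledge Discovery, 2021, 5(1): 66-77.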
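Fig.1  An example of citation text containing explicit and implicit citation sentences
Fig.2  Framework for identifying implicit citation sentences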
Sense      Example sentences
cord       Original: He hung on to his line and landed the fish.
           Replaced: He hung on to his line_cord and landed the fish.
division   Original: Further, blur the legal line separating commercial and investment banking.
           Replaced: Further, blur the legal line_division separating commercial and investment banking.
formation  Original: Correspondent said in the passport line at Moscow’s Sheremetyevo airport.
           Replaced: Correspondent said in the passport line_formation at Moscow’s Sheremetyevo airport.
phone      Original: He made another call and came back on the line with the news that …
           Replaced: He made another call and came back on the line_phone with the news that …
product    Original: In addition, Mr. Frashier will push for development of a line of protein-based adhesive and coating products.
           Replaced: In addition, Mr. Frashier will push for development of a line_product of protein-based adhesive and coating products.
text       Original: Clients reportedly get a one-page bill on which is written a single line.
           Replaced: Clients reportedly get a one-page bill on which is written a single line_text.
Table 1  Training corpus for the word vectors of polysemous words (using the polysemous word "line" as an example)
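The replacement step shown in Table 1 can be sketched as a whole-word substitution. The helper name `tag_sense` and the regular-expression approach are assumptions made for illustration. The sense label for each sentence is taken as given, since the table does not show how it is assigned.

```python
import re

def tag_sense(sentence, word, sense):
    """Replace whole-word occurrences of `word` with `word_sense`.

    Word boundaries keep longer words such as "airline" or "lines" intact,
    and an already tagged token such as "line_cord" is not tagged again,
    because "_" counts as a word character.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return pattern.sub(f"{word}_{sense}", sentence)

print(tag_sense("He hung on to his line and landed the fish.", "line", "cord"))
```

A corpus tagged this way lets a word embedding model learn one vector per sense token (for example `line_cord`) alongside the vector of the untagged word.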
Polysemous word  Sense word or model  Cosine similarity
line             line_cord            0.45
                 line_division        0.54
                 line_formation       0.48
                 line_phone           0.57
                 line_product         0.92
                 line_text            0.46
                 AWV                  0.74
                 TF-AWV               0.96
interest         interest1            0.59
                 interest2            0.62
                 interest3            0.53
                 interest4            0.49
                 interest5            0.57
                 interest6            0.86
                 AWV                  0.78
                 TF-AWV               0.92
server           server2              0.79
                 server6              0.62
                 server10             0.79
                 server12             0.78
                 AWV                  0.91
                 TF-AWV               0.93
hard             hard1                0.98
                 hard2                0.81
                 hard3                0.61
                 AWV                  0.94
                 TF-AWV               0.99
Table 2  Cosine similarity between the true word vectors of polysemous words and the word vectors predicted by the linear combination model
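The AWV and TF-AWV rows of Table 2 compare a polysemous word's true vector with a linear combination of its sense vectors. The sketch below assumes that AWV is the unweighted average of the sense vectors and that TF-AWV weights each sense vector by its frequency; the page does not define the abbreviations, so both readings are assumptions. All vectors and counts are invented toy values.

```python
import math

def weighted_average(vectors, weights):
    """Linear combination of vectors, normalized by the sum of the weights."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 2-dimensional sense vectors and sense frequencies (hypothetical).
sense_vectors = [[1.0, 0.0], [0.0, 1.0]]   # e.g. line_cord, line_product
sense_counts = [1, 3]                      # the second sense is three times as frequent
true_vector = [0.3, 0.9]                   # vector of the untagged word "line"

awv = weighted_average(sense_vectors, [1] * len(sense_vectors))
tf_awv = weighted_average(sense_vectors, sense_counts)

print(round(cosine(awv, true_vector), 3), round(cosine(tf_awv, true_vector), 3))
```

In this toy setting the frequency-weighted combination lies closer to the true vector, which matches the direction of the table for line, interest, server, and hard.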
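                                                                                    Share of implicit citation sentences more similar to the cited reference
Document vector representation model                                  Abbreviation  Documents as abstracts  Documents as full texts
Traditional vector space model                                        TFIDF-VSM     69.33%                  57.11%
Document vector model based on TF or TF-IDF weights and word vectors  TF-AWV        79.37%                  59.20%
                                                                      TFIDF-AWV     80.32%                  70.82%
Vector space model based on TF-IDF weights and word vectors           PTFIDF-VSM    73.62%                  70.65%
Table 3  Comparison of the similarity between implicit citation sentences and the citing/cited papers under different document vector representation models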
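One plausible reading of the TFIDF-AWV model in Table 3 is a document vector built as the TF-IDF-weighted average of the word vectors of the document's terms. The sketch below follows that reading. The smoothed IDF formula, the function names, and the toy embeddings are assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Smoothed inverse document frequency over a list of token lists."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

def tfidf_awv(tokens, corpus, embeddings):
    """TF-IDF-weighted average of word vectors; out-of-vocabulary terms are skipped."""
    tf = Counter(t for t in tokens if t in embeddings)
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for term, count in tf.items():
        weight = count * idf(term, corpus)
        total += weight
        for i, x in enumerate(embeddings[term]):
            vec[i] += weight * x
    return [x / total for x in vec] if total else vec

# Toy embeddings and corpus (hypothetical values).
embeddings = {"topic": [1.0, 0.0], "model": [0.0, 1.0]}
corpus = [["topic", "model"], ["model"]]
doc_vec = tfidf_awv(["topic", "model", "unseen"], corpus, embeddings)
print(doc_vec)
```

Because "topic" occurs in fewer documents than "model", it receives a larger weight and pulls the document vector toward its own embedding.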
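Fig.3  Implicit citation sentence identification performance as the left window length varies (right window length fixed at 10)
Fig.4  Implicit citation sentence identification performance as the right window length varies (left window length fixed at 2)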
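The window lengths in Fig.3 and Fig.4 suggest that candidate sentences are drawn from a window around each explicit citation sentence. The sketch below assumes that both lengths are counted in sentences, with the left window covering preceding sentences and the right window covering following ones. That interpretation and the function name `candidate_window` are assumptions. The defaults 2 and 10 are only the fixed values named in the two captions.

```python
def candidate_window(sentences, explicit_index, left=2, right=10):
    """Return indices of candidate sentences around an explicit citation sentence.

    Up to `left` sentences before and `right` sentences after the explicit
    citation sentence are returned; the explicit sentence itself is excluded.
    """
    start = max(0, explicit_index - left)
    end = min(len(sentences), explicit_index + right + 1)
    return [i for i in range(start, end) if i != explicit_index]

sentences = [f"sentence {i}" for i in range(20)]
print(candidate_window(sentences, 5))
```

Each returned index would then be checked by the similarity comparison to decide whether that sentence is an implicit citation sentence.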
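                                                                                    Documents as abstracts    Documents as full texts
Document vector representation model                                  Abbreviation  R/%     P/%     F1/%      R/%     P/%     F1/%
Traditional vector space model                                        TFIDF-VSM     69.33   97.25   80.95     57.11   100.00  72.70
Doc2Vec model                                                         PV-DBOW       63.06   84.96   72.39     54.28   97.66   69.77
Document vector model based on TF or TF-IDF weights and word vectors  TF-AWV        79.78   96.52   87.36     59.20   99.25   74.16
                                                                      TFIDF-AWV     80.32   99.43   88.86     70.82   98.57   82.42
Vector space model based on TF-IDF weights and word vectors           PTFIDF-VSM    73.62   98.76   84.36     70.65   100.00  82.80
Table 4  Performance of implicit citation sentence identification under different document representation models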
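The F1 columns in Table 4 are consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes two rows of the table from their reported P and R values; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# TFIDF-VSM and TFIDF-AWV rows, documents represented as abstracts.
print(round(f1_score(97.25, 69.33), 2))
print(round(f1_score(99.43, 80.32), 2))
```

Both values round to the F1 figures reported for those rows (80.95 and 88.86).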
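                          Documents as abstracts    Documents as full texts
Combination mode          P/%     R/%     F1/%      P/%     R/%     F1/%
(unlabeled in source)     99.45   90.51   94.77     99.05   87.63   92.99
PTFIDF-VSM + TFIDF-AWV    98.90   90.10   94.30     99.52   88.04   93.43
(unlabeled in source)     96.67   89.54   92.97     97.92   80.73   88.49
TFIDF-AWV + TFIDF-VSM     98.47   87.78   92.82     98.76   80.49   88.69
Table 5  Performance of implicit citation sentence identification under different combination modes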
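                                                                    Implicit citation sentences
Field of citing papers  Citing papers  Explicit citation sentences  Count   P/%     R/%     F1/%
Computer science        89             118                          214     89.5    97.6    93.4
Engineering             65             86                           136     91.0    95.9    93.4
Physics                 25             40                           53      89.2    97.8    93.3
Medicine                24             31                           54      89.8    92.9    91.3
Other                   22             32                           48      80.9    95.0    87.4
Total                   225            307                          505     -       -       -
Average                 -              -                            -       89.0    96.4    92.6
Table 6  Identification results of implicit citation sentences for the highly cited paper on deep neural networks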
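                                                                    Implicit citation sentences
Field of citing papers  Citing papers  Explicit citation sentences  Count   P/%     R/%     F1/%
Computer science        92             146                          253     97.3    88.7    92.8
Engineering             39             58                           89      96.3    87.8    91.8
Management              28             41                           82      95.5    87.0    91.1
Medicine                13             24                           45      95.3    88.4    91.7
Business                10             11                           22      100.0   90.9    95.2
Other                   25             47                           90      96.3    88.1    92.0
Total                   207            327                          581     -       -       -
Average                 -              -                            -       96.6    88.3    92.3
Table 7  Identification results of implicit citation sentences for the highly cited paper on the LDA topic model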
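                             Background        Use               Based on          Comparison
Citation sentence type       Count    Share/%  Count    Share/%  Count    Share/%  Count    Share/%
Explicit citation sentences  1,223    75.4     306      18.9     60       3.6      33       2.3
Implicit citation sentences  2,762    77.5     755      21.2     12       0.3      34       0.9
Table 8  Distribution of citation functions in explicit and implicit citation sentences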
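                             Positive citation   Negative citation   Neutral citation
Citation sentence type       Count    Share/%    Count    Share/%    Count    Share/%
Explicit citation sentences  734      45.3       83       5.1        805      49.6
Implicit citation sentences  546      15.3       208      5.8        2,809    78.8
Table 9  Distribution of citation sentiments in explicit and implicit citation sentences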
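The shares in Table 9 can be recomputed from the counts as each category's percentage of all sentences of that type. The sketch below does this for both rows; the helper name `shares` and the rounding to one decimal place are choices made here for illustration.

```python
def shares(counts):
    """Percentage share of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Counts as reported in Table 9.
explicit = {"positive": 734, "negative": 83, "neutral": 805}
implicit = {"positive": 546, "negative": 208, "neutral": 2809}

print(shares(explicit))
print(shares(implicit))
```

The recomputed values agree with the table, including the two figures quoted in the abstract: 45.3% positive among explicit citation sentences and 78.8% neutral among implicit ones.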
