Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue 1: 66-77     https://doi.org/10.11925/infotech.2096-3467.2020.0548
Identifying Citation Texts with Unsupervised Method*
Hyonil Kim, Ou Shiyan
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract

[Objective] This paper explores methods for automatically identifying citation texts in citing papers and compares the content of different types of citation sentences. [Methods] We propose an unsupervised method that identifies implicit citation sentences by comparing each candidate sentence's textual similarity to the citing paper with its similarity to the cited paper. To calculate text similarity precisely, we propose two document vector models that combine the vector space model with a word embedding model. [Results] We identified the implicit citation sentences in about 200 citing papers for each of two highly cited papers, and the F-value of the proposed method was above 92% in both cases. Comparing the contents of explicit and implicit citation sentences revealed clear differences in citation function and sentiment. The proportions of implicit citation sentences expressing research background and technical basis were higher than those of explicit citation sentences, while the proportions expressing research basis and research comparison were lower. 45.3% of the explicit citation sentences were positive citations, while 78.8% of the implicit citation sentences were neutral. [Limitations] We only identified citation texts at the sentence level. Identification at the phrase level remains to be explored. [Conclusions] Implicit citation sentences need to be identified when identifying citation texts, and the proposed method achieves high performance on this task.

Keywords: Citation Text Identification; Implicit Citation Sentence; Citation Context Analysis
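The decision rule described in the abstract (compare a candidate sentence's similarity to the citing paper with its similarity to the cited paper) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: it uses plain term-frequency vectors rather than the combined vector space and word embedding models proposed in the paper, and the function names `tf_vector`, `cosine` and `is_implicit_citation`, as well as the toy texts, are hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Return a term-frequency vector (word -> count) for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_implicit_citation(candidate, citing_text, cited_text):
    """Assumed rule: flag the candidate when it is closer to the cited paper."""
    s = tf_vector(candidate)
    return cosine(s, tf_vector(cited_text)) > cosine(s, tf_vector(citing_text))


# Toy texts (hypothetical, for illustration only).
citing = "we classify product reviews with a support vector machine"
cited = "latent dirichlet allocation is a generative topic model for documents"

print(is_implicit_citation("the topic model treats documents as mixtures", citing, cited))
print(is_implicit_citation("our reviews are classified by sentiment", citing, cited))
```

A real pipeline would replace `tf_vector` with whichever document vector model is being evaluated; the comparison step itself stays the same.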
Received: 2020-06-11      Published: 2020-09-02
CLC number (ZTFLH): TP393
Funding: *This work is one of the research outcomes of a Key Project of the National Social Science Fund of China (Grant No. 17ATQ001).
Corresponding author: Ou Shiyan     E-mail: oushiyan@nju.edu.cn
Cite this article:
Hyonil Kim, Ou Shiyan. Identifying Citation Texts with Unsupervised Method. Data Analysis and Knowledge Discovery, 2021, 5(1): 66-77.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0548      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I1/66
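Fig.1  An example of citation text containing explicit and implicit citation sentences
Fig.2  Framework for identifying implicit citation sentences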
| Sense | Original sentence | Sentence after replacement |
| --- | --- | --- |
| cord | He hung on to his line and landed the fish. | He hung on to his line_cord and landed the fish. |
| division | Further, blur the legal line separating commercial and investment banking. | Further, blur the legal line_division separating commercial and investment banking. |
| formation | Correspondent said in the passport line at Moscow’s Sheremetyevo airport. | Correspondent said in the passport line_formation at Moscow’s Sheremetyevo airport. |
| phone | He made another call and came back on the line with the news that … | He made another call and came back on the line_phone with the news that … |
| product | In addition, Mr. Frashier will push for development of a line of protein-based adhesive and coating products. | In addition, Mr. Frashier will push for development of a line_product of protein-based adhesive and coating products. |
| text | Clients reportedly get a one-page bill on which is written a single line. | Clients reportedly get a one-page bill on which is written a single line_text. |

Table 1  Training corpus for polysemous-word vectors (using the polysemous word "line" as an example)
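The replacement step shown in Table 1 can be sketched as below. The sketch assumes that the sense label of each sentence is already known (for example from a sense-annotated corpus); the helper name `tag_sense` is hypothetical and is not taken from the paper.

```python
import re


def tag_sense(sentence, word, sense):
    """Replace each whole-word occurrence of `word` with `word_sense`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids any special handling of backslashes.
    return re.sub(pattern, lambda match: f"{word}_{sense}", sentence)


original = "He hung on to his line and landed the fish."
print(tag_sense(original, "line", "cord"))
# prints: He hung on to his line_cord and landed the fish.
```

After this substitution, a word embedding model trained on the tagged corpus would learn one vector per sense token such as `line_cord`.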
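| Polysemous word | Sense word / combination model | Cosine similarity |
| --- | --- | --- |
| line | line_cord | 0.45 |
| line | line_division | 0.54 |
| line | line_formation | 0.48 |
| line | line_phone | 0.57 |
| line | line_product | 0.92 |
| line | line_text | 0.46 |
| line | AWV | 0.74 |
| line | TF-AWV | 0.96 |
| interest | interest1 | 0.59 |
| interest | interest2 | 0.62 |
| interest | interest3 | 0.53 |
| interest | interest4 | 0.49 |
| interest | interest5 | 0.57 |
| interest | interest6 | 0.86 |
| interest | AWV | 0.78 |
| interest | TF-AWV | 0.92 |
| server | server2 | 0.79 |
| server | server6 | 0.62 |
| server | server10 | 0.79 |
| server | server12 | 0.78 |
| server | AWV | 0.91 |
| server | TF-AWV | 0.93 |
| hard | hard1 | 0.98 |
| hard | hard2 | 0.81 |
| hard | hard3 | 0.61 |
| hard | AWV | 0.94 |
| hard | TF-AWV | 0.99 |

Table 2  Cosine similarity between the true word vectors of polysemous words and the word vectors predicted by the linear combination model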
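The comparison in Table 2 can be illustrated with a small sketch. The page does not define AWV and TF-AWV, so the code assumes that AWV is an unweighted average of the sense vectors and TF-AWV is an average weighted by sense frequency. The 2-D vectors and counts are invented toy values, not data from the paper.

```python
import math


def weighted_average(vectors, weights):
    """Weighted average of equal-length vectors (lists of floats)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical sense vectors and sense frequencies (assumptions for illustration).
sense_vectors = [[1.0, 0.0], [0.0, 1.0]]
sense_counts = [3, 1]
true_vector = [3.0, 1.0]  # toy stand-in for the vector of the untagged word

awv = weighted_average(sense_vectors, [1, 1])           # assumed AWV: plain average
tf_awv = weighted_average(sense_vectors, sense_counts)  # assumed TF-AWV: frequency-weighted

print(cosine(true_vector, awv))
print(cosine(true_vector, tf_awv))
```

In this constructed example the frequency-weighted combination lies closer to the toy true vector than the plain average, which is the same direction as the AWV and TF-AWV rows in Table 2.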
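| Document vector representation model | Abbreviation | Share of implicit citation sentences more similar to the cited reference (papers represented by abstract) | Share of implicit citation sentences more similar to the cited reference (papers represented by full text) |
| --- | --- | --- | --- |
| Traditional vector space model | TFIDF-VSM | 69.33% | 57.11% |
| Document vector representation model based on TF or TF-IDF weights and word vectors | TF-AWV | 79.37% | 59.20% |
| Document vector representation model based on TF or TF-IDF weights and word vectors | TFIDF-AWV | 80.32% | 70.82% |
| Vector space model based on TF-IDF weights and word vectors | PTFIDF-VSM | 73.62% | 70.65% |

Table 3  Comparison of the similarity of implicit citation sentences to the citing paper and to the cited reference under various document vector representation models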
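Fig.3  Implicit citation sentence identification performance as the left window length varies (right window length fixed at 10)
Fig.4  Implicit citation sentence identification performance as the right window length varies (left window length fixed at 2)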
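The left and right window lengths in Fig.3 and Fig.4 can be pictured with the sketch below. It assumes that the window is counted in sentences before and after an explicit citation sentence, which this page does not state explicitly. The function name `candidate_indices` is hypothetical, and the defaults 2 and 10 are simply the values held fixed in the two figure captions.

```python
def candidate_indices(explicit_idx, n_sentences, left=2, right=10):
    """Indices of sentences inside the window around an explicit citation sentence."""
    start = max(0, explicit_idx - left)
    end = min(n_sentences - 1, explicit_idx + right)
    # The explicit citation sentence itself is not a candidate.
    return [i for i in range(start, end + 1) if i != explicit_idx]


print(candidate_indices(explicit_idx=5, n_sentences=12))
# prints: [3, 4, 6, 7, 8, 9, 10, 11]
```

Each returned index would then be tested as a possible implicit citation sentence.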
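| Document vector representation model | Abbreviation | Abstract R/% | Abstract P/% | Abstract F1/% | Full text R/% | Full text P/% | Full text F1/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Traditional vector space model | TFIDF-VSM | 69.33 | 97.25 | 80.95 | 57.11 | 100.00 | 72.70 |
| Doc2Vec model | PV-DBOW | 63.06 | 84.96 | 72.39 | 54.28 | 97.66 | 69.77 |
| Document vector model based on TF or TF-IDF weights and word vectors | TF-AWV | 79.78 | 96.52 | 87.36 | 59.20 | 99.25 | 74.16 |
| Document vector model based on TF or TF-IDF weights and word vectors | TFIDF-AWV | 80.32 | 99.43 | 88.86 | 70.82 | 98.57 | 82.42 |
| Vector space model based on TF-IDF weights and word vectors | PTFIDF-VSM | 73.62 | 98.76 | 84.36 | 70.65 | 100.00 | 82.80 |

Table 4  Performance of implicit citation sentence identification based on various document representation models (papers represented by abstract or by full text)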
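The F1 column in Table 4 is consistent with the standard harmonic mean of precision and recall. A minimal check is shown below, where the helper name `f1_score` is our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TFIDF-VSM row of Table 4, papers represented by abstracts: P = 97.25, R = 69.33.
print(round(f1_score(97.25, 69.33), 2))
# prints: 80.95
```

The same formula reproduces the other rows, for example P = 99.43 and R = 80.32 for TFIDF-AWV give about 88.86.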
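| Combination mode | Abstract P/% | Abstract R/% | Abstract F1/% | Full text P/% | Full text R/% | Full text F1/% |
| --- | --- | --- | --- | --- | --- | --- |
| TFIDF-AWV + PTFIDF-VSM | 99.45 | 90.51 | 94.77 | 99.05 | 87.63 | 92.99 |
| PTFIDF-VSM + TFIDF-AWV | 98.90 | 90.10 | 94.30 | 99.52 | 88.04 | 93.43 |
| TFIDF-AWV + PV-DBOW | 96.67 | 89.54 | 92.97 | 97.92 | 80.73 | 88.49 |
| TFIDF-AWV + TFIDF-VSM | 98.47 | 87.78 | 92.82 | 98.76 | 80.49 | 88.69 |

Table 5  Performance of implicit citation sentence identification based on different combination modes (papers represented by abstract or by full text)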
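| Field of citing papers | Citing papers | Explicit citation sentences | Implicit citation sentences | P/% | R/% | F1/% |
| --- | --- | --- | --- | --- | --- | --- |
| Computer science | 89 | 118 | 214 | 89.5 | 97.6 | 93.4 |
| Engineering | 65 | 86 | 136 | 91.0 | 95.9 | 93.4 |
| Physics | 25 | 40 | 53 | 89.2 | 97.8 | 93.3 |
| Medicine | 24 | 31 | 54 | 89.8 | 92.9 | 91.3 |
| Other | 22 | 32 | 48 | 80.9 | 95.0 | 87.4 |
| Total | 225 | 307 | 505 | - | - | - |
| Average | - | - | - | 89.0 | 96.4 | 92.6 |

Table 6  Implicit citation sentence identification results for the highly cited paper on deep neural networks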
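| Field of citing papers | Citing papers | Explicit citation sentences | Implicit citation sentences | P/% | R/% | F1/% |
| --- | --- | --- | --- | --- | --- | --- |
| Computer science | 92 | 146 | 253 | 97.3 | 88.7 | 92.8 |
| Engineering | 39 | 58 | 89 | 96.3 | 87.8 | 91.8 |
| Management | 28 | 41 | 82 | 95.5 | 87.0 | 91.1 |
| Medicine | 13 | 24 | 45 | 95.3 | 88.4 | 91.7 |
| Business | 10 | 11 | 22 | 100.0 | 90.9 | 95.2 |
| Other | 25 | 47 | 90 | 96.3 | 88.1 | 92.0 |
| Total | 207 | 327 | 581 | - | - | - |
| Average | - | - | - | 96.6 | 88.3 | 92.3 |

Table 7  Implicit citation sentence identification results for the highly cited paper on the LDA topic model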
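| Citation sentence type | "Background" count | "Background" share/% | "Use" count | "Use" share/% | "Based on" count | "Based on" share/% | "Comparison" count | "Comparison" share/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Explicit citation sentences | 1,223 | 75.4 | 306 | 18.9 | 60 | 3.6 | 33 | 2.3 |
| Implicit citation sentences | 2,762 | 77.5 | 755 | 21.2 | 12 | 0.3 | 34 | 0.9 |

Table 8  Distribution of citation functions in explicit and implicit citation sentences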
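| Citation sentence type | Positive count | Positive share/% | Negative count | Negative share/% | Neutral count | Neutral share/% |
| --- | --- | --- | --- | --- | --- | --- |
| Explicit citation sentences | 734 | 45.3 | 83 | 5.1 | 805 | 49.6 |
| Implicit citation sentences | 546 | 15.3 | 208 | 5.8 | 2,809 | 78.8 |

Table 9  Distribution of citation sentiments in explicit and implicit citation sentences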
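The percentage columns in Table 9 follow directly from the counts, as the short check below shows. The helper name `shares` is our own, and the counts are copied from the table.

```python
def shares(counts):
    """Percentage share of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from Table 9.
explicit = {"positive": 734, "negative": 83, "neutral": 805}
implicit = {"positive": 546, "negative": 208, "neutral": 2809}

print(shares(explicit))
# prints: {'positive': 45.3, 'negative': 5.1, 'neutral': 49.6}
print(shares(implicit))
# prints: {'positive': 15.3, 'negative': 5.8, 'neutral': 78.8}
```

These are the 45.3% (positive explicit) and 78.8% (neutral implicit) figures quoted in the abstract.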