Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (8): 63-74    DOI: 10.11925/infotech.2096-3467.2020.0124
Predicting Social Media Visibility of Scholarly Articles
Li Gang, Guan Weidong, Ma Yaxue, Mao Jin
Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study predicts the visibility of research papers on Twitter from their multidimensional features, aiming to identify the factors that most affect social media visibility. [Methods] First, we determined each paper’s social media visibility from its total mentions on Twitter and extracted features from paper content, authorship and publishing journals. Then, we built a binary classification model to predict each paper’s Twitter visibility. Finally, we tested the model on diabetes papers to evaluate the performance of different algorithms and the importance of each feature. [Results] LightGBM performed best, with an accuracy of 0.70. Features from content, authorship and publishing journals all influenced an article’s visibility on social media, and a journal’s annual average impact factor was the most important. [Limitations] We only examined the visibility of diabetes-related papers on Twitter. [Conclusions] Ensemble learning algorithms are effective for predicting the social media visibility of scholarly articles, and features of the publishing journal are the key factors.

Key words: Scientific Paper; Social Media; Visibility Prediction; Feature Importance
Received: 21 February 2020      Published: 14 September 2020
ZTFLH:  G353  
Corresponding Authors: Ma Yaxue     E-mail: myx_vicky@whu.edu.cn

Cite this article:

Li Gang, Guan Weidong, Ma Yaxue, Mao Jin. Predicting Social Media Visibility of Scholarly Articles. Data Analysis and Knowledge Discovery, 2020, 4(8): 63-74.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0124     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I8/63

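Feature | Source and computation
Topic category | An LDA topic model is built from each paper's title, abstract, keywords and other text, and each paper is assigned a topic number
Web of Science category | Web of Science metadata
Language | Web of Science metadata
Document type | Web of Science metadata
Open access status | Web of Science metadata
Paper length (pages) | Web of Science metadata
Number of keywords | Count of the keywords in the keyword list
Number of funding grants | Count of the funding agencies and grant numbers in the funding list
Publication age (months) | Time span from the paper's official publication year and month to August 2019 (if the publication month is missing, January of the publication year is assumed)
Usage count (2013 to present) | Web of Science metadata
Times cited (WoS Core Collection) | Web of Science metadata
Paper-related Features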
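The two computed features in this table (publication age and number of keywords) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the semicolon keyword separator is an assumption about the metadata format, and counting whole calendar months with the cutoff month as zero is one possible reading of "time span".

```python
from datetime import date
from typing import Optional

# Reference month stated in the feature table: August 2019.
CUTOFF = date(2019, 8, 1)


def publication_age_months(year: int, month: Optional[int] = None) -> int:
    """Whole months from the publication month to August 2019.

    A missing month is treated as January of the publication year,
    as described in the feature table.
    """
    if month is None:
        month = 1
    return (CUTOFF.year - year) * 12 + (CUTOFF.month - month)


def keyword_count(keyword_field: str, sep: str = ";") -> int:
    """Number of non-empty keywords in a delimited keyword string.

    The separator is an assumption; adjust it to the actual export format.
    """
    return len([k for k in keyword_field.split(sep) if k.strip()])


if __name__ == "__main__":
    print(publication_age_months(2018))        # month missing, January assumed
    print(publication_age_months(2015, 12))
    print(keyword_count("diabetes; insulin; HbA1c"))
```

Whether the end month should be counted inclusively is not stated in the table, so the result may differ by one month from the authors' values.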
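Feature | Source and computation
First author's H-index | Within the dataset, the citation counts of all papers by each author are sorted in descending order to compute each author's H-index; the values are then mapped to each paper to obtain the H-index of the first author, the H-index of the corresponding author, and the team's average H-index
Corresponding author's H-index | Same as above
Team's average H-index | Same as above
First author's publication count | Within the dataset, the number of papers by each author is counted and mapped to each paper to obtain the publication count of the first author, that of the corresponding author, and the team's average publication count
Corresponding author's publication count | Same as above
Team's average publication count | Same as above
First author's citations | Within the dataset, the citations of all papers by each author are summed and mapped to each paper to obtain the citations of the first author, those of the corresponding author, and the team's average citations
Corresponding author's citations | Same as above
Team's average citations | Same as above
Number of authors | Parsed from Web of Science metadata
Number of author institutions | Parsed from Web of Science metadata
Number of author countries | Parsed from Web of Science metadata
Author-related Features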
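The within-dataset author indicators described above can be sketched as below. This is an illustrative reading of the table, not the authors' implementation: the record layout (`authors`, `cites`) and the function names are hypothetical, and author names are assumed to be already disambiguated.

```python
from collections import defaultdict


def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def author_stats(papers):
    """Per-author H-index, publication count and total citations in the dataset.

    `papers` is a list of dicts with keys 'authors' (list of names)
    and 'cites' (citation count); this layout is an assumption.
    """
    per_author = defaultdict(list)
    for paper in papers:
        for author in paper["authors"]:
            per_author[author].append(paper["cites"])
    return {
        author: {"h_index": h_index(c), "papers": len(c), "citations": sum(c)}
        for author, c in per_author.items()
    }


def team_mean(paper, stats, key):
    """Average of one indicator over all authors of a paper."""
    values = [stats[a][key] for a in paper["authors"]]
    return sum(values) / len(values)


if __name__ == "__main__":
    papers = [
        {"authors": ["A", "B"], "cites": 10},
        {"authors": ["A"], "cites": 3},
        {"authors": ["A", "C"], "cites": 1},
    ]
    stats = author_stats(papers)
    print(stats["A"])
    print(team_mean(papers[0], stats, "h_index"))
```

First-author and corresponding-author features would then be lookups in `stats` for the relevant name on each paper.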
Author Disambiguation Process
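Feature | Source and computation
Journal annual average total cites | Mean of the journal's Total Cites indicator across the yearly JCR editions
Journal annual average impact factor | Mean of the journal's Impact Factor indicator across the yearly JCR editions
Journal annual average Eigenfactor score | Mean of the journal's Eigenfactor Score indicator across the yearly JCR editions
Journal-related Features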
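Each journal feature is a plain average of one JCR indicator over the available years. A minimal sketch follows; the function name and input layout are hypothetical, and skipping years without a JCR entry is an assumption, since the table does not say how gaps are handled.

```python
from statistics import mean


def yearly_average(indicator_by_year):
    """Mean of a JCR indicator over the years for which a value exists.

    `indicator_by_year` maps year -> value, with None for a missing year.
    Returns None when no year has a value.
    """
    values = [v for v in indicator_by_year.values() if v is not None]
    return mean(values) if values else None


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    print(yearly_average({2016: 2.0, 2017: 3.0, 2018: None}))
```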
Overall Process of Classification Model
Papers | Journals | Languages | Document types | Open access statuses | WoS categories | Topic categories
119,334 | 4,753 | 24 | 3 | 6 | 182 | 20
Descriptive Statistics of Diabetes Mellitus Paper Data Set
Topic Distribution of Diabetes Mellitus Papers
Category | Papers | Share
Mentioned | 60,898 | 51%
Not mentioned | 58,436 | 49%
Total | 119,334 | 100%
Distribution of Visibility on Twitter of the Diabetes Mellitus Papers
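The binary target behind this table can be sketched as follows. The threshold of at least one Twitter mention is an assumption inferred from the "Mentioned" and "Not mentioned" categories; the function names are hypothetical.

```python
def visibility_label(twitter_mentions: int) -> int:
    """1 if the paper was mentioned on Twitter at least once, else 0.

    The 'at least one mention' threshold is an assumption based on
    the category names in the table.
    """
    return 1 if twitter_mentions > 0 else 0


def class_shares(n_positive: int, n_negative: int):
    """Share of each class, as fractions of the whole dataset."""
    total = n_positive + n_negative
    return n_positive / total, n_negative / total


if __name__ == "__main__":
    # Counts from the table: 60,898 mentioned and 58,436 not mentioned.
    mentioned, not_mentioned = class_shares(60898, 58436)
    print(f"{mentioned:.0%} mentioned, {not_mentioned:.0%} not mentioned")
```

With classes this close to balanced, a classifier that always predicts "mentioned" would already reach an accuracy of about 0.51, which is a useful baseline when reading the reported accuracies.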
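Feature | Visible: Mean | Visible: Median | Visible: SD | Not visible: Mean | Not visible: Median | Not visible: SD
Paper length (pages) | 9.72 | 9 | 10.186 | 8.20 | 8 | 4.012
Number of keywords | 3.36 | 4 | 2.858 | 3.74 | 4 | 2.470
Number of funding grants | 2.36 | 1 | 4.053 | 1.50 | 1 | 2.535
Publication age | 44.16 | 43 | 22.708 | 47.87 | 47 | 24.357
Usage count | 11.55 | 6 | 26.539 | 7.29 | 4 | 13.261
Times cited | 19.38 | 8 | 56.911 | 9.32 | 5 | 26.009
Statistical Indicators of Paper-related Features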
Topic Distribution of Diabetes Mellitus Papers Visible on Twitter
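Feature | Visible: Mean | Visible: Median | Visible: SD | Not visible: Mean | Not visible: Median | Not visible: SD
Number of authors | 7.19 | 6 | 19.821 | 6.22 | 6 | 3.909
Number of author countries | 1.50 | 1 | 1.438 | 1.26 | 1 | 0.746
Number of author institutions | 4.43 | 3 | 11.478 | 3.39 | 3 | 2.648
Team's average H-index | 2.99 | 2 | 2.676 | 2.35 | 2 | 2.196
Team's average citations | 115.39 | 37 | 240.615 | 63.11 | 18 | 152.451
Team's average publication count | 4.32 | 3 | 4.823 | 3.52 | 2 | 4.172
First author's H-index | 2.58 | 1 | 3.117 | 2.08 | 1 | 2.487
First author's citations | 84.64 | 18 | 286.605 | 46.58 | 10 | 186.864
First author's publication count | 3.57 | 2 | 5.696 | 3.01 | 1 | 4.892
Corresponding author's H-index | 3.62 | 2 | 4.266 | 2.89 | 2 | 3.456
Corresponding author's citations | 142.80 | 28 | 421.363 | 82.61 | 15 | 271.833
Corresponding author's publication count | 5.46 | 2 | 8.668 | 4.57 | 2 | 7.543
Statistical Indicators of Author-related Features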
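Feature | Visible: Mean | Visible: Median | Visible: SD | Not visible: Mean | Not visible: Median | Not visible: SD
Journal annual average total cites | 35,646.31 | 7,199 | 88,380.092 | 21,855.02 | 3,278 | 72,456.536
Journal annual average impact factor | 4.79 | 3.188 | 5.762 | 2.63 | 2.398 | 2.237
Journal annual average Eigenfactor score | 0.10 | 0.0164 | 0.299 | 0.06 | 0.00707 | 0.263
Statistical Indicators of Journal-related Features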
Top 10 Journals with the Highest Amount of Diabetes Mellitus Papers Visible on Twitter
Mentioned papers per journal | Journals | Total mentioned papers
1 to 10 | 2,897 | 9,624
11 to 100 | 918 | 25,916
101 to 1,000 | 88 | 18,859
More than 1,000 | 4 | 6,499
Distribution of the Amount of Diabetes Mellitus Papers Visible on Twitter of Journals
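Algorithm | Accuracy | Precision | Recall | F1
LightGBM | 0.70 | 0.72 | 0.68 | 0.70
Random forest | 0.69 | 0.71 | 0.68 | 0.70
AdaBoost | 0.68 | 0.69 | 0.68 | 0.69
Support vector machine | 0.68 | 0.71 | 0.66 | 0.68
Logistic regression | 0.67 | 0.69 | 0.66 | 0.67
Artificial neural network | 0.65 | 0.61 | 0.99 | 0.67
Naive Bayes | 0.54 | 0.53 | 0.96 | 0.68
Social Media Visibility Prediction for Diabetes Mellitus Papers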
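The four columns follow the standard definitions for binary classification. The sketch below computes them from confusion-matrix counts; it is a generic illustration with a hypothetical function name, and the counts in the usage example are invented for demonstration rather than taken from the paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive ('visible') class.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. Undefined ratios are returned as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented counts for illustration only.
    print(classification_metrics(tp=3, fp=1, fn=1, tn=5))
    # A model that labels every paper 'visible' on a 51/49 split:
    print(classification_metrics(tp=51, fp=49, fn=0, tn=0))
```

The second call shows why recall and F1 alone can flatter a model: predicting every paper as visible gives a recall of 1.0 and an F1 of about 0.68, while precision and accuracy stay at the class share of 0.51. Rows that combine very high recall with low precision are therefore best judged by accuracy and precision.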
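Feature | Importance
Journal annual average impact factor | 0.074
Publication age | 0.061
Journal annual average Eigenfactor score | 0.055
Journal annual average total cites | 0.052
Team's average citations | 0.047
Usage count | 0.047
Times cited | 0.043
Corresponding author's citations | 0.041
First author's citations | 0.040
Paper length (pages) | 0.040
Feature Importance of Scientific Papers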
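To compare the three feature groups, the listed importances can be summed and averaged per group. The sketch below uses only the ten values shown in the table, with each feature assigned to the group of the feature table it appears in; features outside the top ten are not available here, so the group totals are partial.

```python
from collections import defaultdict

# (group, feature, importance) for the ten features listed in the table.
TOP10 = [
    ("journal", "Journal annual average impact factor", 0.074),
    ("paper", "Publication age", 0.061),
    ("journal", "Journal annual average Eigenfactor score", 0.055),
    ("journal", "Journal annual average total cites", 0.052),
    ("author", "Team's average citations", 0.047),
    ("paper", "Usage count", 0.047),
    ("paper", "Times cited", 0.043),
    ("author", "Corresponding author's citations", 0.041),
    ("author", "First author's citations", 0.040),
    ("paper", "Paper length (pages)", 0.040),
]


def summarize_by_group(rows):
    """Count, sum and mean of importance per feature group."""
    grouped = defaultdict(list)
    for group, _name, importance in rows:
        grouped[group].append(importance)
    return {
        group: {
            "count": len(vals),
            "sum": round(sum(vals), 3),
            "mean": round(sum(vals) / len(vals), 4),
        }
        for group, vals in grouped.items()
    }


if __name__ == "__main__":
    for group, summary in summarize_by_group(TOP10).items():
        print(group, summary)
```

Among these ten features, the three journal features have the highest mean importance per feature, while the four paper features have the largest sum simply because there are more of them.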