Detecting Research Frontiers Based on Twitter*

doi:10.11925/infotech.2096-3467.2022.0111

Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue 1: 89-101. https://doi.org/10.11925/infotech.2096-3467.2022.0111

Research Article


Wuxihong Jiangbulati¹,², Wang Xiaomei¹, Chen Ting¹,³,⁴

¹Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
²School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China
³National Science Library, Chinese Academy of Sciences, Beijing 100190, China
⁴Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China


Abstract

[Objective] This paper designs a Twitter-based method for identifying research frontiers in a discipline, aiming to detect the latest developments of a field early. [Methods] We first analyze the principles by which Twitter can reveal research frontiers in a discipline. We then propose a monitoring index system based on scholar influence and content influence and use it to detect research frontiers. Finally, we conduct an empirical analysis in the field of natural language processing (NLP). [Results] Compared against analysis reports by leading NLP experts, the detection model identified 8 of the 13 research frontiers in NLP in a timely manner. [Limitations] Because social media is open, it is difficult to completely exclude noise unrelated to the discipline when constructing the dataset. [Conclusions] The proposed method draws on content generated by scholars on Twitter and can identify frontier developments in a discipline in a timely, forward-looking way. It is a feasible and effective approach to detecting research frontiers.

Keywords: Research Frontiers; Social Media; Twitter; Domain Dataset Construction

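Received: 2022-02-13; Published: 2023-02-16

Chinese Library Classification (ZTFLH): G350

Funding: *This work is one of the research outcomes of the Literature and Information Capacity Building Project of the Chinese Academy of Sciences (GHJ-QBZX-2021-04).

Corresponding author: Wang Xiaomei, ORCID: 0000-0002-9895-1511, E-mail: wangxm@casisd.cn.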

Cite this article:

Wuxihong Jiangbulati, Wang Xiaomei, Chen Ting. Detecting Research Frontiers Based on Twitter[J]. Data Analysis and Knowledge Discovery, 2023, 7(1): 89-101.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0111 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I1/89

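Fig.1 Framework for identifying research frontiers in a discipline based on Twitter

Table 1 Index system for detecting research frontiers in a discipline on Twitter

Table 2 Judgment matrix and consistency test results for the content influence indicators

Table 3 Judgment matrix and consistency test results for the scholar influence indicators

Table 4 Judgment matrix and consistency test results for the recognition indicators

Table 5 Detection model for research frontiers in a discipline

Fig.2 Monthly distribution of the number of tweets in the NLP field, 2020-2021

Table 6 Score statistics for the constituent indicators of research frontiers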

No.	Posting date	Emerging research content	Scholar influence score	Content influence score	Research frontier score
1	2020-7-13	This is mind blowing. With GPT-3, I built a layout generator where you just describe any layout you want, and it generates the JSX code for you.	2.09	3.48	2.97
2	2021-6-29	Meet GitHub Copilot - your AI pair programmer. Powered by OpenAI Codex: a large neural network that can code pretty well.	2.09	3.43	2.94
3	2021-12-8	The ongoing consolidation in AI is incredible. Thread: When I started decade ago vision, speech, natural language, reinforcement learning, etc. were completely separate; You couldn't read papers across areas - the approaches were completely different, often not even ML based. Every ML model is converging into a Transformer that can basically be defined in 200 lines of PyTorch code. This is a great thread, Models designed to generate words (transformers) & model language (BERT) were reused in #AlphaFold to solve the protein folding problem, mapping a bunch of letters, to 3D coordinates.	2.29	2.80	2.61
4	2021-11-18	Our new AI system learned speech recognition in English with zero speech to text training data: researchers just gave it lots of audio, and it figured out what the words were. But it goes way beyond that - it learned Swahili too! Wav2vec enables AI systems learn a language based on audio recordings with no matching text - as we've said before it's a game changer for building speech AI that works in all languages, not just the dominant ones.	2.35	2.51	2.45
5	2021-9-17	New benchmark testing if models like GPT3 are truthful (= avoid generating false answers). We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse!	2.29	2.36	2.34
6	2020-8-3	Why You Should Do NLP Beyond English: 7000+ languages are spoken around the world but NLP research has mostly focused on English. In this post, I give an overview of why you should work on languages other than English.	2.23	2.37	2.32
7	2021-1-6	We've developed two neural networks which have learned by associating text and images. CLIP maps images into categories described in text, and DALL-E creates new images. A step toward systems with deeper understanding of the world. @OpenAI is exploring the multimodal direction and discover how far we push the ability to learn vision from language supervision in massive data+compute scenarios! CLIP: maps images to categories by taking class names as inputs; beats the original RN50 on ImageNet zero-shot(!), while being far more robust on unusual images; DALL-E: text2im that works for a wide variety of sentences	2.01	2.54	2.35
8	2020-2-11	Microsoft researchers and engineers release Zero Redundancy Optimizer (ZeRO) and DeepSpeed library, a system able to train 100-billion-parameter deep learning models. Learn about this breakthrough and how it led to Turing Natural Language Generation.	2.11	2.37	2.28
9	2021-8-20	We use big language models to synthesize computer programs, execute programs, solve math problems, and dialog with humans to iteratively refine code. The models can solve 60% and 81% of the programming and math problems, respectively.	2.33	2.22	2.26
10	2021-9-10	We're introducing GSLM, the first language model that breaks free completely of the dependence on text for training. This "textless NLP" approach learns to generate expressive speech using only raw audio recordings as input. There is lot more to natural languages than text: tone, accent, expression, prosody, timbre, pitch..... Textless NLP represents speech through a stream of discrete tokens, automatically learned through self-supervised learning, directly fed with raw speech waveform! A new era.	2.35	2.19	2.25
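
The research frontier score in the table above looks like a weighted combination of the scholar influence and content influence scores, but this page does not state the weights. The following is a minimal sketch, not the authors' model: it assumes a linear form with no intercept (score ≈ a × scholar + b × content) and estimates `a` and `b` by ordinary least squares from the ten rows shown. The function names `fit_weights` and `max_abs_residual` are hypothetical and introduced only for illustration.

```python
# Rows copied from the score table above:
# (scholar influence score, content influence score, research frontier score).
ROWS = [
    (2.09, 3.48, 2.97),
    (2.09, 3.43, 2.94),
    (2.29, 2.80, 2.61),
    (2.35, 2.51, 2.45),
    (2.29, 2.36, 2.34),
    (2.23, 2.37, 2.32),
    (2.01, 2.54, 2.35),
    (2.11, 2.37, 2.28),
    (2.33, 2.22, 2.26),
    (2.35, 2.19, 2.25),
]


def fit_weights(rows):
    """Least-squares fit of score ~ a * scholar + b * content (no intercept).

    Assumption: the linear, intercept-free form is a guess made for this
    sketch; it is not taken from the paper.
    """
    sxx = sum(x * x for x, _, _ in rows)
    syy = sum(y * y for _, y, _ in rows)
    sxy = sum(x * y for x, y, _ in rows)
    sxs = sum(x * s for x, _, s in rows)
    sys_ = sum(y * s for _, y, s in rows)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sxx * syy - sxy * sxy
    a = (sxs * syy - sxy * sys_) / det
    b = (sxx * sys_ - sxy * sxs) / det
    return a, b


def max_abs_residual(rows, a, b):
    """Largest absolute gap between the reported and the fitted score."""
    return max(abs(s - (a * x + b * y)) for x, y, s in rows)


a, b = fit_weights(ROWS)
worst = max_abs_residual(ROWS, a, b)
print(f"scholar weight ~ {a:.3f}, content weight ~ {b:.3f}, max residual {worst:.4f}")
```

On these ten rows the fitted weights come out at roughly 0.37 for scholar influence and 0.63 for content influence, and every reported score is reproduced to within rounding error. This only shows that the table is consistent with such a weighting. The weights actually used are defined by the paper's detection model, which is not reproduced on this page.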

Table 7 Identification results for research frontiers in the discipline

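No.	Interpretation	Publisher	Non-literature form
1	Application of GPT-3 to text generation	OpenAI researcher	√
2	Code generation: Codex integrated into GitHub Copilot for the first time	OpenAI researcher	√
3	Trend toward general-purpose AI models that generalize across subfields	Director of AI at Tesla	√
4	Wav2vec-U: a speech recognition model that works across languages and needs no transcribed speech data	Meta Chief Technology Officer	
5	TruthfulQA: testing how well language models answer open-ended questions	University of Oxford researcher	
6	Call for attention to applying NLP models across many languages	Google AI researcher	√
7	Multimodal text and image neural networks CLIP & DALL·E, used for text-to-image generation	OpenAI	
8	ZeRO & DeepSpeed open-source library: a system able to train deep learning models with 100 billion parameters	Microsoft Research	√
9	Program synthesis with large language models	Google Brain researcher	
10	GSLM, a new way of training language models that starts from speech and needs no labels or large-scale data, so that every language can benefit from large language models	Meta AI	
11	Announcement of the lab's new research focus: developing tools that support collaborative building of large models	Professor, University of North Carolina	√
12	Using human preferences instead of automatic evaluation metrics (such as ROUGE and BLEU) as the training objective, with human feedback as the reward for reinforcement learning, outperforming humans across the board on text summarization	OpenAI	
13	Testing GPT-3 and GPT-Neo on programming tasks	Hugging Face researcher	
14	BioMed Explorer: application of NLP models in biology	Google AI (Research and Health) researcher	√
15	The current state and future of large language models	Stanford University researcher	
16	nlp open-source library: provides corpus management and evaluation functions	Hugging Face researcher	√
17	T0: by combining prompting with multitask learning, outperforms GPT-3 in zero-shot tests on multiple downstream tasks	Big Science	
18	Large-scale language models are still progressing and their capabilities continue to grow	DeepMind	
19	REALM: a new paradigm for language model pre-training that augments pre-trained language models with a knowledge retriever	Google AI	
20	Call to focus on underlying linguistic phenomena rather than only on improving algorithms and models	Professor, Bar-Ilan University	√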

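Table 8 Interpretation of the content of research frontiers in the discipline

Table 9 Matching between machine learning and NLP research frontiers and the results identified in this paper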

