Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 38-45    DOI: 10.11925/infotech.2096-3467.2017.04.05
Recommending Scientific Research Collaborators with Link Prediction and Extremely Randomized Trees Algorithm
Lv Weimin1,2, Wang Xiaomei3, Han Tao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper proposes a method for recommending scientific research collaborators based on link prediction and machine learning, which improves on the precision of traditional methods. [Methods] First, we used link prediction indices to build the feature input and adopted the Extremely Randomized Trees (ET) algorithm to train the classifier. Then, we used a traversal algorithm to obtain the optimal weight combination for linearly combining the classification results. Finally, we obtained the best recommendation of collaborators. [Results] The improved ET method performed better than existing methods in recommending collaboration cities. In addition, the proposed method was less affected by factors such as network structure and could be applied more widely. [Limitations] Scientific research collaboration is affected by cooperation motivation, geography, language, and many other factors. The weighted author network did not examine authors from the same cities or the same organizations. [Conclusions] The proposed method could produce better recommendation results, which might help universities, institutions, and individuals identify academic collaborators.

Key words: Scientific Research Collaboration Network      Link Prediction      Machine Learning      Random Forest      Extremely Randomized Trees      Recommendation
Received: 16 January 2017      Published: 24 May 2017
Lv Weimin, Wang Xiaomei, Han Tao. Recommending Scientific Research Collaborators with Link Prediction and Extremely Randomized Trees Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(4): 38-45.

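Main method      Representative research
Index weighting      Guns[18] took the college collaboration network of Andrew University (安德鲁大学) and the collaboration network of the informetrics field as examples, and found that weighted link prediction indices predict better than unweighted ones.
Time-series analysis      Tylenda et al.[19] considered the effect of temporal evolution on prediction results. Building on the local probabilistic model proposed by Wang et al.[20], they derived a maximum-entropy method that accounts for time information, incorporating the time elapsed since authors a and b last collaborated into the weighted link prediction index, which raised the success rate of link prediction.
Comparison of networks at different levels      Yan et al.[13] built collaboration networks at the author, institution, and country levels and compared the prediction results of the three levels under 8 independent prediction indices. They found that the higher the level, the higher the prediction precision: the country level outperformed the institution level, which outperformed the individual level.
Weighted network      Liben-Nowell et al.[16] proposed that, in addition to network topology features, paper titles, author affiliations, and geographic location information can be added to the computation to fine-tune link prediction methods. In practice, Guns[21] represented this information as networks at different levels and proposed a Multi-Input method, which builds an author collaboration network, a department network, and a physical location network, and linearly weights the three sub-networks to form the training set.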
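The link prediction indices used as feature input are computed from network topology. The following is a minimal sketch, assuming the standard unweighted definitions of Common Neighbors (CN), Adamic-Adar (AA), and Resource Allocation (RA); the toy edge list and the function names `build_neighbors` and `index_features` are hypothetical and are not taken from the paper, whose indices are weighted.

```python
import math
from itertools import combinations

# Toy undirected coauthorship network (hypothetical data, not from the paper).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

def build_neighbors(edges):
    """Map each node to the set of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs

def index_features(nbrs, x, y):
    """Return [CN, AA, RA] for the node pair (x, y), unweighted definitions."""
    common = nbrs[x] & nbrs[y]
    cn = len(common)
    # A common neighbor of x and y has degree >= 2, so log(degree) > 0.
    aa = sum(1.0 / math.log(len(nbrs[z])) for z in common)
    ra = sum(1.0 / len(nbrs[z]) for z in common)
    return [cn, aa, ra]

nbrs = build_neighbors(edges)
existing = {frozenset(e) for e in edges}
# Candidate pairs are all node pairs that are not yet connected.
candidates = [p for p in combinations(sorted(nbrs), 2)
              if frozenset(p) not in existing]
# One feature vector per candidate pair, usable as classifier input.
features = {p: index_features(nbrs, *p) for p in candidates}
```

Each feature vector could then be passed to a tree-based classifier together with a label stating whether the pair collaborated in a later period.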
Data description      2008      2009      2010
Number of papers      120,027      139,810      148,426
Number of nodes      4,638      5,088      5,400
Number of edges      39,712      47,689      53,073
Precision      Index (Weighted)
AA      CN      GD      Katz      RA      SimRank      RF
$n=5$      80%      80%      80%      60%      80%      60%      60%      60%
$n=10$      80%      80%      90%      80%      90%      40%      60%      80%
$n=20$      85%      80%      90%      80%      85%      30%      62%      80%
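The precision values at $n=5$, $n=10$, and $n=20$ can be read as the share of the top $n$ recommended pairs that turn out to be real collaborations. The following is a minimal sketch under that assumption; the function `precision_at_n` and the sample scores and links are hypothetical and only illustrate the calculation.

```python
def precision_at_n(scores, actual_links, n):
    """Share of the n highest-scored candidate pairs that are actual links."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    hits = sum(1 for pair in ranked if pair in actual_links)
    return hits / n

# Hypothetical scores for candidate pairs and hypothetical later collaborations.
scores = {
    ("a", "d"): 0.9,
    ("b", "e"): 0.7,
    ("c", "e"): 0.4,
    ("a", "e"): 0.1,
    ("b", "f"): 0.05,
}
actual = {("a", "d"), ("c", "e"), ("b", "f")}

p2 = precision_at_n(scores, actual, 2)  # top 2: one hit, one miss
p5 = precision_at_n(scores, actual, 5)  # top 5: three hits
```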
Weights [AA, CN, GD, Katz, RA]      $n=5$      $n=10$      $n=20$
[0.0, 0.0, 1.0, 0.0, 0.0]      100%      97%      85%
[0.05, 0.0, 0.85, 0.0, 0.1]      100%      90%      90%
[0.0, 0.05, 0.85, 0.0, 0.1]      100%      90%      90%
[0.0, 0.0, 0.9, 0.0, 0.1]      100%      90%      90%
[0.0, 0.0, 0.85, 0.05, 0.1]      96%      90%      90%
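The weight vectors above are all multiples of 0.05 and sum to 1, which is consistent with an exhaustive traversal of a weight grid. The following is a minimal sketch of such a traversal, assuming a step of 0.05 and precision at $n$ as the selection criterion; the function names, the two-index toy scores, and the tie-breaking rule (the first best combination is kept) are hypothetical and not taken from the paper.

```python
from itertools import product

def weight_grid(k, steps=20):
    """Yield all k-tuples of non-negative multiples of 1/steps that sum to 1."""
    for combo in product(range(steps + 1), repeat=k - 1):
        rest = steps - sum(combo)
        if rest >= 0:
            yield tuple(c / steps for c in combo) + (rest / steps,)

def combine(weights, per_index_scores):
    """Linearly combine the per-index scores of every candidate pair."""
    return {pair: sum(w * s for w, s in zip(weights, vals))
            for pair, vals in per_index_scores.items()}

def precision_at_n(scores, actual_links, n):
    """Share of the n highest-scored candidate pairs that are actual links."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    return sum(1 for pair in ranked if pair in actual_links) / n

def best_weights(per_index_scores, actual_links, n, k, steps=20):
    """Traverse the grid and keep the first combination with the top precision."""
    best_p, best_w = -1.0, None
    for w in weight_grid(k, steps):
        p = precision_at_n(combine(w, per_index_scores), actual_links, n)
        if p > best_p:
            best_p, best_w = p, w
    return best_p, best_w

# Hypothetical scores from two indices for four candidate pairs.
per_index_scores = {
    ("a", "d"): (0.9, 0.3),
    ("b", "e"): (0.3, 0.9),
    ("c", "e"): (0.8, 0.1),
    ("a", "e"): (0.1, 0.8),
}
actual = {("a", "d"), ("b", "e")}

best_p, best_w = best_weights(per_index_scores, actual, n=2, k=2)
```

In this toy case each index alone reaches a precision of 0.5 at $n=2$, while a mixed weighting ranks both real collaborations first. With five indices and a step of 0.05 the grid holds 10,626 combinations, so a full traversal stays cheap.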
