中图分类号: TP391
通讯作者:
收稿日期: 2017-09-22
修回日期: 2017-10-31
网络出版日期: 2018-03-25
版权声明: 2018 《数据分析与知识发现》编辑部 《数据分析与知识发现》编辑部
基金资助:
展开
摘要
【目的】减少标签传播算法的无效更新、解决算法准确率低的问题。【方法】引入节点信息列表以指导更新过程, 避免不必要的更新, 从而加快执行速度; 采取基于节点对社区偏向程度的更新规则, 提高社区划分的准确率。【结果】实验结果表明, 相比标签传播算法和两种较好的改进算法, 本文提出的基于速度优化和社区偏向的标签传播算法在较大规模网络上的迭代次数减少了几十倍, 在真实网络数据集的模块度相对较高, 在LFR基准网络数据集的归一化互信息值和F-measure值分别有明显提高。【局限】更新顺序具有随机性, 需进一步研究。【结论】本文算法在提高执行速度的基础上, 提高了社区发现的准确率。
关键词:
Abstract
[Objective] This paper aims to reduce the unnecessary updates and improve the accuracy of Label Propagation Algorithm. [Methods] First, we used the node information list to direct the update process and increase the execution speed. Then, we proposed new updating rules based on the node preference to improve the accuracy of community detection. [Results] Compared with the classic label propagation algorithm and two improved algorithms, the proposed one significantly reduced the number of iterations on large-scale social networks, as well as improved the value of Normalized Mutual Information and F-measure of LFR benchmark network. [Limitations] The new algorithm’s updating sequence is random, which needs to be investigated in further studies. [Conclusions] The SOCP_LPA improves the accuracy of community detection and the processing speed.
Keywords:
自然界与人类社会中的许多系统都可以用复杂网络模型表示, 如社交网络、论文合著网、电子邮件网、博客引用网、新陈代谢网等。复杂网络具有自组织、无标度以及网络社区结构等特性, 其中网络社区结构是复杂网络中最普遍和最重要的拓扑属性。社区是复杂网络中一类节点的集合, 集合内部节点之间的相互作用比它们与外部节点的相互作用更强。社区发现算法可以通过分析网络中节点之间的关联性, 挖掘复杂网络中的社区结构, 从而揭示复杂网络中隐含的特性和功能, 对于网络结构的理论研究和网络分析的实际应用有着重要的作用。
已有许多理论和算法可以用于发现复杂网络中的社区结构, 例如基于图分割的算法[1], 基于层次聚类的算法[2,3], 基于随机游走的算法[4], 基于遗传思想的算法[5], 基于模块度优化的算法[6,7,8], 基于谱聚类的算法[9,10,11]等, 这些算法在社区发现中取得了不错的效果。但是上述算法致力于对网络结构进行划分来形成社区, 而当网络节点数量不断增加、节点间关系变得更加复杂时, 这些算法已经无法有效地挖掘社区结构。2007年, Raghavan等[12]将Zhu等 [13]提出的标签传播算法(Label Propagation Algorithm, LPA)首次应用于社区发现。与前几类方法相比, LPA不需要知道网络结构或者先验社区结构, 仅依赖于网络的传播特性, 且具有线性时间复杂度, 适合用于对复杂网络进行社区发现和分析。但是LPA也有一些缺点: 存在不必要的标签更新过程; LPA标签更新规则仅仅考虑邻接节点中标签出现的次数, 忽略邻接节点的局部网络结构, 当出现次数最多的标签有多个时, 会随机选择一个标签。这些缺点会导致算法延迟收敛、社区发现的结果准确率低的问题。
针对以上问题, 2013年, Xie等[14]将LPA和马尔科夫聚类算法相结合, 使用矩阵相乘实现标签传播过程。算法每次迭代都会删除小概率标签, 最终每一个节点都有唯一的可选标签, 产生一种划分结果, 算法的稳定性得到提高。但在大规模网络中, 使用矩阵相乘会消耗很大的内存和计算资源。2014年, Xing等[15]提出NIBLPA算法, 该算法使用k-shell分解方法计算每一个节点的影响力值, 依据影响力值从高到低的顺序更新标签和选取标签, 提高了算法的稳定性和准确率, 同时也增加了算法的时间复杂度。2015年, Sun等[16]提出Cen_LP算法, 该算法为每一个节点定义了节点中心性度量值, 依据节点中心性的度量值从低到高的顺序更新标签和选取标签。算法在不增加时间复杂度的情况下, 提高了社区发现的准确率。2016年, Chin等[17]提出CLPA-GNR算法, 该算法首先确定主要社区, 然后网络中的其他节点会按照限制条件被划分到主要社区中, 最后在社区之间交换节点或者从社区中划分出新社区进行调整。该算法提高了社区发现的稳定性和准确率, 但由于不断的局部搜索, 算法的时间复杂度较高。2016年, Shang等[18]提出CSCNLPA算法, 该算法在预处理阶段会进行多次迭代, 每次迭代依据节点度数搜索核心节点, 并依据节点的相似度给邻接节点更新标签, 在预处理阶段结束以后, 会依据新规则给未赋予标签的节点更新标签。该算法提高了社区发现的准确率, 但也提高了时间复杂度。2017年, Li等[19]提出Stepping LPA-S算法, 该算法利用节点与节点之间、社区与社区之间的相似度进行标签更新和社区合并。该算法提高了社区发现的准确率, 但由于不断地合并社区, 算法的时间复杂度较高。2017年, Ma等[20]提出NILPA算法, 该算法依据节点的度数计算节点重要性并且依据随机游走理论形成节点相似度矩阵, 在两者的基础上形成新的度量标准以衡量节点之间的紧密程度, 依据该度量标准更新标签。该算法提高了社区发现的准确率, 但由于计算新的度量标准, 算法的时间复杂度较高。综上所述, 上述算法在准确率上有所提高, 但是都增大了算法的时间复杂度, 降低了算法的执行速度。
为了在提高算法准确率的基础上加快算法的执行速度, 本文提出基于速度优化和社区偏向的标签传播算法(A Label Propagation Algorithm based on the Speed Optimization and Community Preference, SOCP_ LPA), 包括基于速度优化的更新过程和基于社区偏向的更新规则。SOCP_LPA在更新过程中, 用节点信息列表记录每一个节点的状态信息, 判断当前节点是否需要更新, 从而避免了不必要的更新, 提高了运行速度。在更新过程中, SOCP_LPA利用节点对社区偏向程度的更新规则, 即不仅仅考虑邻接节点中标签出现的次数, 还会评估当前节点对周围社区的偏向程度, 从而提高算法的准确率。
LPA具有无参数、易于实现、执行速度快的优点。即使运行在线性时间内, LPA仍可以通过避免不必要的更新进一步提升算法执行速度。当处理大规模网络时, 这些改进是至关重要的。
LPA开始运行时, 由于每一个节点的邻接节点都拥有独一无二的标签, 所以在更新过程中节点改变标签的概率非常高, 在几次迭代以后, 绝大部分的节点都会被正确划分(当前节点的标签已经变为邻接节点中标签出现次数最多的标签)。为了说明这一现象, 本文将LPA算法分别应用在CA-Hepth、CA-GrQc、email和PGP这4种常用的真实网络数据集上进行社区发现, 数据集详细信息如表1所示。实验中将每一个真实网络都看作无向无权图, 每一个图都没有自循环的边并且只取图的最大连通子图。实验在PyCharm工具中使用Python语言实现, 在Windows7, AMD Core A8-4555 CPU@1.6GHz, 8GB内存环境下进行。5次迭代以后, 计算已被正确划分的节点占总节点的百分比。每个数据集重复100次实验, 4种网络的收敛信息如表2所示。
表1 数据集介绍
网络名称 | 节点数 | 边数 | 描述 |
---|---|---|---|
1 133 | 5 451 | 美国大学生email社交网络[21] | |
CA-GrQc | 4 158 | 13 422 | 广义相对论研究团体网络[22] |
CA-Hepth | 8 638 | 24 807 | 高能物理学理论研究团体网络[22] |
PGP | 10 680 | 24 316 | Pretty-Good-Privacy算法研究网络[23] |
表2 网络的收敛信息
网络名称 | 5次迭代后正确划分节点 所占百分比 | 算法收敛所需的 迭代次数 |
---|---|---|
96.64% | 17.89 | |
CA-GrQc | 99.37% | 15.7 |
CA-Hepth | 98.97% | 59.75 |
PGP | 99.37% | 21.9 |
从表2可以看出, 在5次迭代以后, email网络有96.64%的节点被正确划分, CA-Hepth网络有98.97%的节点被正确划分, CA-GrQc和PGP网络有99%以上的节点都已被正确划分。这说明无论总的迭代次数是多少, LPA在5次迭代以后, 绝大多数的节点都已经得到正确的划分, 而5次迭代后的每一次迭代, 仅仅较少的节点标签会发生改变, 而对于大部分的节点来说, 即使尝试更新标签, 也不会改变标签, 这些不必要的更新拖慢了算法的收敛速度。针对这一问题, 提出基于速度优化的更新过程。
LPA在每次迭代过程中, 对节点标签的更新有两种情况, 一种是邻接节点中出现次数最多的标签只有一个, 另一种是邻接节点中出现次数最多的标签有多个。如果当前节点的标签已经是邻接节点中出现次数最多的唯一标签, 这样的节点称为被动节点, 被动节点在本次更新中不会再改变标签。如果邻接节点中出现次数最多的标签有多个, 当前节点的标签是其中之一, 这样的节点称为干扰节点, 根据LPA的更新规则, 干扰节点会随机选择其中一个标签, 干扰节点在本次更新中可能会改变标签。被动节点和干扰节点以外的其他节点是主动节点, 主动节点在本次更新中一定会改变标签。如果能避免对被动节点的标签更新, 算法会以更快的速度收敛。基于此, 本文构建一个节点信息列表, 其中包含所有的节点和每个节点的状态信息(被动、干扰和主动)。每次只选择信息列表中的干扰和主动节点进行标签更新, 标签更新完成后再进行节点状态更新。
基于速度优化的更新过程如下:
(1) 初始化节点信息列表。开始时, 每一个节点都被独一无二的标签进行标记, 都被设置为主动节点, 并被加入到节点信息列表中。
(2) 从节点信息列表中随机选择一个主动节点或干扰节点, 按照更新规则, 对当前节点的标签进行更新。标签更新完成后对该节点及其邻接节点进行状态更新, 如果节点变为被动节点, 就将其标记为被动节点; 如果变为干扰节点, 就将其标记为干扰节点。邻接节点在更新状态时, 还可能更新为主动节点。
(3) 根据节点状态判断节点信息列表中是否还有主动节点, 若有, 继续执行步骤(2), 否则, 算法结束。
LPA中标签的传播类似于人与人之间建议的传播。LPA更新一个节点的标签时, 总是选择其邻接节点中出现次数最多的标签进行更新。这里, 将节点看作人, 将标签看作人提出的建议, 一个人在采纳意见时往往会考虑提出不同意见的人数, 如果某一意见提出的人数最多, 这个人就会采纳这个意见。然而实际中, 当采纳建议时, 不仅仅会看提出该建议的人数(建议人群的大小), 还会去评估该人群中每个人的权威性, 即建议人的朋友圈大小, 和建议人与采纳人公共朋友的个数。在LPA中, 提出某一意见的人群即具有相同标签的社区, 社区中每一个节点的权威性可以通过节点的度数以及节点与待更新节点的公共邻接节点的个数来评估, 社区权威性则是社区中所有节点的权威性之和。基于这个原理, 提出基于社区偏向的标签更新规则。
对算法中使用的主要概念进行定义, 如下所示。
定义1 边缘度数比: 假设节点i和节点j是图中的两个节点, i与j的边缘度数比定义如公式(1)所示。
${{\rho }_{ij}}=({{d}_{j}}-sim(i,j)-1)/{{d}_{i}}$ (1)
其中, di表示节点i的度数, sim(i,j)表示节点i和节点j的公共邻接节点的个数。
定义2 节点的重要程度: 节点j对当前节点i的重要程度定义如公式(2)所示。
$close(i,j)=1\text{+}{{c}_{1}}sim(i,j)+{{c}_{2}}{{\rho }_{ij}}$ (2)
其中, c1和c2是取值在0到1之间的常量, ρij是边缘密度比。
定义3 社区偏向程度: 节点i对具有标签l的社区的偏向程度定义如公式(3)所示。
$P(i,l)=\sum\limits_{j\in N(i)\bigcap {{C}_{l}}}{close(i,j)}$ (3)
其中, Cl表示标签为l的节点形成的社区, N(i)表示i节点的邻接节点。
定义4 近似最大值标签: 将LPA中邻接节点中出现次数最多的标签的出现次数记为m。如果邻接节点某类标签的出现次数n和m值相差不大时, 这类标签称为近似最大值标签。算法引入一个取值在0到1之间的常量Radio进行衡量。
$\frac{m-n}{m}<Radio$ (4)
若邻接节点中标签出现次数n满足公式(4), 称之为近似最大值标签。当前节点可以在近似最大值标签中选择, 在更新过程中不考虑节点数量较少的标签, 因此Radio的值不宜设置过大, 应在0.1至0.4之间。
定义5 基于社区偏向的更新规则: 设集合T中保存了i节点的邻接节点中出现次数最多的标签和近似最大值标签, L(i)表示更新后的节点i的标签, 基于社区偏向的更新规则如公式(5)所示。
$L(i)=\underset{l\in T}{\mathop{\arg \max }}\,\{P(i,l)\}$ (5)
公式(5)是在集合T中, 让节点i选择使得$P(i,l)$最大的标签(社区), 如果有多个标签使得$P(i,l)$最大, 则从中随机选取。
本文提出基于速度优化的更新过程和基于社区偏向的更新规则, 在基于速度优化的更新过程中引入基于社区偏向的更新规则, 形成基于速度优化和社区偏向的标签传播算法(SOCP_LPA), 算法的伪代码如下。
Input: G (V, E)、c1、c2、Radio/*输入图的邻接矩阵和公式(2)、公式(4)的参数
Output: 社区划分结果
1. 为每个节点分配一个独一无二的标签;
2. 将所有的节点标记为主动节点并添加到节点信息列表中;
3. 从节点信息列表中随机选取一个主动节点或干扰节点, 记为i;
4. 将当前节点i的邻接节点中出现次数最多的标签和近似最大值标签加入集合T;
5. 初始化maxLabel集合为空; maxLabel记录与当前节点i偏向程度最大的一个或多个标签;
6. for each l∈T do
7. maxP ← 0;
8. 依据公式(3)计算P(i,l);
9. if P(i,l) == maxP then
10. 将l标签添加到maxLabel中;
11. else if P(i,l) > maxP then
12. maxP ← P(i,l);
13. 从maxLabel中移除所有元素;
14. 将l标签添加到maxLabel中;
15. end if
16. end for
17. 当前节点i从maxLabel集合中随机选取标签;
18. for each j∈N(i)∩i do /*更新当前节点i和i的邻接节点的状态;
19. if node j is interference node then
20. 标记j为干扰节点;
21. else if node j is passive node then
22. 标记j为被动节点;
23. else
24. 标记j为主动节点;
25. end if
26. end for
27. 如果节点信息列表仍含有主动节点, 跳到第3行;
28. 将具有相同标签的节点划分到一个社区中, 算法结束。
SOCP_LPA算法的核心是对节点信息列表中的主动节点和干扰节点进行标签更新和状态更新, 更新当前节点的标签和更新标签状态的时间复杂度均是O(di), di是当前节点i的度数。由于算法仅更新节点信息列表中的主动节点和干扰节点, 因此在整个更新过程中, 时间复杂度是O(kd), k是对主动节点和干扰节点更新的总次数, d是网络的平均度数。
为了验证SOCP_LPA算法在社区发现的速度和准确率上的提高, 除了LPA算法, 还选取NIBLPA[15]和Cen_LP[16]两种较好的LPA改进算法进行对比。其中公式(2)中c1值设为1, c2值设为0.2, 公式(4)中Radio值设为0.4。
(1) 真实网络
为了验证SOCP_LPA算法社区发现的准确率, 除了使用表1中的4种较大规模真实网络, 还选取了4种较小规模的真实网络, 具体信息如表3所示。这些真实网络是社区发现领域常用的数据集。
表3 数据集介绍
网络名称 | 节点数 | 边数 | 描述 |
---|---|---|---|
karate | 34 | 78 | 美国空手道俱乐部网络[24] |
dolphins | 62 | 159 | 海豚数据网络[18] |
polbooks | 105 | 441 | 亚马逊美国政治书销售网络[25] |
football | 115 | 613 | 美国秋季大学生足球队网络[26] |
(2) LFR基准网络
LFR基准网络是由Lancichinetti等[27]提出的专门用于生成模拟网络的算法产生, 该算法主要根据输入的参数, 尽可能生成符合真实网络特征的人工合成网络, 因此LFR基准网络具有较高的实验价值, 是当前社区发现研究中最为常用的模拟网络。在生成网络时常用的参数如表4所示, 其中, mu表示节点与社区外部连接的概率, 其值在0到1之间, mu值越大, 表明社区的结构越不清晰。按照表4的参数, 本实验生成的LFR基准网络的详细介绍如表5所示, 其中包含4组LFR基准网络, 每组包含15个不同mu值的网络, mu值在0.1至0.8之间, 间隔为0.05。相比1000S(5000S)网络, 1000B(5000B)网络中每一社区内的节点个数较多。
表5 LFR基准网络数据集描述
网络名称 | N | k | maxk | minc | maxc | mu |
---|---|---|---|---|---|---|
1000S | 1 000 | 10 | 50 | 10 | 50 | 0.1~0.8 |
1000B | 1 000 | 10 | 50 | 20 | 100 | 0.1~0.8 |
5000S | 5 000 | 10 | 50 | 10 | 50 | 0.1~0.8 |
5000B | 5 000 | 10 | 50 | 20 | 100 | 0.1~0.8 |
(1) 模块度
模块度是一种重要的衡量网络社区结构强度的方法, 用于真实网络社区发现结果的衡量。模块度[3]的计算如公式(6)所示。
$Q=\sum\limits_{c\in L}{\frac{{{L}_{c}}}{m}}-{{(\frac{{{D}_{c}}}{2m})}^{2}}$ (6)
其中, Lc表示社区c内部的链接数, m表示整个网络的链接数, 即边的个数, Dc表示社区c内部节点的度数总和, L表示整个网络的全部社区。
(2) 归一化互信息和F-measure
由于LFR基准网络已知真实的划分结果, 所以选取归一化互信息(Normal Mutual Information, NMI) [28]和F-measure[29]来评估其算法准确性。
NMI是社区发现领域常用的评价指标, 可以在LFR基准网络上有效评价社区发现的准确性。NMI的取值在0至1之间, 其计算如公式(7)所示。
$NMI(A,B)=\frac{-2\sum\limits_{i=1}^{{{C}_{A}}}{\sum\limits_{j=1}^{{{C}_{B}}}{{{N}_{ij}}\log (\frac{{{N}_{ij}}n}{{{N}_{i\cdot }}{{N}_{\cdot j}}})}}}{\sum\limits_{i=1}^{{{C}_{A}}}{{{N}_{i\cdot }}\log (\frac{{{N}_{i\cdot }}}{n})+\sum\limits_{j=1}^{{{C}_{B}}}{{{N}_{\cdot j}}\log (\frac{{{N}_{\cdot j}}}{n})}}}$ (7)
其中, A是网络的真实划分, B是算法对网络所做出的划分, CA表示A种划分的社区个数, CB表示B种划分的社区个数, N表示一个矩阵, Nij示A种划分第i个社区中的节点出现在B种划分第j个社区中的节点个数。${{N}_{i\cdot }}$表示矩阵N第i行的求和。${{N}_{\cdot j}}$表示矩阵N第j列的求和。n表示整个网络节点的个数。如果A种划分和B种划分是相同的, 则NMI(A,B)值为1。NMI值越高, 表示算法社区发现的准确率越高。
F-measure是一种常用于评价分类模型好坏的统计量, 可以在LFR基准网络上有效评价社区发现的准确性, 其计算公式如公式(8)所示。
$F\text{-}measure=\frac{2\times Precision\times Recall}{Precision+Recall}$ (8)
其中, Precision和Recall的定义如公式(9)和公式(10)所示。
$Precision\text{=}\frac{\left| S\bigcap T \right|}{\left| S \right|}$ (9)
$Recall\text{=}\frac{\left| S\bigcap T \right|}{\left| T \right|}$ (10)
T表示节点对(i, j)的集合, 在网络的真实划分中, 节点i和节点j属于同一社区。S也表示节点对(i, j)的集合, 在算法对网络所做出的划分中, 节点i和节点j属于同一社区。$S\bigcap T$表示S集合和T集合的节点对的交集。当真实划分和算法所做出的划分相同时, F-measure值为1。F-measure值越高, 表示算法社区发现的准确率越高。
(1) 时间复杂度对比
LPA、Cen_LP和NIBLPA算法的核心是进行多次迭代, 每一次迭代会更新所有节点的标签, 每一个节点的标签都要更新为邻接节点中出现次数最多的标签, 因此每一次迭代的时间复杂度为O(nd), n为网络的节点个数, d是网络的平均度数。由于三种算法需要进行多次迭代, 因此三种算法的总体时间复杂度均为O(tnd), t为算法的迭代次数。SOCP_ LPA算法的时间复杂度为O(kd), 由于有效更新的次数k明显小于算法的迭代次数t和节点个数n的乘积, 因此SOCP_LPA的时间复杂度明显小于其他三种算法。
(2) 迭代次数对比
将SOCP_LPA、LPA、NIBLPA和Cen_LP分别应用到真实网络数据集中进行社区发现, 计算迭代次数, 每一个算法的终止条件都是节点标签成为邻接节点中出现次数最多的标签。为了具有可比性, SOCP_LPA算法的迭代次数记为算法总更新次数和节点总个数的比值, 对每一个网络重复100次实验求平均的迭代次数, 得到的结果如图1所示, 其中横坐标表示真实网络, 纵坐标表示迭代次数。
从图1可以直观看出, SOCP_LPA的迭代次数明显要比其他三种算法少很多。与LPA相比的图1(a)中, 在前4个小规模网络上, SOCP_LPA的迭代次数均是LPA的1/3 左右, 在email网络、CA-GrQc网络上, SOCP_LPA迭代次数均是LPA的1/12左右, 而在CA-Hepth和PGP网络上, SOCP_LPA迭代次数分别是LPA的1/50和1/20左右。与NIBLPA相比的图1(b)中, SOCP_LPA在4个小规模网络上的迭代次数是NIBLPA的1/3至1/5左右, 在email、CA-GrQc、CA-Hepth网络上, SOCP_LPA的迭代次数是NIBLPA的1/12至1/9左右, 而在PGP网络上, SOCP_LPA的迭代次数为NIBLPA的1/16左右。与Cen_LP相比的图1(c)中, 4种小规模网络上, SOCP_LPA的迭代次数均是Cen_LP的1/3左右, 在email、CA-GrQc和PGP网络中, SOCP_LPA的迭代次数是Cen_LP的1/12至1/8左右, 而在CA-Hepth数据集上, SOCP_LPA的迭代次数是Cen_LP的1/30左右。综上所述, 与其他三种算法相比, SOCP_LPA明显提高了算法的执行速度, 并且网络规模越来越大时, 算法的速度优势越来越明显。
将4种算法应用在8种真实网络数据集上进行社区发现, 并在平均模块度和最高模块度方面进行对比。平均模块度是对每一个真实网络做100次实验求取模块度的平均值, 最高模块度是100次实验中的最优结果。平均模块度的对比如表6所示, 最高模块度的对比如表7所示, 其中加黑的数据表明算法中对应每一个网络的最大模块度。
表6 平均模块度对比
网络名称 | SOCP_LPA | LPA | Cen_LP | NIBLPA |
---|---|---|---|---|
karate | 0.371 | 0.345 | 0.416 | 0.423 |
dolphins | 0.523 | 0.483 | 0.523 | 0.521 |
polbooks | 0.518 | 0.493 | 0.518 | 0.497 |
football | 0.573 | 0.588 | 0.589 | 0.582 |
0.497 | 0.234 | 0.427 | 0.427 | |
CA-GrQc | 0.788 | 0.775 | 0.740 | 0.710 |
CA-Hepth | 0.671 | 0.628 | 0.621 | 0.610 |
PGP | 0.819 | 0.802 | 0.750 | 0.783 |
表7 最高模块度对比
网络名称 | SOCP_LPA | LPA | Cen_LP | NIBLPA |
---|---|---|---|---|
karate | 0.371 | 0.416 | 0.416 | 0.423 |
dolphins | 0.526 | 0.527 | 0.526 | 0.521 |
polbooks | 0.523 | 0.526 | 0.523 | 0.497 |
football | 0.603 | 0.605 | 0.601 | 0.582 |
0.545 | 0.443 | 0.536 | 0.427 | |
CA-GrQc | 0.796 | 0.783 | 0.743 | 0.710 |
CA-Hepth | 0.685 | 0.647 | 0.633 | 0.610 |
PGP | 0.827 | 0.816 | 0.756 | 0.783 |
从表6可以看出, 除了karate、football网络以外, SOCP_LPA在其他网络上都取得了最好的结果, 尤其在email、CA-GrQc、CA-Hepth和PGP等4种较大规模的网络上优势更加明显。从表7可以看出, 在4种小规模网络上, SOCP_LPA的最高模块度都接近于最好的结果, 在4种较大规模网络上, SOCP_LPA算法的最高模块度明显高于其他三种算法。综上所述, SOCP_LPA算法相比其他三种算法提高了社区发现的准确率, 尤其对于大规模网络效果更好, 准确率更高。
本实验将4种算法应用在表5中的LFR基准网络数据集上进行社区发现, 并在归一化互信息 (NMI)和F-measure两个方面进行对比。对每一个网络重复100次实验求取平均NMI和F-measure值。
NMI对比的实验结果如图2所示, 其中横坐标表示不同mu值的LFR基准网络, 纵坐标表示NMI值。
随着mu值不断增大, 社区结构越来越不清晰, 社区发现的难度在增大, 4种算法的社区发现准确率也在明显下降。从图2(a)看出, 当mu值在0.1至0.4之间时, 4种算法都能较好地发现社区结构且SOCP_LPA的NMI值较高。mu值在0.4至0.8之间时, SOCP_LPA的NMI值明显高于其他三种算法。从图2(b)可以看出, mu值在0.65至0.75之间时, SOCP_LPA算法略低于NIBLPA, 但在其mu值的网络上, SOCP_LPA算法的NMI值都是4种算法中最高的。从图2(c)可以看出, 当mu值大于0.75时, Cen_LP和LPA算法已经无法发现社区结构, 而NIBLPA算法的准确度也明显低于SOCP_LPA。从图2(d)可以看出, 当mu值大于0.55时, SOCP_LPA算法相比其他三种算法优势较为明显。
F-measure对比如图3所示, 其中横坐标表示不同mu值的网络, 纵坐标表示F-measure值。
从图3(a)和图3(c)可以看出, mu值在0.1至0.6之间时SOCP_LPA略高于其他三种算法, mu值在0.6至0.8之间时, SOCP_LPA算法的F-measure值明显高于其他三种算法。从图3(b)和图3(d)可以看出, 对于1000B和5000B网络, 虽然SOCP_LPA在mu值为0.4左右时有波动, 但是整体上SOCP_LPA算法的F-measure值仍是4种算法最高的。
综上所述, SOCP_LPA算法相比其他三种算法提高了社区发现的准确率, 尤其对于mu值较大的社区结构不清晰的网络, 算法的优势更加明显。
本文提出基于速度优化和社区偏向的标签传播算法, 首先, 该算法引入节点信息列表, 通过记录每一个节点的更新状态, 避免了许多不必要的更新, 从而提高了计算速度。其次, 算法引入社区偏向的更新规则, 节点更可能加入重要的社区, 提高了社区发现的准确率。最后, 将新的更新过程和新的更新规则结合形成本文提出的基于速度优化和社区偏向的标签传播算法, 与目前几种改进的LPA算法相比, 本文提出的算法在速度和准确率上均有所提升, 尤其在处理较大和较复杂的网络时, 算法的优势更加明显。尽管算法在速度和准确率上均有所提升, 但算法的更新顺序仍具有随机性, 寻求一种较好的更新顺序是下一步的研究方向。
张素琪, 霍士杰, 顾军华: 提出研究思路, 设计研究方案;
霍士杰: 进行实验, 论文起草及最终版本修订;
高星, 郭京津: 采集、清洗和分析数据, 论文最终版本审核。
所有作者声明不存在利益冲突关系。
支撑数据由作者自存储, E-mail: zhangsuqie238@163.com。
[1] 张素琪. 算法代码.zip. 算法实现.
[2] 张素琪. 实验数据.zip. 实验过程数据.
[1] |
Defining and Identifying Communities in Networks [J].https://doi.org/10.1073/pnas.0400054101 URL [本文引用: 1] |
[2] |
Community Structure in Social and Biological Networks [J].https://doi.org/10.1073/pnas.122653799 URL [本文引用: 1] |
[3] |
Fast Algorithm for Detecting Community Structure in Networks [J].https://doi.org/10.1103/PhysRevE.69.066133 URL PMID: 15244693 [本文引用: 2] 摘要
Abstract Many networks display community structure--groups of vertices within which connections are dense but between which they are sparser--and sensitive computer algorithms have in recent years been developed for detecting this structure. These algorithms, however, are computationally demanding, which limits their application to small networks. Here we describe an algorithm which gives excellent results when tested on both computer-generated and real-world networks and is much faster, typically thousands of times faster, than previous algorithms. We give several example applications, including one to a collaboration network of more than 50,000 physicists.
|
[4] |
Computing Communities in Large Networks Using Random Walks [J].https://doi.org/10.1007/11569596_31 URL [本文引用: 1] 摘要
Abstract: Dense subgraphs of sparse graphs (communities), which appear in most real-world complex networks, play an important role in many contexts. Computing them however is generally expensive. We propose here a measure of similarities between vertices based on random walks which has several important advantages: it captures well the community structure in a network, it can be computed efficiently, it works at various scales, and it can be used in an agglomerative algorithm to compute efficiently the community structure of a network. We propose such an algorithm which runs in time O(mn^2) and space O(n^2) in the worst case, and in time O(n^2log n) and space O(n^2) in most real-world cases (n and m are respectively the number of vertices and edges in the input graph). Experimental evaluation shows that our algorithm surpasses previously proposed ones concerning the quality of the obtained community structures and that it stands among the best ones concerning the running time. This is very promising because our algorithm can be improved in several ways, which we sketch at the end of the paper.
|
[5] |
Community Detection Based on Modularity and an Improved Genetic Algorithm [J].https://doi.org/10.1016/j.physa.2012.11.003 URL [本文引用: 1] 摘要
78 We propose a modularity and improved genetic algorithm (MIGA). 78 MIGA takes the modularity Q as the objective function. 78 MIGA uses prior information: the number of community structures. 78 MIGA takes the simulated annealing method as the local search method. 78 Simulation results reflect the effectiveness of MIGA.
|
[6] |
On Modularity Clustering [J].https://doi.org/10.1109/TKDE.2007.190689 URL [本文引用: 1] |
[7] |
A Fast Parallel Modularity Optimization Algorithm (FPMQA) for Community Detection in Online Social Network [J].https://doi.org/10.1016/j.knosys.2013.06.014 URL [本文引用: 1] 摘要
As information technology has advanced, people are turning more frequently to electronic media for communication, and social relationships are increasingly found in online channels. Discovering the latent communities therein is a useful way to better understand the properties of a virtual social network. Traditional community-detection tasks only consider the structural characteristics of a social organization, but more information about nodes and edges such as semantic information cannot be exploited. What is more, the typical size of virtual spaces is now counted in millions, if not billions, of nodes and edges, most existing algorithms are incapable to analyze such large scale dense networks.In this paper, we first introduce an interesting social network model (Interest Network) in which links between two IDs are built if they both participate to the discussions about one or more topics/stories. In this case, we say both of the connected two IDs have the similar interests. Then, the edges of the initial network are updated using the attitude consistency information of the connected ID pairs. For a given ID pair i and j, they may together reply to some topics/IDs. The implicit orientations/attitudes of these two IDs to their together-reply topics/IDs may not be the same. We use a simple statistical method to calculate the attitude consistency, the value of which is between 0 and 1, and the higher value corresponds to a greater degree of consistency of the given ID pair to topics/IDs. The updated network is called Similar-View Network (SVN). In the second part, a fast parallel modularity optimization algorithm (FPMQA) that performs the analogous greedy optimization as CNM and FUC is used to conduct community discovering. By using the parallel manner and sophisticated data structures, its running time is essentially fast, O(k(max) (k(max) + (k) log k(max))). Finally, we propose an evaluation metric, which is based on the reliable ground truths, for online network community detection. In the experimental work, we evaluate our method using real datasets and compare our approach with several previous methods; the results show that our method is more effective and accurate in find potential online communities. (C) 2013 Elsevier B.V. All rights reserved.
|
[8] |
Normalized Modularity Optimization Method for Community Identification with Degree Adjustment [J].https://doi.org/10.1103/PhysRevE.88.052802 URL PMID: 24329313 [本文引用: 1] 摘要
As a fundamental problem in network study, community identification has attracted much attention from different fields. Representing a seminal work in this area, the modularity optimization method has been widely applied and studied. However, this method has issues in resolution limit and extreme degeneracy and may not perform well for networks with unbalanced structures. Although several methods have been proposed to overcome these limitations, they are all based on the original idea of defining modularity through comparing the total number of edges within the putative communities in the observed network with that in an equivalent randomly generated network. In this paper, we show that this modularity definition is not suitable to analyze some networks such as those with unbalanced structures. Instead, we propose to define modularity through the average degree within the communities and formulate modularity as comparing the sum of average degree within communities of the observed network to that of an equivalent randomly generated network. In addition, we also propose a degree-adjusted approach for further improvement when there are unbalanced structures. We analyze the theoretical properties of our degree adjusted method. Numerical experiments for both artificial networks and real networks demonstrate that average degree plays an important role in network community identification, and our proposed methods have better performance than existing ones.
|
[9] |
A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network [C]// |
[10] |
Spectral Methods for the Detection of Network Community Structure: A Comparative Analysis [J].https://doi.org/10.1088/1742-5468/2010/10/P10020 URL [本文引用: 1] 摘要
Spectral analysis has been successfully applied at the detection of community structure of networks, respectively being based on the adjacency matrix, the standard Laplacian matrix, the normalized Laplacian matrix, the modularity matrix, the correlation matrix and several other variants of these matrices. However, the comparison between these spectral methods is less reported. More importantly, it is still unclear which matrix is more appropriate for the detection of community structure. This paper answers the question through evaluating the effectiveness of these five matrices against the benchmark networks with heterogeneous distributions of node degree and community size. Test results demonstrate that the normalized Laplacian matrix and the correlation matrix significantly outperform the other three matrices at identifying the community structure of networks. This indicates that it is crucial to take into account the heterogeneous distribution of node degree when using spectral analysis for the detection of community structure. In addition, to our surprise, the modularity matrix exhibits very similar performance to the adjacency matrix, which indicates that the modularity matrix does not gain desired benefits from using the configuration model as reference network with the consideration of the node degree heterogeneity.
|
[11] |
Detecting Network Communities Using Regularized Spectral Clustering Algorithm [J].https://doi.org/10.1007/s10462-012-9325-3 URL [本文引用: 1] 摘要
The progressively scale of online social network leads to the difficulty of traditional algorithms on detecting communities. We introduce an efficient and fast algorithm to detect community structure in social networks. Instead of using the eigenvectors in spectral clustering algorithms, we construct a target function for detecting communities. The whole social network communities will be partitioned by this target function. We also analyze and estimate the generalization error of the algorithm. The performance of the algorithm is compared with the standard spectral clustering algorithm, which is applied to different well-known instances of social networks with a community structure, both computer generated and from the real world. The experimental results demonstrate the effectiveness of the algorithm.
|
[12] |
Near Linear Time Algorithm to Detect Community Structures in Large-scale Networks [J].https://doi.org/10.1103/PhysRevE.76.036106 URL PMID: 17930305 [本文引用: 1] 摘要
Community detection and analysis is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks. Currently used algorithms that identify the community structures in large-scale real-world networks require a priori information such as the number and sizes of communities or are computationally expensive. In this paper we investigate a simple label propagation algorithm that uses the network structure alone as its guide and requires neither optimization of a predefined objective function nor prior information about the communities. In our algorithm every node is initialized with a unique label and at every step each node adopts the label that most of its neighbors currently have. In this iterative process densely connected groups of nodes form a consensus on a unique label to form communities. We validate the algorithm by applying it to networks whose community structures are known. We also demonstrate that the algorithm takes an almost linear time and hence it is computationally less expensive than what was possible so far.
|
[13] |
Learning from Labeled and Unlabeled Data with Label Propagation [R]. |
[14] |
LabelRank: A Stabilized Label Propagation Algorithm for Community Detection in Networks [C]// |
[15] |
A Node Influence Based Label Propagation Algorithm for Community Detection in Networks [J].https://doi.org/10.1155/2014/627581 URL PMID: 4066938 [本文引用: 2] 摘要
Label propagation algorithm (LPA) is an extremely fast community detection method and is widely used in large scale networks. In spite of the advantages of LPA, the issue of its poor stability has not yet been well addressed. We propose a novel node influence based label propagation algorithm for community detection (NIBLPA), which improves the performance of LPA by improving the node orders of label updating and the mechanism of label choosing when more than one label is contained by the maximum number of nodes. NIBLPA can get more stable results than LPA since it avoids the complete randomness of LPA. The experimental results on both synthetic and real networks demonstrate that NIBLPA maintains the efficiency of the traditional LPA algorithm, and, at the same time, it has a superior performance to some representative methods.
|
[16] |
CenLP: A Centrality-based Label Propagation Algorithm for Community Detection in Networks [J].https://doi.org/10.1016/j.physa.2015.05.080 URL [本文引用: 2] 摘要
Community detection is an important work for discovering the structure and features of complex networks. Many existing methods are sensitive to critical user-dependent parameters or time-consuming in practice. In this paper, we propose a novel label propagation algorithm, called CenLP (Centrality-based Label Propagation). The algorithm introduces a new function to measure the centrality of nodes quantitatively without any user interaction by calculating the local density and the similarity with higher density neighbors for each node. Based on the centrality of nodes, we present a new label propagation algorithm with specific update order and node preference to uncover communities in large-scale networks automatically without imposing any prior restriction. Experiments on both real-world and synthetic networks manifest our algorithm retains the simplicity, effectiveness, and scalability of the original label propagation algorithm and becomes more robust and accurate. Extensive experiments demonstrate the superior performance of our algorithm over the baseline methods. Moreover, our detailed experimental evaluation on real-world networks indicates that our algorithm can effectively measure the centrality of nodes in social networks.
|
[17] |
Detecting Community Structure by Using a Constrained Label Propagation Algorithm [J].https://doi.org/10.1371/journal.pone.0155320 URL PMID: 27176470 [本文引用: 1] 摘要
Abstract Community structure is considered one of the most interesting features in complex networks. Many real-world complex systems exhibit community structure, where individuals with similar properties form a community. The identification of communities in a network is important for understanding the structure of said network, in a specific perspective. Thus, community detection in complex networks gained immense interest over the last decade. A lot of community detection methods were proposed, and one of them is the label propagation algorithm (LPA). The simplicity and time efficiency of the LPA make it a popular community detection method. However, the LPA suffers from instability detection due to randomness that is induced in the algorithm. The focus of this paper is to improve the stability and accuracy of the LPA, while retaining its simplicity. Our proposed algorithm will first detect the main communities in a network by using the number of mutual neighbouring nodes. Subsequently, nodes are added into communities by using a constrained LPA. Those constraints are then gradually relaxed until all nodes are assigned into groups. In order to refine the quality of the detected communities, nodes in communities can be switched to another community or removed from their current communities at various stages of the algorithm. We evaluated our algorithm on three types of benchmark networks, namely the Lancichinetti-Fortunato-Radicchi (LFR), Relaxed Caveman (RC) and Girvan-Newman (GN) benchmarks. We also apply the present algorithm to some real-world networks of various sizes. The current results show some promising potential, of the proposed algorithm, in terms of detecting communities accurately. Furthermore, our constrained LPA has a robustness and stability that are significantly better than the simple LPA as it is able to yield deterministic results.
|
[18] |
Circularly Searching Core Nodes Based Label Propagation Algorithm for Community Detection [J].https://doi.org/10.1142/S0218001416590242 URL [本文引用: 2] 摘要
With the application of community detection in complex networks becoming more and more extensive, the application of more and more algorithms for community detection are proposed and improved. Among these algorithms, the label propagation algorithm is simple, easy to perform and its time complexity is linear, but it has a strong randomness. Small communities in the label propagation process are easy to be swallowed. Therefore, this paper proposes a method to improve the partition results of label propagation algorithm based on the pre-partition by circularly searching core nodes and assigning label for nodes according to similarity of nodes. First, the degree of each node of the network is calculated. We go through the whole network to find the nodes with the maximal degrees in the neighbors as the core nodes. Next, we assign the core nodes鈥 labels to their neighbors according to the similarity between them, which can reduce the randomness of the label propagation algorithm. Then, we arrange the nodes whose labels had not been changed as the new network and find the new core nodes. After that, we update the labels of neighbor nodes according to the similarity between them again until the end of the iteration, to complete the pre-partition. The approach of circularly searching for core nodes increases the diversity of the network partition and prevents the smaller potential communities being swallowed in the process of partition. Then, we implement the label propagation algorithm on the whole network after the pre-partition. Finally, we adopt a modified method based on the degree of membership determined by the bidirectional attraction of nodes and their neighbor communities. This method can reduce the possibility of the error in partition of few nodes. Experiments on artificial and real networks show that the proposed algorithm can accurately divide the network and get higher degree of modularity compared with five existing algorithms.
|
[19] |
Stepping Community Detection Algorithm Based on Label Propagation and Similarity [J].https://doi.org/10.1016/j.physa.2017.01.030 URL [本文引用: 1] 摘要
Community or module structure is one of the most common features in complex networks. The label propagation algorithm (LPA) is a near linear time algorithm that is able to detect community structure effectively. Nevertheless, when labeling a node, the LPA adopts the label belonging to the majority of its neighbors, which means that it treats all neighbors equally in spite of their different effects on the node. Another disadvantage of LPA is that the results it generates are not unique. In this paper, we propose a modified LPA called Stepping LPA-S, in which labels are propagated by similarity. Furthermore, our algorithm divides networks using a stepping framework, and uses an evaluation function proposed in this paper to select the final unique partition. We tested this algorithm on several artificial and real-world networks. The results show that Stepping LPA-S can obtain accurate and meaningful community structure without priori information.
|
[20] |
An Improved Label Propagation Algorithm Based on Node Importance and Random Walk for Community Detection [J].https://doi.org/10.1142/S0217984917501627 URL [本文引用: 1] 摘要
Tianren Ma and Zhengyou Xia, Mod. Phys. Lett. B 31, 1750162 (2017) [18 pages] https://doi.org/10.1142/S0217984917501627
|
[21] |
An Improved Label Propagation Algorithm Based on the Similarity Matrix Using Random Walk [J]. |
[22] |
Graph Evolution: Densification and Shrinking Diameters [J].https://doi.org/10.1145/1217299 URL [本文引用: 2] |
[23] |
Label Propagation Algorithm Based on Local Cycles for Community Detection [J].https://doi.org/10.1142/S0217979215500290 URL [本文引用: 1] 摘要
Label propagation algorithm (LPA) has been proven to be an extremely fast method for community detection in large complex networks. But an important issue of the algorithm has not yet been properly addressed that random update orders in label propagation process hamper the algorithm robustness of algorithm. We note that when there are multiple maximal labels among a node neighbors' labels, choosing a node' label from which there is a local cycle to the node instead of a random node' label can avoid the labels propagating among communities at random. In this paper, an improved LPA based on local cycles is given. We have evaluated the proposed algorithm on computer-generated networks with planted partition and some real-world networks whose community structure are already known. The result shows that the performance of the proposed approach is even significantly improved.
|
[24] |
A Core Leader Based Label Propagation Algorithm for Community Detection [J]. |
[25] |
Multi-objective Clustering Technique Based on K-nodes Update Policy and Similarity Matrix for Mining Communities in Social Networks [J].https://doi.org/10.1016/j.physa.2017.05.026 URL [本文引用: 1] |
[26] |
Evidential Label Propagation Algorithm for Graphs [C] // |
[27] |
Benchmark Graphs for Testing Community Detection Algorithms [J].https://doi.org/10.1103/PhysRevE.78.046110 URL PMID: 18999496 [本文引用: 1] 摘要
Abstract: Community structure is one of the most important features of real networks and reveals the internal organization of the nodes. Many algorithms have been proposed but the crucial issue of testing, i.e. the question of how good an algorithm is, with respect to others, is still open. Standard tests include the analysis of simple artificial graphs with a built-in community structure, that the algorithm has to recover. However, the special graphs adopted in actual tests have a structure that does not reflect the real properties of nodes and communities found in real networks. Here we introduce a new class of benchmark graphs, that account for the heterogeneity in the distributions of node degrees and of community sizes. We use this new benchmark to test two popular methods of community detection, modularity optimization and Potts model clustering. The results show that the new benchmark poses a much more severe test to algorithms than standard benchmarks, revealing limits that may not be apparent at a first analysis.
|
[28] |
Comparing Community Structure Identification [J]. |
[29] |
Semisupervised Clustering for Networks Based on Fast Affinity Propagation [J].https://doi.org/10.1155/2013/385265 URL [本文引用: 1] 摘要
Most of the existing clustering algorithms for networks are unsupervised, which cannot help improve the clustering quality by utilizing a small number of prior knowledge. We propose a semisupervised clustering algorithm for networks based on fast affinity propagation (SCAN-FAP), which is essentially a kind of similarity metric learning method. Firstly, we define a new constraint similarity measure integrating the structural information and the pairwise constraints, which reflects the effective similarities between nodes in networks. Then, taking the constraint similarities as input, we propose a fast affinity propagation algorithm which keeps the advantages of the original affinity propagation algorithm while increasing the time efficiency by passing only the messages between certain nodes. Finally, by extensive experimental studies, we demonstrate that the proposed algorithm can take fully advantage of the prior knowledge and improve the clustering quality significantly. Furthermore, our algorithm has a superior performance to some of the state-of-art approaches.
|
/
〈 |
|
〉 |