Strongly Connected Components

A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm[1]. In each iteration, pick a pivot vertex v randomly, and find its descendant and predecessor sets, where descendant set D_v is the vertex reachable from v, and predecessor set P_v is the vertices which can reach v (stated another way, reachable from v through reverse edges). The intersection of these two sets is a strongly connected component SCC_v. The graph can be partitioned into 4 sets: SCC_v, descendants D_v excluding SCC_v, predecessors P_v excluding SCC, and the remainders R_v. It is proved that any SCC is a subset of one of the 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.

The problem of this algorithm is unbalanced load and slow convergence when there are a lot of small SCCs, which is often the case in real-world use cases [3]. We added two trimming stages to improve the performance: size-1 SCC trimming[2] and weakly connected components[3].

The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.

[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.

[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.

[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.

Specifications

tg_scc (SET<STRING> v_type, SET<STRING> e_type, SET<STRING> rev_e_type,
  INT top_k_dist, INT output_limit, INT max_iter = 500, INT iter_wcc = 5,
  BOOL print_accum = TRUE, STRING attr= "", STRING file_path="")
Characteristic Value

Result

Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • SET<STRING> rev_e_type: Names of reverse direction edge types to use

  • INT top_k_dist: top k result in SCC distribution

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • INT max_iter: number of maximum iteration of the algorithm

  • INT iter_wcc: find weakly connected components for the active vertices in this iteration, since the largest SCCs are already found after several iterations; usually a small number(3 to 10)

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store community ID values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V = number of vertices

Time Complexity

O(iter*d), d = max(diameter of components)

Graph Types

Directed edges with reverse direction edges as well

Example

We ran scc on the social26 graph. A portion of the JSON result is shown below.

[
  {
    "i": 1
  },
  {
    "trim_set.size()": 8
  },
  {
    "trim_set.size()": 5
  },
  {
    "trim_set.size()": 2
  },
  {
    "trim_set.size()": 2
  },
  {
    "trim_set.size()": 0
  },
  {
    "@@cluster_dist_heap": [
      {
        "csize": 9,
        "num": 1
      },
      {
        "csize": 1,
        "num": 17
      }
    ]
  },

The first element "i"=1 means the whole graph is processed in just one iteration. The 5 "trim_set.size()" elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object" reports on the size distribution of SCCs.There is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.