PageRank

The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex’s influence is based on the influence of the vertices which refer to it. A vertex’s influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.

A common way of interpreting PageRank value is through the Random Network Surfer model. A vertex’s PageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high PageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:

  • Assume a person travels or surfs across a network’s structure, moving from vertex to vertex in a long series of rounds.

  • The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.

  • Each round, the surfer randomly picks one of the outward connections from the surfer’s current location. The surfer repeats this random walk for a long time.

  • But wait. The surfer doesn’t always follow the network’s connection structure. There is a probability (1-damping, to be precise), that the surfer will ignore the structure and will magically teleport to a random vertex.

For more information, see the Google paper on PageRank.

Specifications

tg_pageRank (STRING v_type, STRING e_type,  FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k = 100,   BOOL print_accum = TRUE, STRING result_attr =  "", STRING file_path = "",   BOOL display_edges = FALSE)
Characteristic Value

Result

Computes a PageRank value (FLOAT type) for each vertex.

Input Parameters

  • STRING v_type: Names of vertex type to use

  • STRING e_type: Names of edge type to use

  • FLOAT max_change: PageRank will stop iterating when the largest difference between any vertex’s current score and its previous score ≤ max_change. That is, the scores have become very stable and are changing by less than max_change from one iteration to the next.

  • INT max_iter: Maximum number of iterations.

  • FLOAT damping: Fraction of score that is due to the score of neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives.

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store PageRank values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph’s edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E*k), E = number of edges, k = number of iterations.

The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.

Graph Types

Directed edges

Example

 # Use _ for default values
RUN QUERY tg_pageRank("Person", "Friend", 0.001, 25, 0.85, 100 _, _, _, _)

We ran pageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, max_change=0.001, and max_iter=25. We see that Ivy (center bottom) has the highest pageRank score (1.12). This makes sense since there are 3 neighboring persons who point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges. This is an artifact of our particular version pageRank. Likewise, Alex has a score of 0.15, which is (1-damping), because Alex has no in-edges.

Visualized results of example query on social10 graph