Question
I have a very large graph represented in a text file of about 1TB, with one edge per line:
From-node to-node
I would like to split it into its weakly connected components. If it were smaller I could load it into networkx and run its component-finding algorithms, for example http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
Answer 1:
If even the number of nodes is too large to fit in memory, you can divide and conquer and use external-memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):
1. Choose some threshold vertex k.
2. Read the original file and write each of its edges to one of 3 files (a sketch of this pass appears after the list):
   - to a, if its maximum-numbered vertex is < k;
   - to b, if its minimum-numbered vertex is >= k;
   - to c, otherwise (i.e. if it has one vertex < k and one vertex >= k).
3. If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm), then do so; otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative -- the lowest-numbered vertex in the same component as x.
4. Do likewise for b.
5. Sort the edges in c by their smallest-numbered endpoint.
6. Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to subproblem a. This can be done efficiently with a linear-time merge against a's solution (see the merge sketch below). Call the resulting file d.
7. Sort the edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
8. Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to subproblem b, using a linear-time merge as before. Call the resulting file e.
9. Solve e. (As with a and b, do this directly in memory if possible; otherwise recurse. If you need to recurse, you will need a different way of partitioning the edges, since every edge in e already "straddles" k. You could, for example, renumber the vertices using a random permutation, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), which would mean that the components of vertices 1, 2, k and k+1 all need to be combined into a single component.
10. Go through each line in the solution for subproblem a, updating the representative for that vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number, since we created it from a's solution) to a file f.
11. Do likewise for each line in the solution for subproblem b, creating file g.
12. Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f.)
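As a concrete illustration of step 2, here is a minimal Python sketch of the partitioning pass (my own code, not from the answer; the function name and the literal file names a, b, c are illustrative, and it assumes one edge of two whitespace-separated integers per line). Edges written to c are normalized so that their < k endpoint comes first, which simplifies the sort in step 5:
def partition_edges(edge_path, k):
    """Step 2: split the edge list into files a, b and c around
    the threshold vertex k."""
    with open(edge_path) as edges, \
         open("a", "w") as a, open("b", "w") as b, open("c", "w") as c:
        for line in edges:
            u, v = map(int, line.split())
            if max(u, v) < k:
                a.write(line)        # both endpoints below the threshold
            elif min(u, v) >= k:
                b.write(line)        # both endpoints at or above it
            else:
                small, big = (u, v) if u < k else (v, u)
                c.write("%d %d\n" % (small, big))  # straddling edge, < k endpoint first
With c written this way, the sort in step 5 is a plain numeric sort on the first field, which GNU sort can do out of core (e.g. sort -n -k1,1 c > c_sorted).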
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).
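To make steps 6 and 8 concrete, here is a hedged Python sketch of that linear-time merge for step 6 (step 8 is symmetric). It is my own illustration, not code from the answer: it assumes c_path is the file c sorted by its < k endpoint, reps_a_path is a's solution sorted by vertex number with lines "x y", and that a vertex absent from a's solution is its own representative:
def rename_small_endpoints(c_path, reps_a_path, d_path, k):
    """Replace the < k endpoint of each edge in c with its
    representative from subproblem a's solution. Both files are
    read strictly forward, so no random access is needed."""
    def next_rep(reps):
        line = next(reps, None)
        if line is None:
            return float('inf'), None              # sentinel: past the end
        x, y = map(int, line.split())
        return x, y
    with open(c_path) as c, open(reps_a_path) as reps, open(d_path, "w") as d:
        x, y = next_rep(reps)
        for line in c:
            u, v = map(int, line.split())
            small, big = (u, v) if u < k else (v, u)  # exactly one endpoint is < k
            while x < small:                          # advance the merge pointer
                x, y = next_rep(reps)
            rep = y if x == small else small          # absent vertices represent themselves
            d.write("%d %d\n" % (rep, big))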
Answer 2:
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint-set forest stored in memory.
This data structure stores only a rank and a parent pointer for each node, so it should fit in memory if you have few enough nodes.
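As a rough order-of-magnitude check (my own estimate, not from the answer): with 300 million nodes, a parent array and a rank array of 32-bit integers stored compactly (e.g. with Python's array module) come to about 2 * 4 bytes * 300 million ≈ 2.4 GB; plain Python lists of ints would need several times more.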
For a larger number of nodes, you could try the same idea but store the data structure on disk (possibly improved with an in-memory cache for frequently used items).
Here is some Python code that implements a simple in-memory version of a disjoint-set forest:
N = 7                     # number of nodes
rank = [0] * N            # upper bound on each tree's height
parent = list(range(N))   # every node starts as its own representative
def Find(x):
    """Find representative of connected component (with path compression)."""
    if parent[x] != x:
        parent[x] = Find(parent[x])
    return parent[x]
def Union(x, y):
    """Merge sets containing elements x and y (union by rank)."""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    if rank[x] < rank[y]:
        parent[x] = y
    elif rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1
with open("disjointset.txt", "r") as fd:
    for line in fd:
        fr, to = map(int, line.split())
        Union(fr, to)
for n in range(N):
    print(n, 'is in component', Find(n))
If you apply it to a text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
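For illustration, a minimal rank-free variant (my sketch, reusing the Find above):
def Union(x, y):
    """Rank-free merge: just attach one root under the other.
    Trees can get deeper, so Find may do more work, but the
    rank array is no longer needed."""
    parent[Find(x)] = Find(y)
Path compression in Find still keeps lookups fast in practice.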
Answer 3:
External-memory graph traversal is tricky to make performant. I advise against writing your own code; the implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using an existing library such as stxxl. See here for a paper that uses it to compute connected components.
Source: https://stackoverflow.com/questions/18363348/finding-components-of-very-large-graph