How to efficiently calculate triad census in undirected graph in python

问题

I am calculating triad census as follows for my undirected network.

import networkx as nx
G = nx.Graph()
G.add_edges_from(
    [('A', 'B'), ('A', 'C'), ('D', 'B'), ('E', 'C'), ('E', 'F'),
     ('B', 'H'), ('B', 'G'), ('B', 'F'), ('C', 'G')])

from itertools import combinations
#print(len(list(combinations(G.nodes, 3))))

triad_class = {}
for nodes in combinations(G.nodes, 3):
    n_edges = G.subgraph(nodes).number_of_edges()
    triad_class.setdefault(n_edges, []).append(nodes)
print(triad_class)

It works fine with small networks. However, now I have a bigger network with approximately 4000-8000 nodes. When I try to run my existing code with a network of 1000 nodes, it takes days to run. Is there a more efficient way of doing this?

My current network is mostly sparse. i.e. there are only few connections among the nodes. In that case, can I leave the unconnected nodes and do the computation first and later add the unconnceted nodes to the output?

I am also happy to get approximate answers without calculating every combination.

Example of triad census:

Triad census is dividing the triads (3 nodes) in to the four categories shown in the below figure.

For example consider the network below.

The triad census of the four classes are;

{3: [('A', 'B', 'C')], 
2: [('A', 'B', 'D'), ('B', 'C', 'D'), ('B', 'D', 'E')], 
1: [('A', 'B', 'E'), ('A', 'B', 'F'), ('A', 'B', 'G'), ('A', 'C', 'D'), ('A', 'C', 'E'), ('A', 'C', 'F'), ('A', 'C', 'G'), ('A', 'D', 'E'), ('A', 'F', 'G'), ('B', 'C', 'E'), ('B', 'C', 'F'), ('B', 'C', 'G'), ('B', 'D', 'F'), ('B', 'D', 'G'), ('B', 'F', 'G'), ('C', 'D', 'E'), ('C', 'F', 'G'), ('D', 'E', 'F'), ('D', 'E', 'G'), ('D', 'F', 'G'), ('E', 'F', 'G')], 
0: [('A', 'D', 'F'), ('A', 'D', 'G'), ('A', 'E', 'F'), ('A', 'E', 'G'), ('B', 'E', 'F'), ('B', 'E', 'G'), ('C', 'D', 'F'), ('C', 'D', 'G'), ('C', 'E', 'F'), ('C', 'E', 'G')]}

I am happy to provide more details if needed.

EDIT:

I was able to resolve the memory error by commenting the line #print(len(list(combinations(G.nodes, 3)))) as suggested in the answer. However, my program is still slow and takes days to run even with a network of 1000 nodes. I am looking for a more efficient way of doing this in python.

I am not limited to networkx and happy to accept answers using other libraries and languages as well.

As always I am happy to provide more details as needed.

回答1:

The idea is simple: Instead of working on the graph directly I use the adjacency matrix. I thought this would be more efficient, and it seems I was right.

In an adjacency matrix a 1 indicates there is an edge between the two nodes, for example the first row can be read as "There is a link between A and B as well as C"

From there I looked at your four types and found the following:

for type 3 there must be an edge between a N1 and N2, N1 and N3 and between N2 and N3. In the adjacency matrix we can find this by going over each row (where each row represents a node and its connections, this is N1) and find nodes it is connected to (that would be N2). Then, in the row of N2 we check all connected nodes (this is N3) and keep those where there is a positive entry in the row of N1. An example of this is "A, B, C", A has a connection to B. B has a connection to C, and A also has a connection to C
for type 2 it works almost identical to type 3. Except now we want to find a 0 for the N3 column in the row of N1. An example of this is "A, B, D". A has a connection to B, B has a 1 in the D column, but A does not.
for type 1 we just look at the row of N2 and find all columns for which both the N1 row and N2 row have a 0.
lastly, for type 0 look at all columns in the N1 row for which the entry is 0, and then check the rows for those, and find all the columns that have a 0 as well.

This code should work for you. For 1000 nodes it took me about 7 minutes (on a machine with a i7-8565U CPU) which is still relatively slow, but a far cry from the multiple days it currently takes you to run your solution. I have included the example from your pictures so you can verify the results. Your code produces a graph that is different from the example you show below by the way. The example graph in the code and the adjacency matrix both refer to the picture you have included.

The example with 1000 nodes uses networkx.generators.random_graphs.fast_gnp_random_graph. 1000 is the number of nodes, 0.1 is the probability for edge creation, and the seed is just for consistency. I have set the probability for edge creation because you mentioned your graph is sparse.

networkx.linalg.graphmatrix.adjacency_matrix: "If you want a pure Python adjacency matrix representation try networkx.convert.to_dict_of_dicts which will return a dictionary-of-dictionaries format that can be addressed as a sparse matrix."

The dictionary structure has M dictionaries (= rows) with up to M dictionaries nested in them. Note that the nested dictionaries are empty so checking for the existence of the key in them is equivalent to checking for a 1 or 0 as described above.

import time

import networkx as nx


def triads(m):
    out = {0: set(), 1: set(), 2: set(), 3: set()}
    nodes = list(m.keys())
    for i, (n1, row) in enumerate(m.items()):
        print(f"--> Row {i + 1} of {len(m.items())} <--")
        # get all the connected nodes = existing keys
        for n2 in row.keys():
            # iterate over row of connected node
            for n3 in m[n2]:
                # n1 exists in this row, all 3 nodes are connected to each other = type 3
                if n3 in row:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[3].add(t)
                # n2 is connected to n1 and n3 but not n1 to n3 = type 2
                else:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[2].add(t)
            # n1 and n2 are connected, get all nodes not connected to either = type 1
            for n3 in nodes:
                if n3 not in row and n3 not in m[n2]:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[1].add(t)
        for j, n2 in enumerate(nodes):
            if n2 not in row:
                # n2 not connected to n1
                for n3 in nodes[j+1:]:
                    if n3 not in row and n3 not in m[n2]:
                        # n3 is not connected to n1 or n2 = type 0
                        if len({n1, n2, n3}) == 3:
                            t = tuple(sorted((n1, n2, n3)))
                            out[0].add(t)
    return out


if __name__ == "__main__":
    g = nx.Graph()
    g.add_edges_from(
        [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
    )
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    print(_out)

    start = time.time()
    g = nx.generators.fast_gnp_random_graph(1000, 0.1, seed=42)
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    end = time.time() - start
    print(end)

回答2:

Let's check the numbers. Let n be the number of vertices, e the number of edges.

0 triads are in O(n^3)

1 triads are in O(e * n)

2 + 3 triads are in O(e)

To get the 2 + 3 triads:

For every node a:
   For every neighbor of a b:
      For every neighbor of b c:
        if a and c are connected, [a b c] is a 3 triad
        else [a b c] is a 2 triad
   remove a from list of nodes (to avoid duplicate triads)

The next step depends on what the goal is. If you just need the number of 1 and 0 triads, then this is sufficient:

Explanation:

The 1 triads are all connected nodes + 1 unconnected node, so we get the number by computing the number of connected nodes + 1 other node, and subtract the cases where the other node is connected (2 and 3 triads)

The 0 triads is just all combinations of nodes minus the other triads.

If you need to actually list the triads, you are pretty much out of luck because no matter what you do, listing the 0 triads is in O(n^3) and will kill you once the graphs get bigger.

The above algo for 2 + 3 triads is in O(e * max(# neighbors)), the other parts are in O(e + n) for counting the nodes and edges. Much better than O (n^3) which you would need to explicitely list the 0 triads. Listing the 1 triads could still be done in O(e * n).

回答3:

You program most probably crashes when you try to convert all combinations to a list: print(len(list(combinations(G.nodes, 3)))). Never do it because combinations returns an iterator that consumes a little amount of memory, but list can easily eat gigabytes of memory.
If you have sparse graph, it is more reasonable to find triads in connected components: nx.connected_components(G)
Networkx has triads submodule but looks like it will not fit you. I already modified the networkx.algorithms.triads code to return triads, not their count. You can find it here. Note that it uses DiGraphs. If you want to use it with undirected graphs, you should convert them to directed first.

回答4:

import networkx as nx
from time import sleep
from itertools import combinations


G = nx.Graph()
arr=[]
for i in range(1000):
    arr.append(str(i))

for i,j in combinations(arr, 2):
    G.add_edges_from([(i,j)])

#print(len(list(combinations(G.nodes, 3))))
triad_class = [[],[],[],[]]

for nodes in combinations(G.subgraph(arr).nodes, 3):
            n_edges = G.subgraph(nodes).number_of_edges()
            triad_class[n_edges].append(nodes)


print(triad_class)

i think using list would be fast insertion than dictionary, as dictionary grows exponentially and will take more time.

来源：https://stackoverflow.com/questions/56537560/how-to-efficiently-calculate-triad-census-in-undirected-graph-in-python

标签

python

networkx

graph-theory

network-analysis