Rejecting isomorphisms from collection of graphs

前端 未结 6 1424
刺人心
刺人心 2020-12-30 03:06

I have a collection of 15M (Million) DAGs (directed acyclic graphs - directed hypercubes actually) that I would like to remove isomorphisms from. What is the common algorith

相关标签:
6条回答
  • 2020-12-30 03:51

    Here is a breakdown of McKay ’ s Canonical Graph Labeling Algorithm, as presented in the paper by Hartke and Radcliffe [link to paper].

    I should start by pointing out that an open source implementation is available here: nauty and Traces source code.

    Ok, let's do this! Unfortunately this algorithm is heavy in graph theory, so we need some terms. First I will start by defining isomorphic and automorphic.

    Isomorphism:

    Two graphs are isomorphic if they are the same, except that the vertices are labelled differently. The following two graphs are isomorphic.

    Isomorphic graphs

    Automorphic:

    Two graphs are automorphic if they are completely the same, including the vertex labeling. The following two graphs are automorphic. This seems trivial, but turns out to be important for technical reasons.

    Automorphic graphs

    Graph Hashing:

    The core idea of this whole thing is to have a way to hash a graph into a string, then for a given graph you compute the hash strings for all graphs which are isomorphic to it. The isomorphic hash string which is alphabetically (technically lexicographically) largest is called the "Canonical Hash", and the graph which produced it is called the "Canonical Isomorph", or "Canonical Labelling".

    With this, to check if any two graphs are isomorphic you just need to check if their canonical isomporphs (or canonical labellings) are equal (ie are automorphs of each other). Wow jargon! Unfortuntately this is even more confusing without the jargon :-(

    The hash function we are going to use is called i(G) for a graph G: build a binary string by looking at every pair of vertices in G (in order of vertex label) and put a "1" if there is an edge between those two vertices, a "0" if not. This way the j-th bit in i(G) represents the presense of absence of that edge in the graph.

    McKay ’ s Canonical Graph Labeling Algorithm

    The problem is that for a graph on n vertices, there are O( n! ) possible isomorphic hash strings based on how you label the vertices, and many many more if we have to compute the same string multiple times (ie automorphs). In general we have to compute every isomorph hash string in order to find the biggest one, there's no magic sort-cut. McKay's algorithm is a search algorithm to find this canonical isomoprh faster by pruning all the automorphs out of the search tree, forcing the vertices in the canonical isomoprh to be labelled in increasing degree order, and a few other tricks that reduce the number of isomorphs we have to hash.

    (1) Sect 4: the first step of McKay's is to sort vertices according to degree, which prunes out the majority of isomoprhs to search, but is not guaranteed to be a unique ordering since there may be more than one vertex of a given degree. For example, the following graph has 6 vertices; verts {1,2,3} have degree 1, verts {4,5} have degree 2 and vert {6} has degree 3. It's partial ordering according to vertex degree is {1,2,3|4,5|6}.

    vertex degree example

    (2) Sect 5: Impose artificial symmetry on the vertices which were not distinguished by vertex degree; basically we take one of the groups of vertices with the same degree, and in turn pick one at a time to come first in the total ordering (fig. 2 in the paper), so in our example above, the node {1,2,3|4,5|6} would have children { {1|2,3|4,5|6}, {2|1,3|4,5|6}}, {3|1,2|4,5|6}} } by expanding the group {1,2,3} and also children { {1,2,3|4|5|6}, {1,2,3|5|4|6} } by expanding the group {4,5}. This splitting can be done all the way down to the leaf nodes which are total orderings like {1|2|3|4|5|6} which describe a full isomorph of G. This allows us to to take the partial ordering by vertex degree from (1), {1,2,3|4,5|6}, and build a tree listing all candidates for the canonical isomorph -- which is already a WAY fewer than n! combinations since, for example, vertex 6 will never come first. Note that McKay evaluates the children in a depth-first way, starting with the smallest group first, this leads to a deeper but narrower tree which is better for online pruning in the next step. Also note that each total ordering leaf node may appear in more than one subtree, there's where the pruning comes in!

    (3) Sect. 6: While searching the tree, look for automorphisms and use that to prune the tree. The math here is a bit above me, but I think the idea is that if you discover that two nodes in the tree are automorphisms of each other then you can safely prune one of their subtrees because you know that they will both yield the same leaf nodes.

    I have only given a high-level description of McKay's, the paper goes into a lot more depth in the math, and building an implementation will require an understanding of this math. Hopefully I've given you enough context to either go back and re-read the paper, or read the source code of the implementation.

    0 讨论(0)
  • 2020-12-30 03:57

    This is an interesting question which I do not have an answer for! Here is my two cents:

    By 15M do you mean 15 MILLION undirected graphs? How big is each one? Any properties known about them (trees, planar, k-trees)?

    Have you tried minimizing the number of checks by detecting false positives in advance? Something includes computing and comparing numbers such as vertices, edges degrees and degree sequences? In addition to other heuristics to test whether a given two graphs are NOT isomorphic. Also, check nauty. It may be your way to check them (and generate canonical ordering).

    0 讨论(0)
  • 2020-12-30 03:59

    This is indeed an interesting problem.

    I would approach it from the adjacency matrix angle. Two isomorphic graphs will have adjacency matrices where the rows / columns are in a different order. So my idea is to compute for each graph several matrix properties which are invariant to row/column swaps, off the top of my head:

    numVerts, min, max, sum/mean, trace (probably not useful if there are no reflexive edges), norm, rank, min/max/mean column/row sums, min/max/mean column/row norm

    and any pair of isomorphic graphs will be the same on all properties.

    You could make a hash function which takes in a graph and spits out a hash string like

    string hashstr = str(numVerts)+str(min)+str(max)+str(sum)+...
    

    then sort all graphs by hash string and you only need to do full isomorphism checks for graphs which hash the same.

    Given that you have 15 million graphs on 36 nodes, I'm assuming that you're dealing with weighted graphs, for unweighted undirected graphs this technique will be way less effective.

    0 讨论(0)
  • 2020-12-30 04:01

    If all your graphs are hypercubes (like you said), then this is trivial: All hypercubes with the same dimension are isomorphic, hypercubes with different dimension aren't. So run through your collection in linear time and throw each graph in a bucket according to its number of nodes (for hypercubes: different dimension <=> different number of nodes) and be done with it.

    0 讨论(0)
  • 2020-12-30 04:01

    Maybe you can just use McKay's implementation? It is found here now: http://pallini.di.uniroma1.it/

    You can convert your 15M graphs to the compact graph6 format (or sparse6) which nauty uses and then run the nauty tool labelg to generate the canonical labels (also in graph6 format).

    For example - removing isomorphic graphs from a set of random graphs:

    #gnp.py
    import networkx as nx
    for i in range(100000):
        graph = nx.gnp_random_graph(10,0.1)
        print nx.generate_graph6(graph,header=False)
    
    
    [nauty25r9]$ python gnp.py > gnp.g6
    [nauty25r9]$ cat gnp.g6 |./labelg |sort |uniq -c |wc -l
    >A labelg
    >Z  10000 graphs labelled from stdin to stdout in 0.05 sec.
    710
    
    0 讨论(0)
  • 2020-12-30 04:11

    since you mentioned that testing smaller groups of ~300k graphs can be checked for isomorphy I would try to split the 15M graphs into groups of ~300k nodes and run the test for isomorphy on each group

    say: each graph Gi := VixEi (Vertices x Edges)

    (1) create buckets of graphs such that the n-th bucket contains only graphs with |V|=n

    (2) for each bucket created in (1) create subbuckets such that the (n,m)-th subbucket contains only graphs such that |V|=n and |E|=m

    (3) if the groups are still too large, sort the nodes within each graph by their degrees (meaning the nr of edges connected to the node), create a vector from it and distribute the graphs by this vector

    example for (3): assume 4 nodes V = {v1, v2, v3, v4}. Let d(v) be v's degree with d(v1)=3, d(v2)=1, d(v3)=5, d(v4)=4, then find < := transitive hull ( { (v2,v1), (v1,v4), (v4,v3) } ) and create a vector depening on the degrees and the order which leaves you with

    (1,3,4,5) = (d(v2), d(v1), d(v4), d(v3)) = d( {v2, v1, v4, v3} ) = d(<)

    now you have divided the 15M graphs into buckets where each bucket has the following characteristics:

    • n nodes
    • m edges
    • each graph in the group has the same 'out-degree-vector'

    I assume this to be fine grained enough if you are expecting not to find too many isomorphisms

    cost so far: O(n) + O(n) + O(n*log(n))

    (4) now, you can assume that members inside each bucket are likely to be isomophic. you can run your isomorphic-check on the bucket and only need to compare the currently tested graph against all representants you have already found within this bucket. by assumption there shouldn't be too many, so I assume this to be quite cheap.

    at step 4 you also can happily distribute the computation to several compute nodes, which should really speed up the process

    0 讨论(0)
提交回复
热议问题