How to aggregate matching pairs into “connected components” in Python

Submitted by 馋奶兔 on 2019-11-30 00:25:30
Gerard Rozsavolgyi

With NetworkX:

import networkx as nx

G1 = nx.Graph()
G1.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("f", "g")])
sorted(nx.connected_components(G1), key=len, reverse=True)

giving:

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

You may still want to benchmark to find the fastest approach for your data ...

OP:

This works great! I now have this in my PostgreSQL database: just organize the pairs into a two-column table, then use array_agg() to pass them to the PL/Python function get_connected(). Thanks.

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from(zip(lhs, rhs))
    # connected_components() yields sets in NetworkX 2.x; convert each
    # component to a list so it maps onto SETOF text[]
    return sorted((list(c) for c in nx.connected_components(G)),
                  key=len, reverse=True)

$BODY$ LANGUAGE plpythonu;
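The call from SQL might then look like this; the table name `edge_pairs` and its columns are illustrative, not part of the answer above:

```sql
-- Aggregate the two columns into parallel arrays and hand them
-- to the PL/Python function in a single call
SELECT get_connected(array_agg(lhs), array_agg(rhs))
FROM edge_pairs;
```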

(Note: I edited answer, as I thought showing this step might be helpful addendum, but too long for a comment.)

I don't believe (correct me if I'm wrong) that this is directly related to the largest clique problem. The definition of a clique (Wikipedia) says that a clique "in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge". In this case, we instead want to find which nodes can reach each other, even indirectly; the chain a-b-c-d is one connected component but not a clique, since a and d share no edge.

I made a little sample. It builds a graph and traverses it looking for neighbors. This should be pretty efficient since each node is only traversed once when groups are formed.

from collections import defaultdict

def get_cliques(pairs):
    # Build a graph using the pairs
    nodes = defaultdict(list)
    for a, b in pairs:
        if b is not None:
            nodes[a].append((b, nodes[b]))
            nodes[b].append((a, nodes[a]))
        else:
            nodes[a]  # touch the defaultdict so lone nodes get an (empty) entry

    # Add all neighbors to the same group    
    visited = set()
    def _build_group(key, group):
        if key in visited:
            return
        visited.add(key)
        group.add(key)
        for key, _ in nodes[key]:
            _build_group(key, group)

    groups = []
    for key in nodes.keys():
        if key in visited: continue
        groups.append(set())
        _build_group(key, groups[-1])

    return groups

if __name__ == '__main__':
    pairs = [
        ('a', 'b'), ('b', 'c'), ('b', 'd'), # a "tree"
        ('f', None),                        # no relations
        ('h', 'i'), ('i', 'j'), ('j', 'h')  # circular
    ]
    print(get_cliques(pairs))
    # Output: [set(['a', 'c', 'b', 'd']), set(['f']), set(['i', 'h', 'j'])]
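One caveat: `_build_group` recurses once per node, so a long chain of pairs can exceed Python's default recursion limit (roughly 1000 frames). An equivalent traversal with an explicit stack avoids that; here is a sketch of the whole function (the name `get_groups_iterative` is mine):

```python
from collections import defaultdict

def get_groups_iterative(pairs):
    # Same adjacency construction as above, but the traversal uses an
    # explicit stack instead of recursion.
    nodes = defaultdict(set)
    for a, b in pairs:
        if b is not None:
            nodes[a].add(b)
            nodes[b].add(a)
        else:
            nodes[a]  # register a lone node

    visited = set()
    groups = []
    for start in nodes:
        if start in visited:
            continue
        group = set()
        stack = [start]
        while stack:
            key = stack.pop()
            if key in visited:
                continue
            visited.add(key)
            group.add(key)
            stack.extend(nodes[key])
        groups.append(group)
    return groups

pairs = [
    ('a', 'b'), ('b', 'c'), ('b', 'd'),  # a "tree"
    ('f', None),                         # no relations
    ('h', 'i'), ('i', 'j'), ('j', 'h'),  # circular
]
print(get_groups_iterative(pairs))
# One set per component: {a, b, c, d}, {f}, {h, i, j}
```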

If your data set is best modeled like a graph and really big, maybe a graph database such as Neo4j is appropriate?

DSM's comment made me look for set consolidation algorithms in Python. Rosetta Code has two versions of the same algorithm. Example use (the non-recursive version):

pairs = [
    ('a', 'b'), ('b', 'c'), ('b', 'd'),  # same sample input as above
    ('f', None),
    ('h', 'i'), ('i', 'j'), ('j', 'h')
]

# Copied from Rosetta Code
def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

print(consolidate([set(pair) for pair in pairs]))
# Output: [set(['a', 'c', 'b', 'd']), set([None, 'f']), set(['i', 'h', 'j'])]

I tried an alternate implementation using a dictionary as a lookup table, and measured a small reduction in running time.

# Modified to use a dictionary
from collections import defaultdict

def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed.values()
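One caveat with get_cliques2: a pair that bridges two previously separate groups, e.g. ('a', 'b'), ('c', 'd'), then ('b', 'c'), relabels only its own endpoints, so the older group's other members are left behind. A disjoint-set (union-find) structure handles that case correctly; this is a rough sketch, not a tuned implementation, and the function name is mine:

```python
def get_components_unionfind(pairs):
    # parent[x] points toward x's representative; path compression
    # keeps repeated lookups cheap
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Group members under their final representative
    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

# A pair that bridges two existing groups is merged correctly:
print(get_components_unionfind([('a', 'b'), ('c', 'd'), ('b', 'c')]))
# a single group containing a, b, c and d
```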

And just to convince myself that it returns the right result (get_cliques1 here is your original Python 2 solution):

>>> from cliques import *
>>> get_cliques1(pairs) # Original Python 2 solution
[Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])]
>>> get_cliques2(pairs) # Dictionary-based alternative
[['a', 'c', 'b', 'e', 'd'], ['g', 'f']]

Timing info in seconds (with 10 million repetitions):

$ python get_times.py 
get_cliques: 75.1285209656
get_cliques2: 69.9816100597

For the sake of completeness and reference, this is the full listing of both cliques.py and the get_times.py timing script:

# cliques.py
# Python 2.7
from collections import defaultdict
from sets import Set  # I moved your import out of the function to try to get closer to apples-apples

# Original Python 2 solution
def get_cliques1(pairs):

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

# Modified to use a dictionary
def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)  # note: shadows the builtin reversed(), but only locally
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed.values()

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]


# get_times.py
# Python 2.7
from timeit import timeit

REPS = 10000000

print "get_cliques: " + str(timeit(
  stmt='get_cliques1(pairs)', setup='from cliques import get_cliques1, pairs',
  number=REPS
))
print "get_cliques2: " + str(timeit(
  stmt='get_cliques2(pairs)', setup='from cliques import get_cliques2, pairs',
  number=REPS
))

So at least in this contrived scenario, there is a measurable speedup. It's admittedly not groundbreaking, and I'm sure I left some performance bits on the table in my implementation, but maybe it will help get you thinking about other alternatives?
