A set union find algorithm

前端未结

关注

 5  560

I have thousands of lines of 1 to 100 numbers, every line define a group of numbers and a relationship among them. I need to get the sets of related numbers.

Little

相关标签:

5条回答

太阳男子

2020-12-03 02:05

As Jim pointed out above, you are essentially looking for the connected components of a simple undirected graph where the nodes are your entities (T1, T2 and so), and edges represent the pairwise relations between them. A simple implementation for connected component search is based on the breadth-first search: you start a BFS from the first entity, find all the related entities, then start another BFS from the first yet unfound entity and so on, until you have found them all. A simple implementation of BFS looks like this:

class BreadthFirstSearch(object):
    """Breadth-first search implementation using an adjacency list"""

    def __init__(self, adj_list):
        self.adj_list = adj_list

    def run(self, start_vertex):
        """Runs a breadth-first search from the given start vertex and
        yields the visited vertices one by one."""
        queue = deque([start_vertex])
        visited = set([start_vertex])
        adj_list = self.adj_list

        while queue:
            vertex = queue.popleft()
            yield vertex
            unseen_neis = adj_list[vertex]-visited
            visited.update(unseen_neis)
            queue.extend(unseen_neis)

def connected_components(graph):
    seen_vertices = set()
    bfs = BreadthFirstSearch(graph)
    for start_vertex in graph:
        if start_vertex in seen_vertices:
            continue
        component = list(bfs.run(start_vertex))
        yield component
        seen_vertices.update(component)

Here, adj_list or graph is an adjacency list data structure, basically it gives you the neighbours of a given vertex in the graph. To build it from your file, you can do this:

adj_list = defaultdict(set)
for line in open("your_file.txt"):
    parts = line.strip().split()
    v1 = parts.pop(0)
    adj_list[v1].update(parts)
    for v2 in parts:
        adj_list[v2].add(v1)

Then you can run:

components = list(connected_components(adj_list))

Of course, implementing the whole algorithm in pure Python tends to be slower than an implementation in C with a more efficient graph data structure. You might consider using igraph or some other graph library like NetworkX to do the job instead. Both libraries contain implementations for connected component search; in igraph, it boils down to this (assuming that your file does not contain lines with single entries, only pairwise entries are accepted):

>>> from igraph import load
>>> graph = load("edge_list.txt", format="ncol", directed=False)
>>> components = graph.clusters()
>>> print graph.vs[components[0]]["name"]
['T1', 'T2', 'T6']
>>> print graph.vs[components[1]]["name"]
['T3', 'T4', 'T5']

Disclaimer: I am one of the authors of igraph

0 讨论(0)

予麋鹿

2020-12-03 02:11

Once you have built the data structure, exactly what queries do you want to run against it? Show us your existing code. What is a T(x)? You talk about "groups of numbers" but your sample data shows T1, T2, etc; please explain.

Have you read this: http://en.wikipedia.org/wiki/Disjoint-set_data_structure

Try looking at this Python implementation: http://code.activestate.com/recipes/215912-union-find-data-structure/

OR you can lash up something rather simple and understandable yourself, e.g.

[Update: totally new code]

class DisjointSet(object):

    def __init__(self):
        self.leader = {} # maps a member to the group's leader
        self.group = {} # maps a group leader to the group (which is a set)

    def add(self, a, b):
        leadera = self.leader.get(a)
        leaderb = self.leader.get(b)
        if leadera is not None:
            if leaderb is not None:
                if leadera == leaderb: return # nothing to do
                groupa = self.group[leadera]
                groupb = self.group[leaderb]
                if len(groupa) < len(groupb):
                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                groupa |= groupb
                del self.group[leaderb]
                for k in groupb:
                    self.leader[k] = leadera
            else:
                self.group[leadera].add(b)
                self.leader[b] = leadera
        else:
            if leaderb is not None:
                self.group[leaderb].add(a)
                self.leader[a] = leaderb
            else:
                self.leader[a] = self.leader[b] = a
                self.group[a] = set([a, b])

data = """T1 T2
T3 T4
T5 T1
T3 T6
T7 T8
T3 T7
T9 TA
T1 T9"""
# data is chosen to demonstrate each of 5 paths in the code
from pprint import pprint as pp
ds = DisjointSet()
for line in data.splitlines():
    x, y = line.split()
    ds.add(x, y)
    print
    print x, y
    pp(ds.leader)
    pp(ds.group)

and here is the output from the last step:

T1 T9
{'T1': 'T1',
 'T2': 'T1',
 'T3': 'T3',
 'T4': 'T3',
 'T5': 'T1',
 'T6': 'T3',
 'T7': 'T3',
 'T8': 'T3',
 'T9': 'T1',
 'TA': 'T1'}
{'T1': set(['T1', 'T2', 'T5', 'T9', 'TA']),
 'T3': set(['T3', 'T4', 'T6', 'T7', 'T8'])}

0 讨论(0)

遇见更好的自我

2020-12-03 02:16

You can use a union find data structure to achieve this goal.

The pseudo code for such an algorithm is as follows:

func find( var element )
    while ( element is not the root ) element = element's parent
    return element
end func

func union( var setA, var setB )
    var rootA = find( setA ), rootB = find( setB )
    if ( rootA is equal to rootB ) return
    else
        set rootB as rootA's parent
end func

(Taken from http://www.algorithmist.com/index.php/Union_Find)

0 讨论(0)

闹比i

2020-12-03 02:23

Treat your numbers T1, T2, etc. as graph vertices. Any two numbers appearing together on a line are joined by an edge. Then your problem amounts to finding all the connected components in this graph. You can do this by starting with T1, then doing a breadth-first or depth-first search to find all vertices reachable from that point. Mark all these vertices as belonging to equivalence class T1. Then find the next unmarked vertex Ti, find all the yet-unmarked nodes reachable from there, and label them as belonging to equivalence class Ti. Continue until all the vertices are marked.

For a graph with n vertices and e edges, this algorithm requires O(e) time and space to build the adjacency lists, and O(n) time and space to identify all the connected components once the graph structure is built.

0 讨论(0)
发布评论:

提交评论
- 加载中...

夕颜

2020-12-03 02:28

You can model a group using a set. In the example below, I've put the set into a Group class to make it easier to keep references to them and to track some notional 'head' item.

class Group:
    def __init__(self,head):
        self.members = set()
        self.head = head
        self.add(head)
    def add(self,member):
        self.members.add(member)
    def union(self,other):
        self.members = other.members.union(self.members)

groups = {}

for line in open("sets.dat"):
    line = line.split()
    if len(line) == 0:
        break
    # find the group of the first item on the row
    head = line[0]
    if head not in groups:
        group = Group(head)
        groups[head] = group
    else:
        group = groups[head]
    # for each other item on the row, merge the groups
    for node in line[1:]:
        if node not in groups:
            # its a new node, straight into the group
            group.add(node)
            groups[node] = group
        elif head not in groups[node].members:
            # merge two groups
            new_members = groups[node]
            group.union(new_members)
            for migrate in new_members.members:
                groups[migrate] = group
# list them
for k,v in groups.iteritems():
    if k == v.head:
        print v.members

Output is:

set(['T6', 'T2', 'T1'])
set(['T4', 'T5', 'T3'])

0 讨论(0)