Find sets that contain at least one element from other sets

可紊 submitted on 2020-06-29 05:07:45

Question


Suppose we are given n sets and want to construct all minimal sets that have at least one element in common with each of the input sets. A set that satisfies this condition is called admissible. An admissible set S is called minimal if there is no admissible set S' that is a proper subset of S.

An example:

In: s1 = {1, 2, 3}; s2 = {3, 4, 5}; s3 = {5, 6}

Out: [{1, 4, 6}, {1, 5}, {2, 4, 6}, {2, 5}, {3, 5}, {3, 6}]
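For small inputs, the expected output can be checked with a brute-force reference. The following is only a sketch for validating other implementations, not an efficient solution; the function name minimal_hitting_sets is made up for this illustration.

```python
from itertools import combinations

def minimal_hitting_sets(sets):
    """Brute-force reference: enumerate candidates by increasing size."""
    universe = sorted(set().union(*sets))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            # Admissible: shares at least one element with every input set.
            if not all(candidate & s for s in sets):
                continue
            # Not minimal if a smaller admissible set found earlier is inside it.
            if any(prev < candidate for prev in found):
                continue
            found.append(candidate)
    return found

result = minimal_hitting_sets([{1, 2, 3}, {3, 4, 5}, {5, 6}])
print(sorted(sorted(s) for s in result))
# [[1, 4, 6], [1, 5], [2, 4, 6], [2, 5], [3, 5], [3, 6]]
```

Because candidates are visited in order of increasing size, every admissible proper subset of a candidate has already been examined, so comparing against the sets found so far is enough to decide minimality.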

My idea was to iteratively add one set after the other to the solution:

result = f(s1, f(s2, f(s3, ...)))

whereby f is a merge function that could look as follows:

function f(newSet, setOfSets):
   Step 1: 
      return all elements of setOfSets that share an element with newSet

   Step 2: 
      for each remaining element setE of setOfSets:
         for each element e of newSet:
            return union(setE, {e})

The issue with the above approach is that the Cartesian product computed in step 2 may contain supersets of sets returned in step 1. I was thinking of going through all already returned sets (see Find minimal set of subsets that covers a given set), but this seems to be too complicated and inefficient, and I hope that there is a better solution in my special case.

How could I achieve the goal without determining the full Cartesian product in step 2?

Note that this question is related to the question of finding the smallest set only, but I need to find all sets that are minimal in the way specified above. I am aware that the number of solutions will not be polynomial.

The number n of input sets will be several hundred, but the sets contain only elements from a limited range (e.g. about 20 different values), which also limits the sets' sizes. It would be acceptable if the algorithm runs in O(n^2), but it should be basically linear (maybe with a log multiplier) in the number of output sets.


Answer 1:


Since your space is so constrained -- only 20 values from which to choose -- beat this thing to death with a blunt instrument:

  1. Convert each of your target sets (the ones to be covered) to a bit-map. In your given case, this will correspond to an integer of 20 bits, one bit position for each of the 20 values.
  2. Create a list of candidate covering bitmaps, the integers 0 through (2^20-1)
  3. Take the integers in order. Use bit operations to determine whether each target set has a 1 bit in common with the candidate. If all satisfy the basic condition, the candidate is validated.
  4. When you validate a candidate, remove all super-set integers from the list of candidates.
  5. When you run out of candidates, your validated candidates are the desired collection. In the code below, I simply print each as it is identified.

Code:

from time import time
start = time()

s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}

# Convert each set to its bit-map: value v sets bit v-1,
# so s1 -> 7, s2 -> 28, s3 -> 48
point_set = [sum(1 << (v - 1) for v in s) for s in (s1, s2, s3)]

# make list of all possible covering bitmaps
cover = list(range(2**20))

while cover:
    # Pop any item from remaining covering sets
    candidate = cover.pop(0)
    # Does this bitmap have a bit in common with each target set?
    if all((candidate & point) for point in point_set):
        print(candidate)

        # Remove all candidates that are supersets of the successful covering one.
        superset = set([other for other in cover if (candidate & ~other) == 0])
        cover = [item for item in cover if item not in superset]
        print(time() - start, "lag time")

print(time() - start, "seconds")

Output -- I have not converted the candidate integers back to their constituent elements. This is a straightforward task.
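A minimal sketch of that conversion, assuming the same mapping as in the code above (value v corresponds to bit v-1, so {1, 2, 3} becomes 7). The helper names set_to_bitmap and bitmap_to_set are made up for this illustration.

```python
def set_to_bitmap(values):
    """Encode a set of positive integers; value v sets bit v-1."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << (v - 1)
    return bitmap

def bitmap_to_set(bitmap):
    """Decode a bitmap back into the set of values it represents."""
    return {pos + 1 for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1}

print(set_to_bitmap({1, 2, 3}))  # 7
print(bitmap_to_set(41))         # {1, 4, 6}
```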

Note that most of the time in this example is spent in exhausting the list of integers that were not supersets of a validated cover set, such as all multiples of 64 (their lower 6 bits are all zero, so they are disjoint from every target set).

The 33 seconds were measured on my aging desktop computer; your laptop or other platform is almost certainly faster. I trust that any improvement from a more efficient algorithm is easily offset by the fact that this algorithm is quick to implement and easy to understand.

17
0.4029195308685303 lag time
18
0.6517734527587891 lag time
20
0.8456630706787109 lag time
36
1.0555419921875 lag time
41
1.2604553699493408 lag time
42
1.381387710571289 lag time
33.005757570266724 seconds
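One possible way to avoid maintaining the candidate list at all is to test minimality directly: a covering bitmap is minimal exactly when clearing any single one of its bits breaks the cover. The sketch below is a variation on the approach above, not the code that produced the timings; the function name minimal_covers and the n_bits parameter (the number of bit positions actually in use) are assumptions made for this illustration.

```python
def minimal_covers(targets, n_bits):
    """Yield every minimal bitmap that shares a 1 bit with each target bitmap."""
    def hits_all(candidate):
        return all(candidate & t for t in targets)

    for candidate in range(1, 1 << n_bits):
        if not hits_all(candidate):
            continue
        # Minimal iff dropping any single bit breaks the cover.
        remaining = candidate
        minimal = True
        while remaining:
            low_bit = remaining & -remaining  # isolate the lowest set bit
            if hits_all(candidate ^ low_bit):
                minimal = False
                break
            remaining ^= low_bit
        if minimal:
            yield candidate

print(list(minimal_covers([7, 28, 48], n_bits=6)))
# [17, 18, 20, 36, 41, 42]
```

The single-bit test is sufficient because every superset of a cover is itself a cover: if some proper subset covers all targets, then so does the candidate with just one of the extra bits removed.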



Answer 2:


I have come up with a solution based on the trie data structure. Tries make it relatively fast to determine whether one of the stored sets is a subset of another given set (Savnik, 2013).

The solution then looks as follows:

  • Create a trie
  • Iterate through the given sets
    • In each iteration, go through the sets in the trie and check whether each one is disjoint from the new set.
    • If a stored set shares an element with the new set, keep it; if it is disjoint, remove it and add its extensions by each element of the new set, unless an extension is a superset of a set already in the trie.

The worst-case runtime is O(n m c), where m is the maximal number of solutions if we consider only n' <= n of the input sets, and c is the time factor from the subset lookups.
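The incremental idea can be sketched in pure Python without a trie. This is only an illustration under the assumption that a linear scan over the kept sets replaces the trie's subset lookup, so it is slower than the trie-based code; the function name cover_sets_py is made up here.

```python
def cover_sets_py(sets):
    """Incrementally build all minimal sets that hit every input set."""
    sets = [frozenset(s) for s in sets]
    if not sets:
        return set()
    # Minimal solutions for the first input set alone are its singletons.
    solutions = {frozenset([e]) for e in sets[0]}
    for new_set in sets[1:]:
        # Stored sets that already hit new_set stay as they are.
        kept = {s for s in solutions if s & new_set}
        disjoint = solutions - kept
        extended = set()
        for s in disjoint:
            for e in new_set:
                candidate = s | {e}
                # Skip candidates that contain an already kept solution.
                if not any(k <= candidate for k in kept):
                    extended.add(candidate)
        solutions = kept | extended
    return solutions

print(sorted(sorted(s) for s in cover_sets_py([{1, 2, 3}, {3, 4, 5}, {5, 6}])))
# [[1, 4, 6], [1, 5], [2, 4, 6], [2, 5], [3, 5], [3, 6]]
```

New candidates only need to be compared against the kept sets: two extensions of different disjoint sets cannot contain each other, because the stored sets never contain one another and each extension adds exactly one element of the new set.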

The code is below. I have implemented the algorithm based on the Python package datrie, which is a wrapper around an efficient C implementation of a trie. The code below is in Cython but can be converted to pure Python easily by removing or exchanging the Cython-specific commands.

The extended trie implementation:

from math import inf  # default value of the maxSize arguments below
from datrie cimport BaseTrie, BaseState, BaseIterator

cdef bint has_subset_c(BaseTrie trie, BaseState trieState, str setarr, 
                        int index, int size):
    cdef BaseState trieState2 = BaseState(trie)
    cdef int i
    trieState.copy_to(trieState2)
    for i in range(index, size):
        if trieState2.walk(setarr[i]):
            if trieState2.is_terminal() or has_subset_c(trie, trieState2, setarr, 
                                                        i, size): 
                return True
            trieState.copy_to(trieState2)
    return False


cdef class SetTrie():
    def __init__(self, alphabet, initSet=[]):
        if not hasattr(alphabet, "__iter__"):
            alphabet = range(alphabet)
        self.trie = BaseTrie("".join(chr(i) for i in alphabet))
        self.touched = False
        for i in initSet:
            self.trie[chr(i)] = 0
            if not self.touched:
                self.touched = True

    def has_subset(self, superset):
        cdef BaseState trieState = BaseState(self.trie)
        setarr = "".join(chr(i) for i in superset)
        return bool(has_subset_c(self.trie, trieState, setarr, 0, len(setarr)))

    def extend(self, sets):
        for s in sets:
            self.trie["".join(chr(i) for i in s)] = 0
            if not self.touched:
                self.touched = True

    def delete_supersets(self):
        cdef str elem 
        cdef BaseState trieState = BaseState(self.trie)
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        if trieIter.next():
            elem = trieIter.key()
            while trieIter.next():
                self.trie._delitem(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, 0)
                elem = trieIter.key()
            if has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                val = self.trie.pop(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, val)


    def update_by_settrie(self, SetTrie setTrie, maxSize=inf, initialize=True):
        cdef BaseIterator trieIter = BaseIterator(BaseState(setTrie.trie))
        cdef str s
        if initialize and not self.touched and trieIter.next():
            for s in trieIter.key():
                self.trie._setitem(s, 0)
            self.touched = True

        while trieIter.next():
            self.update(set(trieIter.key()), maxSize, True)

    def update(self, otherSet, maxSize=inf, isStrSet=False):
        if not isStrSet:
            otherSet = set(chr(i) for i in otherSet)
        cdef str subset, newSubset, elem
        cdef list disjointList = []
        cdef BaseTrie trie = self.trie
        cdef int l
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        if trieIter.next():
            subset = trieIter.key()
            while trieIter.next():
                if otherSet.isdisjoint(subset):
                    disjointList.append(subset)
                    trie._delitem(subset)
                subset = trieIter.key()
            if otherSet.isdisjoint(subset):
                disjointList.append(subset)
                trie._delitem(subset)

        cdef BaseState trieState = BaseState(self.trie)
        for subset in disjointList:
            l = len(subset)
            if l < maxSize:
                if l+1 > self.maxSizeBound:
                    self.maxSizeBound = l+1
                for elem in otherSet:
                    newSubset = subset + elem
                    trieState.rewind()
                    if not has_subset_c(self.trie, trieState, newSubset, 0, 
                                        len(newSubset)):
                        trie[newSubset] = 0

    def get_frozensets(self):
        return (frozenset(ord(t) for t in subset) for subset in self.trie)

    def clear(self):
        self.touched = False
        self.trie.clear()

    def prune(self, maxSize):
        cdef bint changed = False
        cdef BaseIterator trieIter 
        cdef str k
        if self.maxSizeBound > maxSize:
            self.maxSizeBound = maxSize
            trieIter = BaseIterator(BaseState(self.trie))
            k = ''
            while trieIter.next():
                if len(k) > maxSize:
                    self.trie._delitem(k)
                    changed = True
                k = trieIter.key()
            if len(k) > maxSize:
                self.trie._delitem(k)
                changed = True
        return changed

    def __nonzero__(self):
        return self.touched

    def __repr__(self):
        return str([set(ord(t) for t in subset) for subset in self.trie])

This can be used as follows:

def cover_sets(sets):
    # Seed the trie with one singleton per element of the first set
    strie = SetTrie(range(10), sets[0])
    for s in sets[1:]:
        strie.update(s)
    return strie.get_frozensets()

Timing:

from timeit import timeit
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
%timeit cover_sets([s1, s2, s3])

Result:

37.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
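A minimal sketch of such an index shift, assuming an offset of 1 so that element 0 is stored as chr(1). The names SHIFT, encode and decode are made up for this illustration, and the trie's alphabet would have to be shifted by the same offset.

```python
SHIFT = 1  # assumed offset: element 0 is stored as chr(1) instead of chr(0)

def encode(elements, shift=SHIFT):
    """Build a trie key from a set of non-negative integers."""
    return "".join(chr(e + shift) for e in sorted(elements))

def decode(key, shift=SHIFT):
    """Recover the original integers from a trie key."""
    return frozenset(ord(ch) - shift for ch in key)

key = encode({0, 2, 5})
print(decode(key) == frozenset({0, 2, 5}))  # True
```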



Source: https://stackoverflow.com/questions/62058214/find-sets-that-contain-at-least-one-element-from-other-sets
