Quickly find subset of list of lists with greatest total distinct elements

Question

Given a list of lists of tuples, I would like to find the subset of lists that maximizes the number of distinct integer values without any integer being repeated.

The list looks something like this:

x = [
         [(1,2,3), (8,9,10), (15,16)],
         [(2,3), (10,11)],
         [(9,10,11), (17,18,19), (20,21,22)],
         [(4,5), (11,12,13), (18,19,20)]
    ]

The internal tuples always hold consecutive integers, e.g. (1,2,3) or (15,16), but they may be of any length.

In this case, the expected return would be:

maximized_list = [
                  [(1, 2, 3), (8, 9, 10), (15, 16)], 
                  [(4, 5), (11, 12, 13), (18, 19, 20)]
                 ]

This is valid because in each case:

Each internal list of x remains intact
There is a maximum number of distinct integers (16 in this case)
No integer is repeated.
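
The three conditions above can be checked mechanically for any candidate selection. The following is a minimal sketch, not part of the original question; the helper name is_valid_selection is hypothetical. It flattens the chosen lists and compares the number of integers with the number of distinct integers.

```python
from itertools import chain

def is_valid_selection(selection):
    """Return (is_valid, count) for a selection of whole lists of tuples.

    is_valid is True when no integer appears more than once;
    count is the total number of integers in the selection.
    """
    all_ints = list(chain.from_iterable(chain.from_iterable(selection)))
    return len(all_ints) == len(set(all_ints)), len(all_ints)

x = [
    [(1, 2, 3), (8, 9, 10), (15, 16)],
    [(2, 3), (10, 11)],
    [(9, 10, 11), (17, 18, 19), (20, 21, 22)],
    [(4, 5), (11, 12, 13), (18, 19, 20)],
]

print(is_valid_selection([x[0], x[3]]))  # (True, 16)
print(is_valid_selection([x[0], x[1]]))  # (False, 12): 2, 3 and 10 repeat
```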

If there are multiple valid solutions, all should be returned in a list.

I have a naive implementation of this, heavily based on a previous Stack Overflow question of mine that was not as well formed as it could have been (Python: Find tuples with greatest total distinct values):

import itertools

def maximize(x):
    max_ = 0
    possible_patterns = []

    # Try every combination of lists, from single lists up to all of them.
    for i in range(1, len(x) + 1):
        for combo in itertools.combinations(x, i):
            # Flatten the lists of tuples into one sequence of integers.
            all_ints = tuple(itertools.chain(*itertools.chain(*combo)))

            # Skip combinations in which any integer is repeated.
            if len(all_ints) != len(set(all_ints)):
                continue

            if len(all_ints) > max_:
                # New best: discard the earlier candidates.
                possible_patterns = [combo]
                max_ = len(all_ints)
            elif len(all_ints) == max_:
                # Ties with the current best are kept as well.
                possible_patterns.append(combo)

    return possible_patterns

The function above appears to give the correct result, but it does not scale: it enumerates every non-empty combination of lists, of which there are 2^n - 1 for n lists. I will need to accept x values with a few thousand lists (possibly tens of thousands), so an optimized algorithm is required.

Answer 1:

The following finds a maximal subset of sublists with respect to cardinality, i.e. the total number of integers it contains. It works by flattening each sublist into a set, recording for each sublist the indices of the sublists it shares an integer with, and then searching the solution space depth-first for the combination with the most elements (i.e. the largest "weight"). Note that it returns one optimal subset rather than all of them.

def maximize_distinct(sublists):
    # Flatten each sublist of tuples into a set of its integers.
    subsets = [{x for tup in sublist for x in tup} for sublist in sublists]

    def intersect(subset):
        # Indices of all sublists sharing at least one integer with `subset`.
        return {i for i, sset in enumerate(subsets) if subset & sset}

    intersections = [intersect(subset) for subset in subsets]
    weights = [len(subset) for subset in subsets]

    pool = set(range(len(subsets)))
    max_set, _ = search_max(pool, intersections, weights)
    return [sublists[i] for i in max_set]

def search_max(pool, intersections, weights):
    # Base case: nothing left to choose from.
    if not pool:
        return [], 0

    max_set = max_weight = None
    for num in pool:
        # Keep only later, non-conflicting sublists so each combination is visited once.
        next_pool = {x for x in pool - intersections[num] if x > num}
        set_ids, weight = search_max(next_pool, intersections, weights)

        if max_set is None or max_weight < weight + weights[num]:
            max_set, max_weight = [num] + set_ids, weight + weights[num]
    return max_set, max_weight
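
To make the preprocessing step concrete, here is a small sketch that builds the same per-sublist sets, conflict sets, and weights for the example x from the question. The variable name conflicts is an illustrative stand-in for the intersections list above.

```python
x = [
    [(1, 2, 3), (8, 9, 10), (15, 16)],
    [(2, 3), (10, 11)],
    [(9, 10, 11), (17, 18, 19), (20, 21, 22)],
    [(4, 5), (11, 12, 13), (18, 19, 20)],
]

# One set of integers per sublist.
subsets = [{v for tup in sublist for v in tup} for sublist in x]

# conflicts[i] holds the indices of every sublist that shares an integer
# with sublist i (a non-empty sublist always conflicts with itself).
conflicts = [
    {j for j, other in enumerate(subsets) if s & other} for s in subsets
]

# The weight of a sublist is the number of integers it contributes.
weights = [len(s) for s in subsets]

print(weights)  # [8, 4, 9, 8]
# conflicts == [{0, 1, 2}, {0, 1, 2, 3}, {0, 1, 2, 3}, {1, 2, 3}]
```

Sublists 0 and 3 do not appear in each other's conflict sets, which is why they can be combined for a total weight of 8 + 8 = 16.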

This code can be optimized further by keeping a running total of the "weights" (sum of cardinalities of sublists) discarded, and pruning a branch of the search space when that total exceeds the discard weight of the best solution found so far (which is the minimal discard weight so far). Unless you run into performance problems, however, this will likely be more work than it is worth, and for a small list of lists the overhead of the bookkeeping will exceed the speedup from pruning.
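
One possible shape for that pruning is sketched below. It is not the answer author's code: the names maximize_distinct_pruned, search, and best are hypothetical, and the search is restructured into take/skip branches. Instead of tracking discarded weight directly, it bounds each branch by the weight chosen so far plus the weight of everything still available, which is the same test seen from the other side (discarded weight = total weight minus that bound). It also prunes ties, so like the code above it returns a single optimal subset.

```python
def maximize_distinct_pruned(sublists):
    subsets = [{v for tup in sublist for v in tup} for sublist in sublists]
    conflicts = [
        {j for j, other in enumerate(subsets) if s & other} for s in subsets
    ]
    weights = [len(s) for s in subsets]
    best = {"weight": -1, "ids": []}

    def search(candidates, chosen, weight):
        # Optimistic bound: pretend every remaining candidate can be taken.
        # If even that cannot beat the best solution so far, prune the branch.
        if weight + sum(weights[i] for i in candidates) <= best["weight"]:
            return
        if not candidates:
            # Not pruned and nothing left to decide: strictly better solution.
            best["weight"], best["ids"] = weight, list(chosen)
            return

        num, rest = candidates[0], candidates[1:]

        # Branch 1: take sublist `num` and drop everything it conflicts with.
        chosen.append(num)
        search([i for i in rest if i not in conflicts[num]],
               chosen, weight + weights[num])
        chosen.pop()

        # Branch 2: skip sublist `num`, i.e. discard its weight.
        search(rest, chosen, weight)

    search(list(range(len(subsets))), [], 0)
    return [sublists[i] for i in best["ids"]]


x = [
    [(1, 2, 3), (8, 9, 10), (15, 16)],
    [(2, 3), (10, 11)],
    [(9, 10, 11), (17, 18, 19), (20, 21, 22)],
    [(4, 5), (11, 12, 13), (18, 19, 20)],
]
result = maximize_distinct_pruned(x)
print(result == [x[0], x[3]])  # True
```

The worst case is still exponential, and the recursion depth grows with the number of sublists, so for inputs with thousands of lists an iterative version or a raised recursion limit would likely be needed.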

Source: https://stackoverflow.com/questions/54725354/quickly-find-subset-of-list-of-lists-with-greatest-total-distinct-elements

Tags

python

algorithm

optimization