How to sample from Cartesian product without repetition

心在旅途 2020-12-18 13:41

I have a list of sets, and I wish to sample n different samples each containing an item from each set. What I do not want is to have it in order, so, for example, I will ge

5 answers
  • 2020-12-18 13:54

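    You can use sample from the random module. On Python 3.11 and later it only accepts sequences, so each set is converted to a tuple first:

    import random
    [[random.sample(tuple(x), 1)[0] for x in list_of_sets] for _ in range(n)]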
    

    for example:

    list_of_sets = [{1,2,3}, {4,5,6}, {1,4,7}]
    n = 3
    

    A possible output will be:

    [[2, 4, 7], [1, 4, 7], [1, 6, 1]]
    

    EDIT:

    If we want to avoid repetitions, we can use a while loop and collect the results in a set. In addition, you can check that n is valid and return the whole Cartesian product for invalid n values:

    import itertools
    import random
    from functools import reduce

    chosen = set()
    if 0 < n < reduce(lambda a, b: a * b, [len(x) for x in list_of_sets]):
        while len(chosen) < n:
            chosen.add(tuple(random.sample(tuple(x), 1)[0] for x in list_of_sets))
    else:
        chosen = itertools.product(*list_of_sets)
    
  • 2020-12-18 13:59

    Since I want no repetition, and that is sometimes impossible, the code is not that short. But as @andreyF said, random.sample does the work. There may be a better way that avoids resampling with repetition until enough distinct samples exist, but this is the best I have so far.

    import itertools
    import operator
    import random
    from functools import reduce

    def get_cart_product(list_of_sets, n=None):
        max_products_num = reduce(operator.mul, [len(cluster) for cluster in list_of_sets], 1)
        if n is not None and n < max_products_num:
            refs = set()
            while len(refs) < n:
                refs.add(tuple(random.sample(tuple(cluster), 1)[0] for cluster in list_of_sets))
            return refs
        # n is None or covers the whole product: return every combination
        return itertools.product(*list_of_sets)
    

    Note that the code assumes a list of frozen sets; otherwise, the element returned by random.sample should be converted accordingly.

  • 2020-12-18 14:03

    All of the solutions above waste a lot of resources filtering out repeated results as the iteration nears its end. That is why I came up with a method that has (almost) linear speed from start to finish.
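
    To put a rough number on that waste: assuming each draw is uniform over all N combinations (which holds when every set element is picked uniformly and independently), collecting n distinct tuples by rejection takes N/N + N/(N-1) + ... + N/(N-n+1) draws on average. The helper name expected_draws below is made up for illustration; it is only a sketch of that sum.

```python
def expected_draws(total, n):
    """Average number of uniform draws needed to see n distinct items out of total."""
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and total")
    # The (k+1)-th new item is found with probability (total - k) / total per draw,
    # so it takes total / (total - k) draws on average.
    return sum(total / (total - k) for k in range(n))


if __name__ == "__main__":
    total = 1000  # illustrative product size
    for n in (100, 500, 900, 1000):
        print(n, round(expected_draws(total, n), 1))
```

    The cost per new sample stays close to one draw while n is small compared with N and grows sharply as n approaches N.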

    The idea is: mentally assign an index to each result of the Cartesian product in standard order. For example, for AxBxC with sizes 2000x1x2 = 4000 elements, that would be:

    0: (A[0], B[0], C[0])
    1: (A[1], B[0], C[0])
    ...
    1999: (A[1999], B[0], C[0])
    2000: (A[0], B[0], C[1])
    ...
    3999: (A[1999], B[0], C[1])
    done.
    

    That leaves a few open questions:

    • How do I get a list of possible indices? Answer: Just multiply 2000*1*2=4000, and every non-negative integer below that is a valid index.
    • How do I generate random indices sequentially without repetition? There are two answers. If you want a sample of known size n, just use random.sample(range(number_of_indices), n) (xrange in Python 2). If you do not know the sample size yet (the more general case), you have to generate indices on the fly so as not to waste memory. In that case, generate index = random.randint(0, k - 1) with k = number_of_indices for the first index and k = number_of_indices - n for the nth result (counting from zero), then shift it past the indices already used. See my code below (be aware that I use a singly linked list there to store the used indices; it makes each insertion an O(1) operation, and we need a lot of insertions here).
    • How do I generate the output from the index? Answer: Say our index is i. Then i % 2000 is the index into A for the result, and i // 2000 can be treated recursively as the index into the Cartesian product of the remaining factors.
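
    The three steps above can be sketched for the case where the sample size n is known in advance. This is a minimal sketch rather than the answer's own code: the names index_to_tuple and sample_product are made up for illustration, and the factors are assumed to be sequences (convert sets with list or sorted first). math.prod needs Python 3.8 or later.

```python
import math
import random


def index_to_tuple(index, factors):
    """Decode a flat index into one element per factor (first factor varies fastest)."""
    items = []
    for factor in factors:
        index, position = divmod(index, len(factor))
        items.append(factor[position])
    return tuple(items)


def sample_product(factors, n):
    """Return n distinct tuples of the Cartesian product, in random order."""
    total = math.prod(len(factor) for factor in factors)
    # random.sample on a range picks n distinct indices without building a list
    # of all indices; it raises ValueError if n > total.
    indices = random.sample(range(total), n)
    return [index_to_tuple(i, factors) for i in indices]


if __name__ == "__main__":
    factors = [[1, 2, 3], [4, 5, 6], [1, 4, 7]]
    for combo in sample_product(factors, 5):
        print(combo)
```

    Because distinct indices decode to distinct positions, no rejection step is needed, and the running time does not depend on how close n is to the product size.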

    So this is the code I came up with:

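    import functools
    import random

    def random_order_cartesian_product(*factors):
        # Total number of tuples in the product.
        amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
        # Singly linked list of used indices in ascending order; nodes are [index, next].
        index_linked_list = [None, None]
        for max_index in reversed(range(amount)):
            index = random.randint(0, max_index)
            index_link = index_linked_list
            # Skip over every index that was already used.
            while index_link[1] is not None and index_link[1][0] <= index:
                index += 1
                index_link = index_link[1]
            index_link[1] = [index, index_link[1]]
            # Decode the index into one element per factor (factors must be sequences).
            items = []
            for factor in factors:
                items.append(factor[index % len(factor)])
                index //= len(factor)
            yield items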
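
    If walking the linked list becomes a bottleneck, one alternative is a lazy Fisher-Yates shuffle. This is a different technique from the one above: it keeps a dictionary of swapped positions so that each new index costs O(1) on average. A minimal sketch, assuming the factors are sequences; the function name lazy_random_product is hypothetical:

```python
import math
import random


def lazy_random_product(*factors):
    """Yield every tuple of the Cartesian product exactly once, in random order."""
    total = math.prod(len(factor) for factor in factors)
    swapped = {}  # position -> value, only for positions that differ from identity
    for remaining in range(total, 0, -1):
        pick = random.randrange(remaining)
        last = remaining - 1
        # Value currently stored at the picked position of the virtual array.
        index = swapped.get(pick, pick)
        # Move the value of the last live position into the picked slot.
        swapped[pick] = swapped.get(last, last)
        swapped.pop(last, None)  # position `last` is never read again
        # Decode the flat index into one element per factor.
        items = []
        for factor in factors:
            index, position = divmod(index, len(factor))
            items.append(factor[position])
        yield tuple(items)
```

    Memory use still grows with the number of tuples drawn so far, since the dictionary gains at most one entry per draw.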
    
  • 2020-12-18 14:03

    Matmarbon's answer is valid. This is a complete version with an example and some modifications for easier understanding and use:

    import functools
    import random
    
    def random_order_cartesian_product(factors):
        amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
        print(amount)
        print(len(factors[0]))
        index_linked_list = [None, None]
        for max_index in reversed(range(amount)):
            index = random.randint(0, max_index)
            index_link = index_linked_list
            while index_link[1] is not None and index_link[1][0] <= index:
                index += 1
                index_link = index_link[1]
            index_link[1] = [index, index_link[1]]
            items = []
            for factor in factors:
                items.append(factor[index % len(factor)])
                index //= len(factor)
            yield items
    
    
    factors=[
        [1,2,3],
        [4,5,6],
        [7,8,9]
    ]
    
    n = 5
    
    products = random_order_cartesian_product(factors)  # avoid shadowing the built-in all()
    
    count = 0
    
    for comb in products:
        print(comb)
        count += 1
        if count == n:
            break
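
    The counting loop at the end can also be written with itertools.islice, which stops after n items and also copes with generators that run out early. A small sketch; the stand-in generator countdown and the helper take are hypothetical and only there to keep the example self-contained:

```python
import itertools


def countdown(start):
    """Stand-in generator (hypothetical); any generator works the same way."""
    while start > 0:
        yield start
        start -= 1


def take(generator, n):
    """Return at most the first n items of a generator as a list."""
    return list(itertools.islice(generator, n))


if __name__ == "__main__":
    print(take(countdown(10), 5))
    print(take(countdown(2), 5))
```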
    
  • 2020-12-18 14:04

    The following generator function generates non-repetitive samples. It only performs well if the number of samples generated is much smaller than the number of possible samples. It also requires the elements of the sets to be hashable:

    import random

    def samples(list_of_sets):
        list_of_lists = list(map(list, list_of_sets))  # choice only works on sequences
        seen = set()  # keep track of seen samples
        while True:  # never terminates once all possible samples have been yielded
            x = tuple(map(random.choice, list_of_lists))  # tuple is hashable
            if x not in seen:
                seen.add(x)
                yield x
    
    >>> lst = [{'b', 'a'}, {'c', 'd'}, {'f', 'e'}, {'g', 'h'}]
    >>> gen = samples(lst)
    >>> next(gen)
    ('b', 'c', 'f', 'g')
    >>> next(gen)
    ('a', 'c', 'e', 'g')
    >>> next(gen)
    ('b', 'd', 'f', 'h')
    >>> next(gen)
    ('a', 'c', 'f', 'g')
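
    One caveat: once every combination has been yielded, the generator above never finds a new sample and loops forever. A variant that stops on its own is sketched below; the name bounded_samples is made up for illustration, and math.prod requires Python 3.8 or later.

```python
import math
import random


def bounded_samples(list_of_sets):
    """Yield distinct random samples until the Cartesian product is exhausted."""
    list_of_lists = [list(s) for s in list_of_sets]  # choice needs sequences
    total = math.prod(len(lst) for lst in list_of_lists)
    seen = set()
    while len(seen) < total:  # stop once every combination has been produced
        x = tuple(random.choice(lst) for lst in list_of_lists)
        if x not in seen:
            seen.add(x)
            yield x
```

    It still slows down near the end, because the last few unseen combinations are found by repeated guessing.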
    