How to create 2d array with numpy random.choice for every rows?

后端 未结 4 1831
我寻月下人不归
我寻月下人不归 2020-11-30 15:30

I\'m trying to create a 2d array (which is a six column and lots of rows) with numpy random choice with unique values between 1 and 50 for every row not all of the array

相关标签:
4条回答
  • 2020-11-30 16:01

    This isn't pure numpy but you could wrap your solution within a list comprehension

    >>> rows = 10
    >>> cols = 6
    >>> np.array([np.random.choice(np.arange(1, 50), size=cols, replace=False) for _ in range(rows)])
    array([[ 9, 10, 21, 33, 34, 15],
           [48, 46, 36,  7, 37, 45],
           [21, 15,  5,  9, 31, 26],
           [48, 24, 30, 18, 47, 23],
           [22, 31, 19, 32,  3, 33],
           [35, 44, 15, 46, 20, 43],
           [11, 37, 44,  6, 16, 35],
           [42, 49, 41, 28, 12, 19],
           [19,  6, 32,  3,  1, 22],
           [29, 33, 42,  5, 30, 43]])
    
    0 讨论(0)
  • 2020-11-30 16:10

    Here is a constructive approach, draw first (50 choices), second (49 choices) etc. For large sets it's quite competitive (pp in table):

    # n = 10
    # pp                    0.18564210 ms
    # Divakar               0.01960790 ms
    # James                 0.20074140 ms
    # CK                    0.17823420 ms
    # n = 1000
    # pp                    0.80046050 ms
    # Divakar               1.31817130 ms
    # James                18.93511460 ms
    # CK                   20.83670820 ms
    # n = 1000000
    # pp                  655.32905590 ms
    # Divakar            1352.44713990 ms
    # James             18471.08987370 ms
    # CK                18369.79808050 ms
    # pp     checking plausibility...
    #     var (exp obs) 208.333333333 208.363840259
    #     mean (exp obs) 25.5 25.5064865
    # Divakar     checking plausibility...
    #     var (exp obs) 208.333333333 208.21113972
    #     mean (exp obs) 25.5 25.499471
    # James     checking plausibility...
    #     var (exp obs) 208.333333333 208.313436938
    #     mean (exp obs) 25.5 25.4979035
    # CK     checking plausibility...
    #     var (exp obs) 208.333333333 208.169585249
    #     mean (exp obs) 25.5 25.49
    

    Code including benchmarking. Algo is a bit complicated because mapping to free spots is hairy:

    import numpy as np
    import types
    from timeit import timeit
    
    def f_pp(n):
        draw = np.empty((n, 6), dtype=int)
        # generating random numbers is expensive, so draw a large one and
        # make six out of one
        draw[:, 0] = np.random.randint(0, 50*49*48*47*46*45, (n,))
        draw[:, 1:] = np.arange(50, 45, -1)
        draw = np.floor_divide.accumulate(draw, axis=-1)
        draw[:, :-1] -= draw[:, 1:] * np.arange(50, 45, -1)
        # map the shorter ranges (:49, :48, :47) to the non-occupied
        # positions; this amounts to incrementing for each number on the
        # left that is not larger. the nasty bit: if due to incrementing
        # new numbers on the left are "overtaken" then for them we also
        # need to increment.
        for i in range(1, 6):
            coll = np.sum(draw[:, :i] <= draw[:, i, None], axis=-1)
            collidx = np.flatnonzero(coll)
            if collidx.size == 0:
                continue
            coll = coll[collidx]
            tot = coll
            while True:
                draw[collidx, i] += coll
                coll = np.sum(draw[collidx, :i] <= draw[collidx, i, None], axis=-1)
                relidx = np.flatnonzero(coll > tot)
                if relidx.size == 0:
                    break
                coll, tot = coll[relidx]-tot[relidx], coll[relidx]
                collidx = collidx[relidx]
    
        return draw + 1
    
    def check_result(draw, name):
        print(name[2:], '    checking plausibility...')
        import scipy.stats
        assert all(len(set(row)) == 6 for row in draw)
        assert len(set(draw.ravel())) == 50
        print('    var (exp obs)', scipy.stats.uniform(0.5, 50).var(), draw.var())
        print('    mean (exp obs)', scipy.stats.uniform(0.5, 50).mean(), draw.mean())
    
    def f_Divakar(n):
        return np.random.rand(n, 50).argpartition(6,axis=1)[:,:6]+1
    
    def f_James(n):
        return np.stack([np.random.choice(np.arange(1,51),size=6,replace=False) for i in range(n)])
    
    def f_CK(n):
        return np.array([np.random.choice(np.arange(1, 51), size=6, replace=False) for _ in range(n)])
    
    for n in (10, 1_000, 1_000_000):
        print(f'n = {n}')
        for name, func in list(globals().items()):
            if not name.startswith('f_') or not isinstance(func, types.FunctionType):
                continue
            try:
                print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                    'f(n)', globals={'f':func, 'n':n}, number=10)*100))
            except:
                print("{:16s} apparently failed".format(name[2:]))
        if(n >= 10000):
            for name, func in list(globals().items()):
                if name.startswith('f_') and isinstance(func, types.FunctionType):
    
                    check_result(func(n), name)
    
    0 讨论(0)
  • 2020-11-30 16:12

    You can create each row by itself and then stack them.

    np.stack([np.random.choice(np.arange(1,50),size=6,replace=False) for i in range(100)])
    
    0 讨论(0)
  • 2020-11-30 16:17

    Here's a vectorized approach with rand+argsort/argpartition trick from here -

    np.random.rand(rows, 50).argpartition(6,axis=1)[:,:6]+1
    

    Sample run -

    In [41]: rows = 10
    
    In [42]: np.random.rand(rows, 50).argpartition(6,axis=1)[:,:6]+1
    Out[42]: 
    array([[ 1,  9,  3, 26, 14, 44],
           [32, 20, 27, 13, 25, 45],
           [40, 12, 47, 16, 10, 29],
           [ 6, 36, 32, 16, 18,  4],
           [42, 46, 24,  9,  1, 31],
           [15, 25, 47, 42, 34, 24],
           [ 7, 16, 49, 31, 40, 20],
           [28, 17, 47, 36,  8, 44],
           [ 7, 42, 14,  4, 17, 35],
           [39, 19, 37,  7,  8, 36]])
    

    Just to prove the random-ness -

    In [56]: rows = 1000000
    
    In [57]: out = np.random.rand(rows, 50).argpartition(6,axis=1)[:,:6]+1
    
    In [58]: np.bincount(out.ravel())[1:]
    Out[58]: 
    array([120048, 120026, 119942, 119838, 119885, 119669, 119965, 119491,
           120280, 120108, 120293, 119399, 119917, 119974, 120195, 119796,
           119887, 119505, 120235, 119857, 119499, 120560, 119891, 119693,
           120081, 120369, 120011, 119714, 120218, 120581, 120111, 119867,
           119791, 120265, 120457, 120048, 119813, 119702, 120266, 120445,
           120016, 120190, 119576, 119737, 120153, 120215, 120144, 120196,
           120218, 119863])
    

    Timings on one million rows of data -

    In [43]: rows = 1000000
    
    In [44]: %timeit np.random.rand(rows, 50).argpartition(6,axis=1)[:,:6]+1
    1 loop, best of 3: 1.07 s per loop
    
    0 讨论(0)
提交回复
热议问题