Efficiently counting number of unique elements - NumPy / Python

后端 未结 3 2108
遥遥无期
遥遥无期 2021-01-03 02:07

When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second

相关标签:
3条回答
  • 2021-01-03 02:25

    We could leverage the fact that the elements are restricted to uint8 range by binned-counting with np.bincount and then simply count the number of non-zeros in it. Since, np.bincount expects a 1D array, we would flatten the input with np.ravel() and then feed it to bincount.

    Hence, the implementation would be -

    (np.bincount(a.ravel())!=0).sum()
    

    Runtime test

    Helper function to create input array with various number of unique numbers -

    def create_input(n_unique):
        unq_nums = np.random.choice(np.arange(256), n_unique,replace=0)
        return np.random.choice(unq_nums, (10,3000,3000)).astype(np.uint8)
    

    Other approach(es) :

    # @Warren Weckesser's soln
    def assign_method(a):
        q = np.zeros(256, dtype=int)
        q[a.ravel()] = 1
        return len(np.nonzero(q)[0])
    

    Verification of proposed method -

    In [141]: a = create_input(n_unique=120)
    
    In [142]: len(np.unique(a))
    Out[142]: 120
    
    In [143]: (np.bincount(a.ravel())!=0).sum()
    Out[143]: 120
    

    Timings -

    In [124]: a = create_input(n_unique=128)
    
    In [125]: %timeit len(np.unique(a)) # Original soln
         ...: %timeit assign_method(a)  # @Warren Weckesser's soln
         ...: %timeit (np.bincount(a.ravel())!=0).sum()
         ...: 
    1 loop, best of 3: 3.09 s per loop
    1 loop, best of 3: 394 ms per loop
    1 loop, best of 3: 209 ms per loop
    
    In [126]: a = create_input(n_unique=256)
    
    In [127]: %timeit len(np.unique(a)) # Original soln
         ...: %timeit assign_method(a)  # @Warren Weckesser's soln
         ...: %timeit (np.bincount(a.ravel())!=0).sum()
         ...: 
    1 loop, best of 3: 3.46 s per loop
    1 loop, best of 3: 378 ms per loop
    1 loop, best of 3: 212 ms per loop
    
    0 讨论(0)
  • 2021-01-03 02:32

    Here's a method that works for an array with dtype np.uint8 that is faster than np.unique.

    First, create an array to work with:

    In [128]: a = np.random.randint(1, 128, size=(10, 3000, 3000)).astype(np.uint8)
    

    For later comparison, find the unique values using np.unique:

    In [129]: u = np.unique(a)
    

    Here's the faster method; v will contain the result:

    In [130]: q = np.zeros(256, dtype=int)
    
    In [131]: q[a.ravel()] = 1
    
    In [132]: v = np.nonzero(q)[0]
    

    Verify that we got the same result:

    In [133]: np.array_equal(u, v)
    Out[133]: True
    

    Timing:

    In [134]: %timeit u = np.unique(a)
    2.86 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [135]: %timeit q = np.zeros(256, dtype=int); q[a.ravel()] = 1; v = np.nonzero(q)
    300 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    So 2.86 seconds for np.unique(), and 0.3 seconds for the alternative method.

    0 讨论(0)
  • 2021-01-03 02:33

    If you don't mind using Numba to JIT compile your code, and modifying your code to make it easy for Numba to do its magic, then it's possible to get some good improvements over the suggestions already listed in the other answers.

    Using the naming from @Divakar's post:

    from numba import jit
    import numpy as np
    
    def create_input(n_unique):
        unq_nums = np.random.choice(np.arange(256), n_unique, replace=0)
        return np.random.choice(unq_nums, (10, 3000, 3000)).astype(np.uint8)
    
    def unique(a):
        return len(np.unique(a))
    
    def assign_method(a):
        q = np.zeros(256, dtype=int)
        q[a.ravel()] = 1
        return len(np.nonzero(q)[0])
    
    def bincount(a):
        return (np.bincount(a.ravel())!=0).sum()
    
    def numba_friendly(a):
        q = np.zeros(256, dtype=int)
        count = 0
        for x in a.ravel():
            if q[x] == 0:
                q[x] = 1
                count += 1
        return count
    
    unique_jit = jit(unique)
    assign_method_jit = jit(assign_method)
    bincount_jit = jit(bincount)
    numba_friendly_jit = jit(numba_friendly)
    

    Benchmark:

    a = create_input(n_unique=128)
    
    %timeit unique(a)
    %timeit unique_jit(a)
    %timeit assign_method(a)
    %timeit assign_method_jit(a)
    %timeit bincount(a)
    %timeit bincount_jit(a)
    %timeit numba_friendly_jit(a)
    

    Results:

    unique:               7.5 s ± 1.14 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    unique_jit:          13.4 s ± 2.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    assign_method:       388 ms ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    assign_method_jit:   341 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    bincount:            2.71 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    bincount_jit:        138 ms ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    numba_friendly_jit: 56.4 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    0 讨论(0)
提交回复
热议问题