Fill matrix of occurrences from column/row arrays of indexes

后悔当初 2021-01-13 21:10

I'm searching for an efficient way to create a matrix of occurrences from two arrays that contain indexes: one represents the row indexes in this matrix, the other the column indexes.

2 Answers
  •  囚心锁ツ
    2021-01-13 21:38

    Approach #1

    We can convert those pairs to linear indices and then use np.bincount -

    import numpy as np

    def bincount_app(rows, columns, n_rows, n_columns):
        # Get linear index equivalent (row-major, so the stride is n_columns)
        lidx = n_columns*rows + columns
    
        # Use binned count on the linear indices
        return np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)
    

    Sample run -

    In [242]: n_rows    = 4
         ...: n_columns = 3
         ...: 
         ...: rows    = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
         ...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
    
    In [243]: bincount_app(rows, columns, n_rows, n_columns)
    Out[243]: 
    array([[1, 0, 0],
           [0, 2, 0],
           [0, 1, 2],
           [2, 1, 5]])
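
    To make the linear-index step concrete, here is a small sketch on the sample data above. It assumes row-major (C-order) layout, so the stride between rows has to be n_columns for the final reshape to line up, and it cross-checks the hand-written formula against np.ravel_multi_index. The names lidx_manual, lidx_builtin and result are only illustrative.

```python
import numpy as np

n_rows, n_columns = 4, 3
rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])

# Row-major linear index of each (row, column) pair, computed two ways.
lidx_manual = n_columns * rows + columns
lidx_builtin = np.ravel_multi_index((rows, columns), (n_rows, n_columns))
assert np.array_equal(lidx_manual, lidx_builtin)

# Count how often each flat position occurs, then fold back into 2D.
counts = np.bincount(lidx_builtin, minlength=n_rows * n_columns)
result = counts.reshape(n_rows, n_columns)
print(result)
```

    Passing minlength matters here: without it, np.bincount stops at the largest index seen, and the reshape fails whenever the last cell of the matrix is never hit.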
    

    Approach #2

    Alternatively, we can sort the linear indices, mark where each run of equal values starts by comparing adjacent elements, and take the run lengths as the counts, like so -

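    def mask_diff_app(rows, columns, n_rows, n_columns):
        # Linear index of each pair (row-major, so the stride is n_columns)
        lidx = n_columns*rows + columns
        lidx.sort()
        # Mark the start of each run of equal indices, plus an end sentinel
        mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
        # Distances between run starts are the counts of each distinct index
        count = np.diff(np.flatnonzero(mask))
        new_matrix = np.zeros([n_rows, n_columns],dtype=int)
        new_matrix.flat[lidx[mask[:-1]]] = count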
        return new_matrix
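
    The mask/diff step can be easier to follow on a tiny input. The sketch below traces it on a made-up array of linear indices (the values are hypothetical, chosen only for illustration) and compares the outcome with np.unique(..., return_counts=True), which computes the same thing.

```python
import numpy as np

lidx = np.array([4, 0, 4, 7, 8, 8])    # hypothetical linear indices
lidx.sort()                             # -> [0, 4, 4, 7, 8, 8]

# True where a new run of equal values starts, plus a sentinel at the end.
mask = np.concatenate(([True], lidx[1:] != lidx[:-1], [True]))
starts = np.flatnonzero(mask)           # positions of run starts and the end
count = np.diff(starts)                 # length of each run
unique_idx = lidx[mask[:-1]]            # first element of each run

# Cross-check against NumPy's built-in run counting.
ref_idx, ref_count = np.unique(lidx, return_counts=True)
assert np.array_equal(unique_idx, ref_idx)
assert np.array_equal(count, ref_count)
```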
    

    Approach #3

    This one is straightforward with SciPy's sparse csr_matrix, since it accumulates repeated indices on its own. The benefit is memory efficiency: because the result is sparse, the savings are noticeable when only a small fraction of the output is filled and a sparse matrix output is acceptable.

    The implementation would look something like this -

    from scipy.sparse import csr_matrix
    
    def sparse_matrix_app(rows, columns, n_rows, n_columns):
        out_shp = (n_rows, n_columns)
        data = np.ones(len(rows),dtype=int)
        return csr_matrix((data, (rows, columns)), shape=out_shp)
    

    If you need a regular/dense array, simply do -

    sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
    

    Sample output -

    In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
    Out[319]: 
    array([[1, 0, 0],
           [0, 2, 0],
           [0, 1, 2],
           [2, 1, 5]])
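
    As a rough illustration of where the memory saving comes from, the sketch below builds the same matrix from the sample data and compares the number of stored entries with the number of distinct (row, column) pairs. It relies on the csr_matrix constructor summing duplicate coordinates; the variable names are only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
data = np.ones(len(rows), dtype=int)

m = csr_matrix((data, (rows, columns)), shape=(4, 3))

stored = m.nnz                                    # explicitly stored entries
distinct = len(set(zip(rows.tolist(), columns.tolist())))
dense = m.toarray()

print(stored, distinct, dense.size)
```

    Only the distinct pairs are stored, so the footprint scales with the number of occupied cells rather than with n_rows * n_columns.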
    

    Benchmarking

    Other approach(es) -

    # @cᴏʟᴅsᴘᴇᴇᴅ's soln
    def add_at_app(rows, columns, n_rows, n_columns):
        new_matrix = np.zeros([n_rows, n_columns],dtype=int)
        # Unbuffered add, so repeated (row, column) pairs are all counted
        np.add.at(new_matrix, (rows, columns), 1)
        return new_matrix
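
    It may help to see why np.add.at is used here rather than a plain fancy-indexed +=. The minimal sketch below uses a small made-up input with a repeated pair; the array names buffered and unbuffered are only illustrative.

```python
import numpy as np

rows = np.array([3, 3, 3, 0])
columns = np.array([2, 2, 2, 0])

# Fancy-indexed += is buffered: a repeated pair is incremented only once.
buffered = np.zeros((4, 3), dtype=int)
buffered[rows, columns] += 1

# np.add.at is unbuffered: every occurrence of a pair is accumulated.
unbuffered = np.zeros((4, 3), dtype=int)
np.add.at(unbuffered, (rows, columns), 1)

print(buffered[3, 2], unbuffered[3, 2])
```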
    

    Timings

    Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k

    In [307]: # Setup
         ...: n_rows = 1000
         ...: n_columns = 1000
         ...: rows = np.random.randint(0,1000,(10000))
         ...: columns = np.random.randint(0,1000,(10000))
    
    In [308]: %timeit add_at_app(rows, columns, n_rows, n_columns)
         ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
         ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
         ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
    1000 loops, best of 3: 1.05 ms per loop
    1000 loops, best of 3: 424 µs per loop
    1000 loops, best of 3: 1.05 ms per loop
    1000 loops, best of 3: 1.41 ms per loop
    

    Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k

    In [309]: # Setup
         ...: n_rows = 1000
         ...: n_columns = 1000
         ...: rows = np.random.randint(0,1000,(100000))
         ...: columns = np.random.randint(0,1000,(100000))
    
    In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
         ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
         ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
         ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
    100 loops, best of 3: 11.4 ms per loop
    1000 loops, best of 3: 1.27 ms per loop
    100 loops, best of 3: 7.44 ms per loop
    10 loops, best of 3: 20.4 ms per loop
    

    Case #3 : Sparsity in output

    As stated earlier, the sparse method only pays off when the output is sparse. Such a case would look like this -

    In [314]: # Setup
         ...: n_rows = 5000
         ...: n_columns = 5000
         ...: rows = np.random.randint(0,5000,(1000))
         ...: columns = np.random.randint(0,5000,(1000))
    
    In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
         ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
         ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
         ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
    100 loops, best of 3: 11.7 ms per loop
    100 loops, best of 3: 11.1 ms per loop
    100 loops, best of 3: 11.1 ms per loop
    1000 loops, best of 3: 269 µs per loop
    

    If you need a dense array, we lose the memory efficiency and, with it, the performance advantage -

    In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
    100 loops, best of 3: 11.7 ms per loop
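
    The %timeit lines above need IPython. For a plain script, something along these lines could be used to re-run the comparison. It is a sketch, not the setup used for the numbers above: best_time is a hypothetical helper, only two of the NumPy approaches are included (the others can be added to the tuple the same way), and absolute timings will differ by machine and library version.

```python
import timeit
import numpy as np

def bincount_app(rows, columns, n_rows, n_columns):
    # Row-major linear index, then a binned count folded back to 2D.
    lidx = n_columns * rows + columns
    return np.bincount(lidx, minlength=n_rows * n_columns).reshape(n_rows, n_columns)

def add_at_app(rows, columns, n_rows, n_columns):
    out = np.zeros((n_rows, n_columns), dtype=int)
    np.add.at(out, (rows, columns), 1)
    return out

def best_time(func, args, number=10, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Same shape and index count as Case #1, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
n_rows = n_columns = 1000
rows = rng.integers(0, n_rows, 10_000)
columns = rng.integers(0, n_columns, 10_000)
args = (rows, columns, n_rows, n_columns)

# Check that the approaches agree before comparing their speed.
assert np.array_equal(bincount_app(*args), add_at_app(*args))

timings = {f.__name__: best_time(f, args) for f in (bincount_app, add_at_app)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```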
    
