Fill matrix of occurences from column/row arrays of indexes

后端未结

关注

 2  649

后悔当初 2021-01-13 21:10

I\'m searching for an efficient way to create a matrix of occurrences from two arrays that contains indexes, one represents the row indexes in this matrix,

2条回答

囚心锁ツ (楼主)

2021-01-13 21:38

Approach #1

We can convert those pairs to linear indices and then use np.bincount -

def bincount_app(rows, columns, n_rows, n_columns):
    # Get linear index equivalent
    lidx = (columns.max()+1)*rows + columns

    # Use binned count on the linear indices
    return np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)

Sample run -

In [242]: n_rows    = 4
     ...: n_columns = 3
     ...: 
     ...: rows    = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
     ...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])

In [243]: bincount_app(rows, columns, n_rows, n_columns)
Out[243]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Approach #2

Alternatively, we can sort the linear indices and get the counts using slicing to have our second approach, like so -

def mask_diff_app(rows, columns, n_rows, n_columns):
    lidx = (columns.max()+1)*rows + columns
    lidx.sort()
    mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
    count = np.diff(np.flatnonzero(mask))
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    new_matrix.flat[lidx[mask[:-1]]] = count
    return new_matrix

Approach #3

This seems like a straight-forward one with sparse matrix csr_matrix as well, as it does accumulation on its own for repeated indices. The benefit is the memory efficiency, given that it's a sparse matrix, which would be noticeable if you are filling a small number of places in the output and a sparse matrix output is okay.

The implementation would look something like this -

from scipy.sparse import csr_matrix

def sparse_matrix_app(rows, columns, n_rows, n_columns):
    out_shp = (n_rows, n_columns)
    data = np.ones(len(rows),dtype=int)
    return csr_matrix((data, (rows, columns)), shape=out_shp)

If you need a regular/dense array, simply do -

sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()

Sample output -

In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Out[319]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Benchmarking

Other approach(es) -

# @cᴏʟᴅsᴘᴇᴇᴅ's soln
def add_at_app(rows, columns, n_rows, n_columns):
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    np.add.at(new_matrix, (rows, columns), 1)

Timings

Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k

In [307]: # Setup
     ...: n_rows = 1000
     ...: n_columns = 1000
     ...: rows = np.random.randint(0,1000,(10000))
     ...: columns = np.random.randint(0,1000,(10000))

In [308]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 424 µs per loop
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 1.41 ms per loop

Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k

In [309]: # Setup
     ...: n_rows = 1000
     ...: n_columns = 1000
     ...: rows = np.random.randint(0,1000,(100000))
     ...: columns = np.random.randint(0,1000,(100000))

In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.4 ms per loop
1000 loops, best of 3: 1.27 ms per loop
100 loops, best of 3: 7.44 ms per loop
10 loops, best of 3: 20.4 ms per loop

Case #3 : Sparse-ness in output

As stated earlier, for the sparse method to work better, we would need sparse-ness. Such a case would be like this -

In [314]: # Setup
     ...: n_rows = 5000
     ...: n_columns = 5000
     ...: rows = np.random.randint(0,5000,(1000))
     ...: columns = np.random.randint(0,5000,(1000))

In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.7 ms per loop
100 loops, best of 3: 11.1 ms per loop
100 loops, best of 3: 11.1 ms per loop
1000 loops, best of 3: 269 µs per loop

If you need a dense array, we lose the memory efficiency and hence performance one as well -

In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
100 loops, best of 3: 11.7 ms per loop

0 讨论(0)

查看其它2个回答