Repeat a scipy csr sparse matrix along axis 0

℡╲_俬逩灬. 提交于 2019-12-23 10:26:34

问题


I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treats the sparse matrix like an object, and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility to repeats the rows of a scipy csr sparse matrix.

I wrote the following code that operates on the internal data, which seems to work

def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)
    ndata = rnnz.dot(repeats)
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    indmap = np.ones(ndata, dtype=np.int)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))

and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?

Edit

As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.


回答1:


from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row



回答2:


It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).

In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.

I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.

I worked out a similar function for the lil format (working from lil.copy):

def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape=list(S.shape)
    if isinstance(repeat, int):
        shape[0]=shape[0]*repeat
    else:
        shape[0]=sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat) # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new

But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:

S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print S.A.repeat([1,2,3], axis=0)
print S.A[(0,1,1,2,2,2),:]
print lil_repeat(S,[1,2,3]).A
print S[(0,1,1,2,2,2),:].A

give the same result

and best of all?

print S[np.arange(3).repeat([1,2,3]),:].A



回答3:


After someone posted a really clever response for how best to do this I revisited my original question, to see if there was an even better way. I I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.

repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))

This technique repeats the vector repeat_num times, while only modifying the the indptr. The downside is that due to the way the csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal as any operation will be quick given that each row is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.

I think the marked answer is probably still the "best" way to do this.



来源:https://stackoverflow.com/questions/25335095/repeat-a-scipy-csr-sparse-matrix-along-axis-0

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!