I'm trying to build and update a sparse matrix as I read data from a file.
The matrix is of size 100000 x 40000.
What is the most efficient way of updating the matrix?
import scipy.sparse
rows = [2, 236, 246, 389, 1691]
cols = [117, 3, 34, 2757, 74, 1635, 52]
prod = [(x, y) for x in rows for y in cols] # combinations
r = [x for (x, y) in prod] # x_coordinate
c = [y for (x, y) in prod] # y_coordinate
data = [1] * len(r)
m = scipy.sparse.coo_matrix((data, (r, c)), shape=(100000, 40000))
I think this works well and doesn't need explicit loops. It follows the documentation directly. The resulting matrix is:
<100000x40000 sparse matrix of type '<type 'numpy.int32'>'
with 35 stored elements in COOrdinate format>
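The same coordinate lists can also be produced with NumPy rather than list comprehensions. This is only a minimal sketch: the helper name `outer_coo` is hypothetical, and it assumes every (row, column) pair should receive a 1, as in the code above.

```python
import numpy as np
import scipy.sparse as sps

def outer_coo(rows, cols, shape):
    """Hypothetical helper: COO matrix with a 1 at every (row, col) pair."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    r = np.repeat(rows, cols.size)  # each row index repeated once per column
    c = np.tile(cols, rows.size)    # the whole column list repeated once per row
    data = np.ones(r.size, dtype=np.int32)
    return sps.coo_matrix((data, (r, c)), shape=shape)

m = outer_coo([2, 236, 246, 389, 1691],
              [117, 3, 34, 2757, 74, 1635, 52],
              shape=(100000, 40000))
```

`np.repeat` and `np.tile` generate the pairs in the same order as the nested list comprehension, so `m` should hold the same 35 entries.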
This answer expands on the comment by @behzad.nouri. To increment the values at the "outer product" of your lists of row and column indices, create them as NumPy arrays shaped for broadcasting. In this case, that means reshaping the row indices into a column vector. For example,
In [59]: a = lil_matrix((4,4), dtype=int)
In [60]: a.A
Out[60]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
In [61]: rows = np.array([1,3]).reshape(-1, 1)
In [62]: rows
Out[62]:
array([[1],
[3]])
In [63]: cols = np.array([0, 2, 3])
In [64]: a[rows, cols] += np.ones((rows.size, cols.size))
In [65]: a.A
Out[65]:
array([[0, 0, 0, 0],
[1, 0, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 1]])
In [66]: rows = np.array([0, 1]).reshape(-1,1)
In [67]: cols = np.array([1, 2])
In [68]: a[rows, cols] += np.ones((rows.size, cols.size))
In [69]: a.A
Out[69]:
array([[0, 1, 1, 0],
[1, 1, 2, 1],
[0, 0, 0, 0],
[1, 0, 1, 1]])
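The session above can be wrapped in a small function. This is a sketch under assumptions, not the answer's own code: the name `increment_block` is hypothetical, and it reads the block out as a dense array before writing it back, which is only sensible when the block itself is small.

```python
import numpy as np
from scipy.sparse import lil_matrix

def increment_block(mat, rows, cols, amount=1):
    """Hypothetical helper: add `amount` at every (row, col) pair, in place."""
    rows = np.asarray(rows).reshape(-1, 1)  # column vector, so it broadcasts against cols
    cols = np.asarray(cols)
    block = mat[rows, cols].toarray()       # dense copy of the selected block
    mat[rows, cols] = block + amount        # write the incremented block back

a = lil_matrix((4, 4), dtype=int)
increment_block(a, [1, 3], [0, 2, 3])
increment_block(a, [0, 1], [1, 2])
result = a.toarray()
```

After both calls, `result` should match the final array shown in the session above.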
Creating a second matrix with 1s in your new coordinates and adding it to the existing one is a possible way of doing this:
>>> import numpy as np
>>> import scipy.sparse as sps
>>> shape = (1000, 2000)
>>> rows, cols = 1000, 2000
>>> sps_acc = sps.coo_matrix((rows, cols)) # empty matrix
>>> for j in range(100): # add 100 sets of 100 1's
... r = np.random.randint(rows, size=100)
... c = np.random.randint(cols, size=100)
... d = np.ones((100,))
... sps_acc = sps_acc + sps.coo_matrix((d, (r, c)), shape=(rows, cols))
...
>>> sps_acc
<1000x2000 sparse matrix of type '<type 'numpy.float64'>'
with 9985 stored elements in Compressed Sparse Row format>
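A variation on this idea, offered as a sketch rather than as the answer's method: collect all the index arrays first and build a single COO matrix at the end, instead of adding a new matrix on every iteration. It relies on the fact that duplicate coordinates in a COO matrix are summed when it is converted to CSR. The fixed seed and the variable names are assumptions made only for the example.

```python
import numpy as np
import scipy.sparse as sps

n_rows, n_cols = 1000, 2000
rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility

all_r, all_c = [], []
for _ in range(100):            # 100 batches of 100 coordinates each
    all_r.append(rng.randint(n_rows, size=100))
    all_c.append(rng.randint(n_cols, size=100))

r = np.concatenate(all_r)
c = np.concatenate(all_c)
d = np.ones(r.size)

# Duplicate (row, col) pairs are summed during the COO -> CSR conversion.
acc = sps.coo_matrix((d, (r, c)), shape=(n_rows, n_cols)).tocsr()
```

The number of stored elements can be lower than the number of coordinates added, because repeated coordinates are merged into a single entry holding their sum.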