Question:
I am using Python scikit-learn for document clustering and I have a sparse matrix stored in a dict object:
For example:
    doc_term_dict = {
        ('d1','t1'): 12,
        ('d2','t3'): 10,
        ('d3','t2'): 5
    }  # from mysql data table
    <type 'dict'>
I want to use scikit-learn to do the clustering, where the input matrix type is scipy.sparse.csr.csr_matrix.
Example:
    (0, 2164)  0.245793088885
    (0, 2076)  0.205702177467
    (0, 2037)  0.193810934784
    (0, 2005)  0.14547028437
    (0, 1953)  0.153720023365
    ...
    <class 'scipy.sparse.csr.csr_matrix'>
I can't find a way to convert the dict to this CSR matrix (I have never used scipy).
Answer 1:
Pretty straightforward. First read the dictionary and convert the keys to the appropriate row and column. Scipy supports (and recommends for this purpose) the COO-rdinate format for sparse matrices.
Pass it data, row, and column, where A[row[k], column[k]] = data[k] (for all k) defines the matrix. Then let Scipy do the conversion to CSR.
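As a minimal sketch of that triplet format (the values are taken from the question's example, already converted to 0-based indices; the explicit 3x3 shape is an assumption made for the illustration):

```python
from scipy.sparse import coo_matrix

# Three parallel lists: entry k states that A[row[k], col[k]] = data[k].
data = [12, 10, 5]
row = [0, 1, 2]
col = [0, 2, 1]

# Build the matrix in COO format, then convert it to CSR.
coo = coo_matrix((data, (row, col)), shape=(3, 3))
csr = coo.tocsr()

# Densify only to inspect a small example; avoid this on large matrices.
print(csr.toarray())
```

Passing shape explicitly is optional. Without it, Scipy infers the shape from the largest row and column index present.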
Please check that I have the rows and columns the way you want them; I may have them transposed. I also assumed that the input is 1-indexed.
My code below prints:
    (0, 0)  12
    (1, 2)  10
    (2, 1)  5
Code:
    #!/usr/bin/env python3
    # http://stackoverflow.com/questions/26335059/converting-python-sparse-matrix-dict-to-scipy-sparse-matrix

    from scipy.sparse import csr_matrix, coo_matrix

    def convert(term_dict):
        ''' Convert a dictionary with elements of form ('d1', 't1'): 12 to a CSR type matrix.
        The element ('d1', 't1'): 12 becomes entry (0, 0) = 12.
        * Conversion from 1-indexed to 0-indexed.
        * d is row
        * t is column.
        '''
        # Create the appropriate format for the COO format.
        data = []
        row = []
        col = []
        for k, v in term_dict.items():
            r = int(k[0][1:])
            c = int(k[1][1:])
            data.append(v)
            row.append(r-1)
            col.append(c-1)
        # Create the COO-matrix
        coo = coo_matrix((data,(row,col)))
        # Let Scipy convert COO to CSR format and return
        return csr_matrix(coo)

    if __name__=='__main__':
        doc_term_dict = {
            ('d1','t1'): 12,
            ('d2','t3'): 10,
            ('d3','t2'): 5
        }
        print(convert(doc_term_dict))
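The code above relies on every key having the form 'd<number>' / 't<number>'. If the labels coming out of the database are arbitrary strings, one possible variant is to assign indices through lookup tables instead of parsing the labels. This is only a sketch: dict_to_csr_with_labels is a hypothetical helper name, and sorting the labels alphabetically is an arbitrary choice made here (note that 'd10' then sorts before 'd2').

```python
from scipy.sparse import coo_matrix

def dict_to_csr_with_labels(term_dict):
    """Hypothetical helper: build a CSR matrix from {(doc, term): value}.

    Returns the matrix plus the row and column labels, so that index i
    can be mapped back to docs[i] and index j to terms[j].
    """
    # Assumption: alphabetical order decides the row/column numbering.
    docs = sorted({d for d, _ in term_dict})
    terms = sorted({t for _, t in term_dict})
    doc_index = {d: i for i, d in enumerate(docs)}
    term_index = {t: j for j, t in enumerate(terms)}

    # Keys and values of a dict iterate in the same order.
    row = [doc_index[d] for d, _ in term_dict]
    col = [term_index[t] for _, t in term_dict]
    data = list(term_dict.values())

    coo = coo_matrix((data, (row, col)), shape=(len(docs), len(terms)))
    return coo.tocsr(), docs, terms

doc_term_dict = {
    ('d1', 't1'): 12,
    ('d2', 't3'): 10,
    ('d3', 't2'): 5,
}
matrix, docs, terms = dict_to_csr_with_labels(doc_term_dict)
print(matrix)
```

Returning the label lists alongside the matrix keeps the mapping from matrix indices back to document and term names, which is usually needed when interpreting the clusters.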
Answer 2:
We can make @Unapiedra's (excellent) answer a little more sparse:
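    from itertools import repeat

    import numpy as np
    from scipy.sparse import csr_matrix

    def _dict_to_csr(term_dict):
        # Python 2 code. Assumes the keys are already 0-based integer
        # (row, col) tuples, not labels such as ('d1', 't1').
        term_dict_v = list(term_dict.itervalues())
        term_dict_k = list(term_dict.iterkeys())
        # Square shape: the largest index on either axis, plus one.
        shape = list(repeat(np.asarray(term_dict_k).max() + 1, 2))
        csr = csr_matrix((term_dict_v, zip(*term_dict_k)), shape=shape)
        return csr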
Answer 3:
Same as @carsonc's answer, but for Python 3.x:

    from scipy.sparse import csr_matrix

    def _dict_to_csr(term_dict):
        # Assumes the keys are 0-based integer (row, col) tuples.
        rows, cols = zip(*term_dict.keys())
        # Square shape from the largest index plus one, as in the answer
        # above. Using the number of entries instead would make the shape
        # too small whenever an index reaches that count.
        n = max(max(rows), max(cols)) + 1
        csr = csr_matrix((list(term_dict.values()), (list(rows), list(cols))), shape=(n, n))
        return csr