Question
I have purchase data (df_temp). I switched from a Pandas DataFrame to a sparse csr_matrix because I have a lot of products (89000) for which I need the user-item information (purchased or not purchased), and I then need to calculate the similarities between products.
First, I converted the Pandas DataFrame to a NumPy record array:
df_user_product = df_temp[['user_id','product_id']].copy()
ar1 = np.array(df_user_product.to_records(index=False))
Second, I created a coo_matrix, because that format is known to be fast for sparse matrix construction:
rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
cols, c_pos = np.unique(ar1['user_id'], return_inverse=True)
s = sparse.coo_matrix((np.ones(r_pos.shape,int), (r_pos, c_pos)))
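To make the index mapping concrete, here is a minimal sketch on toy data. The user and product ids below are made up for illustration and are not from df_temp; the point is only to show what np.unique(..., return_inverse=True) returns and how repeated purchases show up in the matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy purchases: user 30 bought product 900 twice.
user_ids = np.array([10, 10, 20, 30, 30, 30])
product_ids = np.array([500, 700, 500, 700, 900, 900])

# rows holds the sorted unique product ids; r_pos gives, for every purchase,
# the position of its product id inside rows (same idea for users).
rows, r_pos = np.unique(product_ids, return_inverse=True)
cols, c_pos = np.unique(user_ids, return_inverse=True)

# One entry per purchase; row = product position, column = user position.
s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)),
                      shape=(len(rows), len(cols)))

# Converting to dense is only acceptable here because the toy matrix is tiny.
# Duplicate (row, col) pairs are summed, so the repeated purchase becomes 2.
print(rows)
print(s.toarray())
```

The repeated purchase ending up as 2 is the reason the values are clipped back to 1 afterwards.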
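Third, csr_matrix or csc_matrix is a better fit for matrix calculations. I used csr_matrix because my product_id values are in the rows, and csr_matrix gives more efficient row slicing than csc_matrix. Any entry greater than 1 (a product bought more than once by the same user) is set back to 1: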
sparse_csr_mat = s.tocsr()
sparse_csr_mat[sparse_csr_mat > 1] = 1
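As a possible alternative to the boolean-mask assignment, the stored values can be overwritten directly. This is a sketch on a toy matrix (the index arrays are invented), relying on the fact that converting a coo_matrix to CSR sums duplicate entries, so every stored value afterwards is a count of at least 1.

```python
import numpy as np
from scipy import sparse

# Toy positions with one duplicate pair (2, 2); values are all ones.
r_pos = np.array([0, 1, 0, 1, 2, 2])
c_pos = np.array([0, 0, 1, 2, 2, 2])
s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)))

binary_csr = s.tocsr()     # the duplicate (2, 2) is summed to 2 here
binary_csr.data[:] = 1     # overwrite every stored count with 1

print(binary_csr.toarray())
```

This only touches the array of stored values, so the sparsity structure is left unchanged.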
Then, I calculated the cosine similarity between products and put the result in similarities:
import sklearn.preprocessing as pp
col_normed_mat = pp.normalize(sparse_csr_mat, axis=1)
similarities = col_normed_mat * col_normed_mat.T
Which is:
<89447x89447 sparse matrix of type '<type 'numpy.float64'>'
with 1332945 stored elements in Compressed Sparse Row format>
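To see what the normalize-then-multiply step computes, here is a small sketch that spells out the same idea with NumPy and SciPy only. It is meant to mirror pp.normalize(..., axis=1), which scales each row (here, each product) to unit L2 norm, and it is not a replacement for it. The toy matrix is invented, and the code assumes no row is empty, which holds when every product has at least one purchase.

```python
import numpy as np
from scipy import sparse

# Toy product x user matrix (3 products, 3 users), already binary.
m = sparse.csr_matrix(np.array([[1, 1, 0],
                                [1, 0, 1],
                                [0, 0, 1]], dtype=float))

# L2 norm of every row, computed without densifying the matrix.
norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())

# Scale each row to unit length, then take row-by-row dot products.
row_normed = sparse.diags(1.0 / norms) @ m
sim = (row_normed @ row_normed.T).tocsr()

# Dense print is fine only because this example is tiny.
print(sim.toarray())
```

Products 0 and 1 share one buyer out of two each, so their similarity is 0.5; products 0 and 2 share none, so that entry is simply not stored.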
Now I want to end up with a dictionary that holds, for each product, its 5 most similar products. How can I do that? I don't want to convert the sparse matrix to a dense array because of memory constraints. I also don't know whether a csr_matrix can be accessed the way an array can, for example by looking up index=product_id and getting every entry in that row. That way I would get all the products similar to product_id and could sort them by cosine similarity to keep the 5 most similar.
For example, an entry in the similarities matrix:
(product_id1, product_id2) 0.45
How can I keep only the X (=5 in my case) most similar products to product_id1 without converting the matrix to an array?
From what I found on Stack Overflow, I think lil_matrix can be used for this. How?
Thanks for the help!
Answer 1:
I finally understood how to get the 5 most similar items for each product: convert the matrix with .tolil(), then convert each row to a NumPy array and use argsort to get the 5 most similar items. I used the solution suggested by @hpaulj in this link.
def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i
Then I applied it to one row as a test (arr_ll is the matrix after .tolil()):
top_v, top_ind, ind = max_n(np.array(arr_ll.data[0]),np.array(arr_ll.rows[0]),5)
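Extending the single-row test to every row could look like the sketch below. The helper name top_n_per_row and the toy similarity matrix are made up for illustration, and dropping the diagonal entry (each product's similarity with itself) is my own assumption about what is wanted, not something the linked solution does.

```python
import numpy as np
from scipy import sparse

def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    return row_data[i], row_indices[i], i

def top_n_per_row(sim, n):
    """Return {row index: column indices of the n largest stored values}."""
    lil = sim.tolil()
    result = {}
    for r in range(lil.shape[0]):
        data = np.array(lil.data[r], dtype=float)
        cols = np.array(lil.rows[r], dtype=int)
        keep = cols != r                 # assumption: ignore self-similarity
        _, top_cols, _ = max_n(data[keep], cols[keep], n)
        result[r] = top_cols[::-1].tolist()   # most similar first
    return result

# Hypothetical 3 x 3 similarity matrix, small enough to check by hand.
sim = sparse.csr_matrix(np.array([[1.0, 0.5, 0.0],
                                  [0.5, 1.0, 0.7],
                                  [0.0, 0.7, 1.0]]))
top2 = top_n_per_row(sim, 2)
print(top2)
```

Only one row is turned into NumPy arrays at a time, so the full matrix is never densified. Rows with fewer than n stored neighbours simply return a shorter list.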
What I need is top_indices, which holds the indices of the 5 most similar products, but those indices are not the real product_id values. I mapped them when I constructed the coo_matrix:
rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
But how do I get the real product_id back from the indices?
Now for example I have:
top_ind = [2 1 34 9 123]
How do I know which product_id 2 corresponds to, which one 1 corresponds to, and so on?
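One thing that should work here, given how np.unique behaves: with return_inverse=True, r_pos holds positions into the rows array, so rows[r_pos] reproduces the original product_id column. That means a matrix row index can be turned back into a product_id by indexing rows with it. A small sketch with invented ids:

```python
import numpy as np

# Hypothetical product_id column, standing in for ar1['product_id'].
product_ids = np.array([500, 700, 500, 700, 900, 900])
rows, r_pos = np.unique(product_ids, return_inverse=True)

# rows[r_pos] rebuilds the original column, so rows is the lookup table
# from matrix row index to real product_id.
assert (rows[r_pos] == product_ids).all()

# Example indices as they would come out of max_n (values are made up).
top_ind = np.array([2, 0])
top_product_ids = rows[top_ind]
print(top_product_ids)
```

Because the similarity matrix is product by product, the same rows array translates both its row indices and its column indices.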
Source: https://stackoverflow.com/questions/52316812/using-csr-matrix-of-items-similarities-to-get-most-similar-items-to-item-x-witho