Apache Spark Python Cosine Similarity over DataFrames

生来就可爱ヽ(ⅴ<●) 提交于 2019-11-27 17:51:38

问题


For a Recommender System, I need to compute the cosine similarity between all the columns of a whole Spark DataFrame.

In Pandas I used to do this:

import sklearn.metrics as metrics
import pandas as pd

df= pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T,df.T)

That generates the Similarity Matrix between the columns (since I used the transposition)

Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix made of tens of millions of rows, and thousands of columns, so that's why I need to do it in Spark)


回答1:


You can use the built-in columnSimilarities() method on a RowMatrix, that can both calculate the exact cosine similarities, or estimate it using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter, you'll have to specify a threshold.

Here's a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to RowMatrix
mat = RowMatrix(rows)

# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)

# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]


来源:https://stackoverflow.com/questions/43921636/apache-spark-python-cosine-similarity-over-dataframes

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!