问题
I have to compute a cosine distance between each rows but I have no idea how to do it using Spark API Dataframes elegantly. The idea is to compute similarities for each rows(items) and take top 10 similarities by comparing their similarities between rows. --> This is need for Item-Item Recommender System.
All that I've read about it is referred to computing similarity over columns Apache Spark Python Cosine Similarity over DataFrames May someone say is it possible to compute a cosine distance elegantly between rows using PySpark Dataframe's API or RDD's or I have to do it manually?
That's just some code to show what I intend to do
def cosineSimilarity(vec1, vec2):
return vec1.dot(vec2) / (LA.norm(vec1) * LA.norm(vec2))
#p.s model is ALS
Pred_Factors = model.itemFactors.cache() #Pred_Factors = DataFrame[id: int, features: array<float>]
sims = []
for _id,_feature in Pred_Factors.toLocalIterator():
for id, feature in Pred_Factors.toLocalIterator():
itemFactor = _feature
sims = sims.append(_id, cosineSimilarity(asarray(feature),itemFactor))
sims = sc.parallelize(l)
sortedSims = sims.takeOrdered(10, key=lambda x: -x[1])
Thanks in Advance for all the help
回答1:
You can use mllib.feature.IndexedRowMatrix
's columnSimilarities
function. It uses cosine metrics as distance function. It computes similarities between columns so, you have to take transpose before applying this function.
pred_ = IndexedRowMatrix(Pred_Factors.rdd.map(lambda x: IndexedRow(x[0],x[1]))).toBlockMatrix().transpose().toIndexedRowMatrix()
pred_sims = pred.columnSimilarities()
来源:https://stackoverflow.com/questions/46663775/spark-cosine-distance-between-rows-using-dataframe