Question
I have an RDD consisting of dense vectors, each of which contains a probability distribution, like below:
[DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
DenseVector([0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691]),
DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
DenseVector([0.0924, 0.0699, 0.083, 0.0706, 0.0766, 0.0708, 0.0705, 0.0793, 0.09, 0.0689, 0.0758, 0.0743, 0.0779]),
DenseVector([0.0806, 0.0751, 0.0785, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773])]
I want to calculate similarities between a vector and all the other vectors and store the result in a matrix.
I could convert the full RDD into a matrix and then take each row and calculate the distance against all the other rows, but I was wondering if there is a more efficient way to do this using PySpark RDD methods.
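For reference, here is roughly how the data above could be set up and the brute-force approach I have in mind. This is only a sketch: it assumes an existing SparkContext named sc, uses pyspark.ml.linalg.DenseVector, and truncates the vectors for brevity.

import numpy as np
from pyspark.ml.linalg import DenseVector

# Hypothetical setup: a small RDD of DenseVectors like the one shown above
# (vectors truncated for brevity).
vectors = [
    DenseVector([0.0806, 0.0751, 0.0786]),
    DenseVector([0.2252, 0.0422, 0.0864]),
    DenseVector([0.0924, 0.0699, 0.083]),
]
rdd = sc.parallelize(vectors)

# Brute-force idea: collect everything into a local matrix on the driver
# and compute all pairwise cosine similarities with NumPy.
mat = np.array(rdd.map(lambda v: v.toArray()).collect())
normed = mat / np.linalg.norm(mat, axis=1, keepdims=True)
cosine_matrix = normed @ normed.T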
Answer 1:
As far as I know, there isn't a built-in function for computing cosine similarity between rows, so you will have to be a little clever to get where you want.
First, create all pairs of rows with rdd.cartesian(rdd), which matches every row with every other row (including itself). Next, define a cosine similarity function and map it over the paired RDD. Finally, collect the results, cast them to a np.array, and reshape to 6x6.
Example:
import numpy as np

def cos_sim(row):
    # row is a (DenseVector, DenseVector) pair produced by cartesian()
    dot_product = row[0].dot(row[1])
    norm_a = np.sqrt(np.sum(row[0] * row[0]))
    norm_b = np.sqrt(np.sum(row[1] * row[1]))
    return dot_product / (norm_a * norm_b)

# Pair every vector with every other vector (n^2 pairs), compute the
# similarity for each pair, then reshape the flat result into a matrix.
rdd2 = rdd.cartesian(rdd)
cosine_similarities = rdd2.map(lambda x: cos_sim(x)).collect()
cosine_similarities = np.array(cosine_similarities).reshape((6, 6))
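If the number of rows isn't known ahead of time, the hard-coded 6 can be derived from the RDD itself (a small sketch, assuming the same rdd and cosine_similarities as above):

# Sketch: derive the matrix dimension from the RDD instead of hard-coding 6.
n = rdd.count()
similarity_matrix = np.array(cosine_similarities).reshape((n, n))

Keep in mind that cartesian() produces n^2 pairs and the result is collected to the driver, so this is fine for small RDDs like the one in the question but will get expensive as the number of vectors grows.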
Source: https://stackoverflow.com/questions/42659307/pyspark-calculate-custom-distance-between-all-vectors-in-a-rdd