Question
I have a PySpark DataFrame, df1, that looks like:
CustomerID CustomerValue CustomerValue2
12 .17 .08
I have a second PySpark DataFrame, df2:
CustomerID CustomerValue CustomerValue2
15 .17 .14
16 .40 .43
18 .86 .09
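(For reference, the sample frames can be built like this; a minimal sketch assuming an active SparkSession named spark:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(12, 0.17, 0.08)],
    ['CustomerID', 'CustomerValue', 'CustomerValue2'])
df2 = spark.createDataFrame(
    [(15, 0.17, 0.14), (16, 0.40, 0.43), (18, 0.86, 0.09)],
    ['CustomerID', 'CustomerValue', 'CustomerValue2'])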
I want to take the cosine similarity of the two DataFrames and end up with something like this:
CustomerID CustomerID2 CosineCustVal CosineCustVal2
15 12 1 .90
16 12 .45 .67
18 12 .8 .04
Answer 1:
You can calculate cosine similarity only between two vectors, not between two numbers. That said, if the CustomerValue columns are the components of a vector representing the feature whose similarity you want between two customers, you can do it by transposing the data frames and then joining on the column names.
The transposition can be done with an explode (more details about transposing a data frame here):
from pyspark.sql import functions as F

# Explode each value column into (key, value) rows
kvs = F.explode(F.array([
    F.struct(F.lit(c).alias('key'), F.col(c).alias('value'))
    for c in ['CustomerValue', 'CustomerValue2']
])).alias('kvs')

dft1 = (df1.select('CustomerID', kvs)
        .select('CustomerID',
                F.col('kvs.key').alias('column_name'),
                F.col('kvs.value').alias('column_value')))
dft2 = (df2.select('CustomerID', kvs)
        .select('CustomerID',
                F.col('kvs.key').alias('column_name'),
                F.col('kvs.value').alias('column_value')))
where dft1 and dft2 denote the transposed data frames.
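For the sample data, dft1 contains one row per (customer, component) pair:
CustomerID column_name    column_value
12         CustomerValue  .17
12         CustomerValue2 .08
Once you have transposed them, you can join the two frames on the column names: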
dft2 = (dft2.withColumnRenamed('CustomerID', 'CustomerID2')
        .withColumnRenamed('column_value', 'column_value2'))

cosine = (dft1.join(dft2, dft1.column_name == dft2.column_name)
          .groupBy('CustomerID', 'CustomerID2')
          .agg(F.sum(F.col('column_value') * F.col('column_value2'))
               .alias('cosine_similarity')))
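The aggregation computes the dot product: for vectors u and v, cos(u, v) = (u · v) / (‖u‖ ‖v‖), and the sum of element-wise products above is the numerator u · v; the denominator is handled by the normalization step described below.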
Now cosine has three columns: the CustomerID from the first and second data frames and the cosine similarity (provided the values were normalized first). This has the advantage that you only get rows for CustomerID pairs with a nonzero similarity (in case some CustomerIDs have zero values). For your example:
df1:
CustomerID CustomerValue CustomerValue2
12 .17 .08
df2:
CustomerID CustomerValue CustomerValue2
15 .17 .14
16 .40 .43
18 .86 .09
cosine:
CustomerID CustomerID2 cosine_similarity
12 15 .0401
12 16 .1024
12 18 .1534
Of course, these are not the real cosine similarities yet; you need to normalize the values first. You can compute each customer's norm with a group by and divide by it after a join:
# df stands for either dft1 or dft2
norms = (df.groupBy('CustomerID')
         .agg(F.sqrt(F.sum(F.col('column_value') * F.col('column_value'))).alias('norm')))
df = (df.join(norms, 'CustomerID')
      .select('CustomerID', 'column_name',
              (F.col('column_value') / F.col('norm')).alias('column_value')))
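Putting it together, a minimal sketch of the full pipeline order (normalize is a hypothetical helper wrapping the snippet above; apply it to both transposed frames before the rename/join/aggregate steps):

def normalize(df):
    # Divide every component by its customer's L2 norm (hypothetical helper)
    norms = (df.groupBy('CustomerID')
             .agg(F.sqrt(F.sum(F.col('column_value') * F.col('column_value'))).alias('norm')))
    return (df.join(norms, 'CustomerID')
            .select('CustomerID', 'column_name',
                    (F.col('column_value') / F.col('norm')).alias('column_value')))

dft1 = normalize(dft1)
dft2 = normalize(dft2)
# ... then rename, join, and aggregate as shown earlier to get true cosine similarities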
After normalizing the values, your cosine similarities become the following:
CustomerID CustomerID2 cosine_similarity
12 15 .970
12 16 .928
12 18 .945
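As a sanity check on the first pair: ‖(.17, .08)‖ = √(.17² + .08²) ≈ .188, ‖(.17, .14)‖ = √(.17² + .14²) ≈ .220, and .0401 / (.188 × .220) ≈ .970, matching the table.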
The large similarity values are due to the low dimensionality (two components only).
Source: https://stackoverflow.com/questions/52542903/cosine-similarity-for-two-pyspark-dataframes