Cosine Similarity for two pyspark dataframes


Question


I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue CustomerValue2 
12          .17           .08

I have a second PySpark DataFrame, df2:

 CustomerID  CustomerValue CustomerValue2
 15          .17           .14
 16          .40           .43
 18          .86           .09

I want to take the cosine similarity of the two dataframes and get something like this:

 CustomerID  CustomerID2  CosineCustVal1 CosineCustVal2
 15          12           1            .90
 16          12           .45          .67
 18          12           .8           .04

Answer 1:


You can calculate cosine similarity only between two vectors, not between two numbers. That said, if the columns called CustomerValue are the components of a vector representing the feature you want to compare between two customers, you can do it by transposing the data frames and then joining on the CustomerValues.
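For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. A minimal pure-Python sketch (the function name is illustrative, not part of the original answer):

import math

def cosine_similarity(u, v):
    # dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# cosine_similarity([.17, .08], [.17, .14]) ≈ 0.969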

The transposition can be done with an explode (more details about transposing a data frame here):

from pyspark.sql import functions as F

# Explode each row into (column name, value) pairs for the two value columns
kvs = F.explode(F.array([
        F.struct(F.lit(c).alias('name'), F.col(c).alias('value'))
        for c in ['CustomerValue', 'CustomerValue2']
      ])).alias('kvs')

dft1 = (df1.select('CustomerID', kvs)
        .select('CustomerID', F.col('kvs.name').alias('column_name'), F.col('kvs.value').alias('column_value'))
        )
dft2 = (df2.select('CustomerID', kvs)
        .select('CustomerID', F.col('kvs.name').alias('column_name'), F.col('kvs.value').alias('column_value'))
        )
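For the example df1, the transposed data frame dft1 looks like this (output shown as a comment; illustrative):

dft1.show()
# +----------+--------------+------------+
# |CustomerID|   column_name|column_value|
# +----------+--------------+------------+
# |        12| CustomerValue|        0.17|
# |        12|CustomerValue2|        0.08|
# +----------+--------------+------------+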

With dft1 and dft2 transposed this way, you can join them on the column names:

dft2 = (dft2.withColumnRenamed('CustomerID', 'CustomerID2')
        .withColumnRenamed('column_value', 'column_value2')
       )
cosine = (dft1.join(dft2, dft1.column_name == dft2.column_name)
          .groupBy('CustomerID', 'CustomerID2')
          .agg(F.sum(F.col('column_value') * F.col('column_value2')).alias('cosine_similarity'))
         )

Now cosine has three columns: the CustomerID from the first data frame, the CustomerID from the second, and the cosine similarity (provided that the values were normalized first). This has the advantage that you only get rows for CustomerID pairs with a nonzero similarity (in case of zero values for some CustomerIDs). For your example:

df1:

CustomerID CustomerValue CustomerValue2
12         .17           .08

df2:

CustomerID CustomerValue CustomerValue2
15         .17           .14
16         .40           .43
18         .86           .09

cosine:

CustomerID CustomerID2 cosine_similarity
12         15          .0401
12         16          .1024
12         18          .1534
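For instance, the first row is just the dot product of the two raw value vectors: .17 × .17 + .08 × .14 = .0289 + .0112 = .0401.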

Of course these are not the real cosine similarities yet; you need to normalize the values first. You can compute each customer's norm with a group-by and then join it back to divide:

# apply this to each transposed frame, i.e. dft = dft1 and then dft = dft2
norms = (dft.groupBy('CustomerID')
         .agg(F.sqrt(F.sum(F.col('column_value') * F.col('column_value'))).alias('norm'))
        )
dft_norm = (dft.join(norms, 'CustomerID')
            .select('CustomerID', 'column_name',
                    (F.col('column_value') / F.col('norm')).alias('column_value_norm'))
           )
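Re-running the join above on the normalized frames then yields the true cosine similarities. A minimal sketch, assuming dft1_norm and dft2_norm are the outputs of the normalization step for df1 and df2:

dft2_norm = (dft2_norm.withColumnRenamed('CustomerID', 'CustomerID2')
             .withColumnRenamed('column_value_norm', 'column_value_norm2')
            )
cosine = (dft1_norm.join(dft2_norm, dft1_norm.column_name == dft2_norm.column_name)
          .groupBy('CustomerID', 'CustomerID2')
          .agg(F.sum(F.col('column_value_norm') * F.col('column_value_norm2'))
               .alias('cosine_similarity'))
         )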

After normalizing the values, the cosine similarities become the following:

CustomerID CustomerID2 cosine_similarity
12         15          .969
12         16          .928
12         18          .944

The large similarity values are due to the low dimensionality (two components only).
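As a quick sanity check outside Spark (illustrative, not part of the original answer), the same numbers fall out of plain numpy:

import numpy as np

u = np.array([.17, .08])  # customer 12
for cid, v in [(15, [.17, .14]), (16, [.40, .43]), (18, [.86, .09])]:
    v = np.array(v)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    print(cid, round(cos, 3))

# 15 0.969
# 16 0.928
# 18 0.944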



Source: https://stackoverflow.com/questions/52542903/cosine-similarity-for-two-pyspark-dataframes
