Spark Python: How to calculate Jaccard Similarity between each line within an RDD?

-上瘾入骨i 2021-01-21 11:50

I have a table of around 50k distinct rows and 2 columns. You can think of each row as a movie, and the columns as the attributes of that movie - "ID": id of that movie, "

1 answer
  •  时光说笑
    2021-01-21 12:11

    You could try a solution similar to this Stack Overflow answer. Since your data is already tokenized (each row holds a list of strings), you can skip the tokenization step and the n-gram step.

    I'll also mention that approxSimilarityJoin in PySpark calculates the Jaccard distance rather than the Jaccard similarity, but you can subtract the distance from 1 to convert back to the similarity if you need that in particular.
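
    To make the distance/similarity relationship concrete, here is a minimal plain-Python sketch (no Spark required). The helper names jaccard_similarity and jaccard_distance are made up for illustration and are not part of PySpark, and returning 1.0 for two empty sets is just a convention assumed here. The genre lists are two of the sample rows used in the code further down.

```python
def jaccard_similarity(a, b):
    """|A & B| / |A | B| for two collections of tokens (hypothetical helper)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: treat two empty sets as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """The kind of value reported in distCol: 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)


movie_1 = ['romantic', 'comedy', 'English']
movie_3 = ['romantic', 'action']

dist = jaccard_distance(movie_1, movie_3)  # 1 shared genre out of 4 distinct
sim = 1.0 - dist                           # convert the distance back
print(dist, sim)  # prints: 0.75 0.25
```

    In Spark itself the same conversion would be a column expression along the lines of 1 - f.col('distCol') on the join result.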

    Your code would end up looking similar to:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, MinHashLSH
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Reuse the active session (e.g. in the pyspark shell) or create one.
    spark = SparkSession.builder.getOrCreate()

    db = spark.createDataFrame([
        ('movie_1', ['romantic', 'comedy', 'English']),
        ('movie_2', ['action', 'kongfu', 'Chinese']),
        ('movie_3', ['romantic', 'action']),
    ], ['movie_id', 'genres'])

    # Hash each genre list into a sparse vector, then build MinHash signatures.
    model = Pipeline(stages=[
        HashingTF(inputCol="genres", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=10),
    ]).fit(db)

    db_hashed = model.transform(db)

    # Self-join, keeping pairs within a Jaccard distance threshold of 0.9.
    db_matches = model.stages[-1].approxSimilarityJoin(db_hashed, db_hashed, 0.9)

    # Show all matches (including self-matches and mirrored pairs).
    db_matches.select(f.col('datasetA.movie_id').alias('movie_id_A'),
                      f.col('datasetB.movie_id').alias('movie_id_B'),
                      f.col('distCol')).show()

    # Show each pair only once (no self-matches, no mirrored pairs).
    db_matches.select(f.col('datasetA.movie_id').alias('movie_id_A'),
                      f.col('datasetB.movie_id').alias('movie_id_B'),
                      f.col('distCol')).filter('movie_id_A < movie_id_B').show()

    With the corresponding output:

    +----------+----------+-------+
    |movie_id_A|movie_id_B|distCol|
    +----------+----------+-------+
    |   movie_3|   movie_3|    0.0|
    |   movie_1|   movie_3|   0.75|
    |   movie_2|   movie_3|   0.75|
    |   movie_1|   movie_1|    0.0|
    |   movie_2|   movie_2|    0.0|
    |   movie_3|   movie_2|   0.75|
    |   movie_3|   movie_1|   0.75|
    +----------+----------+-------+
    
    +----------+----------+-------+
    |movie_id_A|movie_id_B|distCol|
    +----------+----------+-------+
    |   movie_1|   movie_3|   0.75|
    |   movie_2|   movie_3|   0.75|
    +----------+----------+-------+
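
    Because approxSimilarityJoin is approximate (a pair whose rows never land in the same hash bucket is not returned), it can help to cross-check a small sample against an exact computation. The following is a plain-Python sketch, not Spark code; exact_pairs is a hypothetical helper, and it assumes every row has at least one genre. Comparing every pair is only practical for small samples, since 50k rows would mean roughly 1.25 billion pairs.

```python
from itertools import combinations

movies = {
    'movie_1': ['romantic', 'comedy', 'English'],
    'movie_2': ['action', 'kongfu', 'Chinese'],
    'movie_3': ['romantic', 'action'],
}


def exact_pairs(rows, threshold):
    """Return (id_a, id_b, distance) for every pair with distance < threshold.

    Hypothetical helper; assumes each row has a non-empty token list.
    """
    result = []
    # sorted() + combinations() yields each unordered pair once, with id_a < id_b.
    for (id_a, a), (id_b, b) in combinations(sorted(rows.items()), 2):
        a, b = set(a), set(b)
        distance = 1.0 - len(a & b) / len(a | b)
        if distance < threshold:
            result.append((id_a, id_b, distance))
    return result


for id_a, id_b, distance in exact_pairs(movies, 0.9):
    # Last value is the Jaccard similarity, i.e. 1 - distance.
    print(id_a, id_b, distance, 1.0 - distance)
```

    For these three rows it prints the same two pairs as the second table above (distance 0.75, similarity 0.25); movie_1 and movie_2 share no genre, so their distance of 1.0 falls outside the 0.9 threshold.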
    
