I have a table of around 50k distinct rows and 2 columns. You can think of each row as a movie, and the columns as attributes of that movie - "ID": id of that movie, "
You could try a solution similar to this Stack Overflow answer, though since your data is already tokenized (each row is a list of strings), you can skip both the tokenization step and the n-gram step.
Note that approxSimilarityJoin in PySpark returns the Jaccard distance rather than the Jaccard similarity. If you need the similarity specifically, subtract the distance from 1.
Your code would end up looking similar to:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Reuse the active session (the pyspark shell already defines `spark`)
spark = SparkSession.builder.getOrCreate()

db = spark.createDataFrame([
    ('movie_1', ['romantic', 'comedy', 'English']),
    ('movie_2', ['action', 'kongfu', 'Chinese']),
    ('movie_3', ['romantic', 'action'])
], ['movie_id', 'genres'])

# Hash each genre list into a sparse vector, then fit MinHash LSH on those vectors
model = Pipeline(stages=[
    HashingTF(inputCol="genres", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=10)
]).fit(db)

db_hashed = model.transform(db)

# Self-join, keeping pairs whose Jaccard distance is below the 0.9 threshold
db_matches = model.stages[-1].approxSimilarityJoin(db_hashed, db_hashed, 0.9)

# show all matches (including self-matches and mirrored pairs)
db_matches.select(f.col('datasetA.movie_id').alias('movie_id_A'),
                  f.col('datasetB.movie_id').alias('movie_id_B'),
                  f.col('distCol')).show()

# show non-duplicate matches (each unordered pair once, no self-matches)
db_matches.select(f.col('datasetA.movie_id').alias('movie_id_A'),
                  f.col('datasetB.movie_id').alias('movie_id_B'),
                  f.col('distCol')).filter('movie_id_A < movie_id_B').show()
With the corresponding output:
+----------+----------+-------+
|movie_id_A|movie_id_B|distCol|
+----------+----------+-------+
|   movie_3|   movie_3|    0.0|
|   movie_1|   movie_3|   0.75|
|   movie_2|   movie_3|   0.75|
|   movie_1|   movie_1|    0.0|
|   movie_2|   movie_2|    0.0|
|   movie_3|   movie_2|   0.75|
|   movie_3|   movie_1|   0.75|
+----------+----------+-------+

+----------+----------+-------+
|movie_id_A|movie_id_B|distCol|
+----------+----------+-------+
|   movie_1|   movie_3|   0.75|
|   movie_2|   movie_3|   0.75|
+----------+----------+-------+
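
As a sanity check on the distCol values, here is a minimal plain-Python sketch (no Spark involved) that computes the exact Jaccard distance for the same three movies and converts it to a similarity by subtracting from 1. The helper name `jaccard_distance`, the `movies` dict, and the strict `<` comparison against the threshold are assumptions made for illustration; the Spark join is approximate, so on real data it may miss some pairs that this exact computation would find.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |intersection| / |union| for two iterables of tokens."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Same toy data as in the Spark example above
movies = {
    'movie_1': ['romantic', 'comedy', 'English'],
    'movie_2': ['action', 'kongfu', 'Chinese'],
    'movie_3': ['romantic', 'action'],
}

threshold = 0.9  # same cutoff passed to the Spark join above

# Each unordered pair once, mirroring the 'movie_id_A < movie_id_B' filter
pairs = []
for (id_a, genres_a), (id_b, genres_b) in combinations(sorted(movies.items()), 2):
    dist = jaccard_distance(genres_a, genres_b)
    if dist < threshold:
        # (movie_id_A, movie_id_B, distance, similarity)
        pairs.append((id_a, id_b, dist, 1.0 - dist))

for row in pairs:
    print(row)
```

movie_1 and movie_3 share one genre out of four distinct ones, so their distance is 1 - 1/4 = 0.75 and their similarity is 0.25, which agrees with the second table. movie_1 and movie_2 share nothing (distance 1.0), so they fall outside the 0.9 threshold.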