Question
I need the k nearest neighbors for each feature vector in a DataFrame. I'm using BucketedRandomProjectionLSHModel from PySpark.
Code for creating the model:
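from pyspark.ml.feature import BucketedRandomProjectionLSH
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n)
model = brp.fit(data_df)  # returns a BucketedRandomProjectionLSHModel
df_lsh = model.transform(data_df)  # adds the "hashes" column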
Now, how do I run an approximate nearest neighbor query for each point in data_df?
I tried broadcasting the model but got a pickle error.
Defining a UDF that accesses the model also fails, with the error "Method __getstate__([]) does not exist".
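Answer 1:
You should use .approxSimilarityJoin and join the dataset with itself. The model wraps a JVM object, so it cannot be pickled, broadcast, or called from inside a UDF. Here threshold is the maximum Euclidean distance between pairs to keep:
model.approxSimilarityJoin(df_lsh, df_lsh, threshold)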
Source: https://stackoverflow.com/questions/46119437/using-lsh-in-spark-to-run-nearest-neighbors-query-on-every-point-in-dataframe