Using LSH in Spark to run a nearest neighbors query on every point in a dataframe


Question


I need the k nearest neighbors for each feature vector in the dataframe. I'm using BucketedRandomProjectionLSH from pyspark.

Code for creating the model:

from pyspark.ml.feature import BucketedRandomProjectionLSH

# n is the bucket length chosen elsewhere
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", seed=12345, bucketLength=n)

model = brp.fit(data_df)
df_lsh = model.transform(data_df)

Now, how do I run an approximate nearest neighbor query for each point in data_df?

I have tried broadcasting the model, but got a pickle error. Defining a UDF to access the model also fails with the error Method __getstate__([]) does not exist.


Answer 1:


You should use .approxSimilarityJoin, which joins the dataset against itself and returns all pairs of points within a distance threshold:

model.approxSimilarityJoin(df_lsh, df_lsh, threshold)
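The join alone gives you every pair within the threshold, not the k nearest per point. A minimal sketch of one way to keep only the k closest neighbors for each point, assuming data_df has a unique id column (hypothetical here) and a hand-picked threshold of 2.0:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Self-join within a distance threshold; each row holds datasetA, datasetB and the distance.
# The threshold (2.0) is an assumption -- pick it large enough that every point finds at least k neighbors.
pairs = model.approxSimilarityJoin(df_lsh, df_lsh, 2.0, distCol="EuclideanDistance")

# Drop self-matches and keep the k closest neighbors per point.
# "id" is a hypothetical unique-key column in data_df; substitute your own key.
k = 5
w = Window.partitionBy("datasetA.id").orderBy(F.col("EuclideanDistance"))
knn = (pairs
       .filter(F.col("datasetA.id") != F.col("datasetB.id"))
       .withColumn("rank", F.row_number().over(w))
       .filter(F.col("rank") <= k))

Note that the self-join can still be expensive; the threshold and bucketLength need to be tuned so most candidate pairs are pruned by the LSH buckets.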


Source: https://stackoverflow.com/questions/46119437/using-lsh-in-spark-to-run-nearest-neighbors-query-on-every-point-in-dataframe
