Question
I am trying to use Spark's LSH implementation to find the nearest neighbours for each user on a very large dataset containing 50,000 rows and ~5,000 features per row. Here is the relevant code:
MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(3)
        .setInputCol("features")
        .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);
Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(
        dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
approxSimilarityJoin.show();
The job gets stuck at the approxSimilarityJoin() call and never gets beyond it. Please let me know how to solve this.
Answer 1:
It will finish if you leave it long enough, but there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm (sketched after the link below):
- hashes the inputs,
- joins the two datasets on the hashes,
- computes the Jaccard distance using a UDF, and
- filters the dataset with your threshold.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
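Pieced together, that flow looks roughly like the sketch below. This is a conceptual illustration rather than Spark's actual internals; the entry/hashValue column names mirror the Scala source, and model/dataset are the objects from the question.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.posexplode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 1. Hash the inputs: transform() appends the "hashes" array column, and
//    posexplode() yields one row per (hash table index, hash value) pair.
Dataset<Row> exploded = model.transform(dataset)
        .select(col("*"), posexplode(col("hashes")).as(new String[]{"entry", "hashValue"}));

// 2. Two such exploded datasets are joined on ("entry", "hashValue") to find
//    candidate pairs -- the shuffle this triggers is the expensive step.
// 3. A UDF then computes the Jaccard distance for each candidate pair, and
// 4. pairs whose distance exceeds the threshold are filtered out.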
The join is probably the slow part here, since the data is shuffled. Some things to try:
1. Change your DataFrame's input partitioning.
2. Change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
3. Your dataset looks small enough that you could use spark.sql.functions.broadcast(dataset) for a map-side join.
4. Are these vectors sparse or dense? The algorithm works better with SparseVectors.

Of these four options, 2 and 3 have worked best for me, always using SparseVectors; a minimal sketch of each follows.
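For concreteness, here is a hedged sketch of options 1-4, assuming the same objects as the question (spark as the SparkSession, plus dataset, model, and config). The partition count 64 and the sparse-vector contents are arbitrary examples, not recommendations.

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Option 1: control the input partitioning before fitting/joining.
Dataset<Row> repartitioned = dataset.repartition(64);

// Option 2: lower the post-shuffle partition count from the default 200.
spark.conf().set("spark.sql.shuffle.partitions", "64");

// Option 3: broadcast one side of the join so it can happen map-side,
// avoiding a full shuffle of both datasets.
Dataset<Row> joined = model.approxSimilarityJoin(
        broadcast(dataset), dataset, config.getJaccardLimit(), "JaccardDistance");

// Option 4: feed MinHash sparse vectors. MinHash treats non-zero entries as
// set membership, so a sparse representation of ~5000 mostly-zero binary
// features only stores the active indices.
Vector features = Vectors.sparse(5000,
        new int[]{3, 17, 4096},       // indices of the "on" features (example values)
        new double[]{1.0, 1.0, 1.0}); // values just need to be non-zero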
Source: https://stackoverflow.com/questions/48927221/lsh-spark-stucks-forever-at-approxsimilarityjoin-function