I have two datasets. The first is a large reference dataset, and for each row of the second dataset I want to find the best match in the first using the MinHash algorithm.
val dataset1
I don't think it is possible to set two input columns (one dataString column for each element used, a' or b') and then apply an OR while computing the distance. You can, however, transform dataset1 to represent both the x' + y' + a' and the x' + y' + b' variants and then do the distance computation. This won't give you exactly the same answer as selecting a' or b' based on the corresponding row in dataset2 (I think you know how to do that expensive operation), but it still gives some sense of similarity.
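import org.apache.spark.sql.functions.{array, concat_ws, explode}
// The $"..." syntax also needs spark.implicits._ in scope (automatic in spark-shell).
// One output row per variant: "a" holds a' in one row and b' in the other.
val dataset1splitted =
  dataset1
    .withColumn("a", explode(array("a'", "b'")))
    .drop("a'", "b'", "dataString")
    // Rebuild dataString from x', y' and the selected variant.
    .withColumn("dataString", concat_ws("|", $"x'", $"y'", $"a"))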
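
For completeness, here is a rough sketch of how the distance computation on the exploded rows might look with Spark ML's MinHashLSH. Several things here are my assumptions and not from the question: both datasets have an id column, dataset2 already has a dataString column in the same x|y|a format, and each "|"-separated field is treated as one element of the MinHash set (you may well prefer shingles or words). The object name, the matchVariants helper and the toy data are hypothetical.

```scala
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, RegexTokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, concat_ws, explode}

object MinHashVariantsSketch {

  // Hypothetical helper: returns candidate pairs (id1, variant, id2, jaccardDistance),
  // sorted so that the closest variant for each dataset2 row comes first.
  def matchVariants(dataset1: DataFrame, dataset2: DataFrame, maxDistance: Double): DataFrame = {
    // One row per variant, same transformation as in the snippet above.
    val dataset1splitted = dataset1
      .withColumn("a", explode(array("a'", "b'")))
      .drop("a'", "b'", "dataString")
      .withColumn("dataString", concat_ws("|", col("x'"), col("y'"), col("a")))

    // Assumption: every "|"-separated field is one element of the set.
    val tokenizer = new RegexTokenizer()
      .setInputCol("dataString").setOutputCol("tokens").setPattern("\\|")
    val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("features")
    val minHash = new MinHashLSH()
      .setInputCol("features").setOutputCol("hashes")
      .setNumHashTables(5).setSeed(42L)

    def featurize(df: DataFrame): DataFrame =
      hashingTF.transform(tokenizer.transform(df))

    val features1 = featurize(dataset1splitted)
    val features2 = featurize(dataset2)
    val model = minHash.fit(features1)

    // Approximate join: keeps candidate pairs whose Jaccard distance is below maxDistance.
    model.approxSimilarityJoin(features1, features2, maxDistance, "jaccardDistance")
      .select(
        col("datasetA.id").as("id1"),
        col("datasetA.a").as("variant"),
        col("datasetB.id").as("id2"),
        col("jaccardDistance"))
      .orderBy("id2", "jaccardDistance")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("minhash-variants").getOrCreate()
    import spark.implicits._

    // Toy data, purely illustrative.
    val dataset1 = Seq(
      ("r1", "alpha", "beta", "gamma", "delta"),
      ("r2", "one", "two", "three", "four")
    ).toDF("id", "x'", "y'", "a'", "b'")
    val dataset2 = Seq(("q1", "alpha|beta|delta")).toDF("id", "dataString")

    matchVariants(dataset1, dataset2, 0.8).show(false)
    spark.stop()
  }
}
```

Because each reference row now appears once per variant, the same id1 can show up twice for one id2; keeping only the row with the smallest jaccardDistance per id2 gives the "best of a' or b'" behaviour. Keep in mind that approxSimilarityJoin is approximate, so a pair that never shares a hash bucket can be missed even if its true distance is below the threshold.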