String similarity with OR condition in MinHash Spark ML

别跟我提以往 2021-01-27 04:21

I have two datasets. The first is a large reference dataset; for each row of the second dataset I want to find the best match in the first dataset using the MinHash algorithm.

val dataset1         


        
1 Answer
  • 2021-01-27 04:51

    I don't think it is possible to set two input columns (one dataString column for each of a' and b') and have the computation apply an OR between them. You can, however, transform dataset1 so that each row is represented by both variants, x' + y' + a' and x' + y' + b', and then run the distance computation against those. This won't give exactly the same answer as selecting a' or b' based on the corresponding row in dataset2 (I think you know how to do that expensive operation), but it should still give some sense of similarity.

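    import org.apache.spark.sql.functions.{array, concat_ws, explode}
    import spark.implicits._ // for $"..."; assumes a SparkSession named spark

    val dataset1splitted =
        dataset1
        // one output row per variant: "a" holds the value of a', then of b'
        .withColumn( "a", explode( array( "a'", "b'" ) ) )
        .drop( "a'", "b'", "dataString" )
        // rebuild dataString from x', y' and the selected variant
        .withColumn( "dataString", concat_ws( "|", $"x'", $"y'", $"a" ) )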
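
    For the distance computation itself, here is a rough sketch of how the exploded dataset could be fed to MinHashLSH and reduced back to one best match per dataset2 row. Several things here are my assumptions and not from the question: both datasets are assumed to have an `id` column, dataset2 is assumed to have a `dataString` column built the same way, the character 3-gram featurization is just one possible choice, and the name `bestMatches`, the number of hash tables, and the 0.8 threshold are placeholders.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, MinHashLSHModel, NGram, RegexTokenizer}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, min, struct}

// Hypothetical helper: assumes both inputs have "id" and "dataString" columns.
def bestMatches(dataset1splitted: DataFrame, dataset2: DataFrame): DataFrame = {
  // Split dataString into single characters (assumed featurization).
  val tokenizer = new RegexTokenizer()
    .setInputCol("dataString").setOutputCol("chars")
    .setPattern(".").setGaps(false)
  // Character 3-grams; strings shorter than 3 characters produce no tokens,
  // and MinHashLSH rejects all-zero vectors, so filter those rows out first.
  val ngram = new NGram().setN(3).setInputCol("chars").setOutputCol("ngrams")
  val tf = new HashingTF().setInputCol("ngrams").setOutputCol("features")
  val lsh = new MinHashLSH()
    .setNumHashTables(5) // placeholder value
    .setInputCol("features").setOutputCol("hashes")

  val model = new Pipeline()
    .setStages(Array(tokenizer, ngram, tf, lsh))
    .fit(dataset1splitted)

  val refHashed = model.transform(dataset1splitted)
  val queryHashed = model.transform(dataset2)
  val lshModel = model.stages.last.asInstanceOf[MinHashLSHModel]

  // Join on approximate Jaccard distance, then keep the closest reference
  // row per query row. Taking the minimum over both exploded variants is
  // what stands in for the "a' OR b'" condition.
  lshModel
    .approxSimilarityJoin(refHashed, queryHashed, 0.8, "distance") // placeholder threshold
    .select(col("datasetB.id").as("queryId"), col("datasetA.id").as("refId"), col("distance"))
    .groupBy("queryId")
    .agg(min(struct(col("distance"), col("refId"))).as("best"))
    .select(col("queryId"), col("best.refId"), col("best.distance"))
}
```

    Because each dataset1 row appears twice after the explode, the same `refId` can match a query row through either variant; the `min` over `(distance, refId)` keeps whichever variant is closer. Query rows with no candidate under the threshold simply do not appear in the result.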
    