How to join 2 dataframes in Spark based on a wildcard/regex condition?

Asked by 南方客 · 2021-01-24 23:38

I have two dataframes, df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, and I want to join df2 against it based on that wildcard/regex condition.

1 Answer

  • Answered 2021-01-25 00:16

    There is a "magic" matching operator for this - it is called rlike:

    import org.apache.spark.sql.functions.expr
    import spark.implicits._  // spark is the active SparkSession (e.g. in spark-shell)

    // location holds the regex patterns, url holds the strings to match against them
    val df1 = Seq("stackoverflow.com/questions/.*$","^*.cnn.com$", "nn.com/*/politics").toDF("location")
    val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

    df2.join(df1, expr("url rlike location")).show
    +--------------------+--------------------+
    |                 url|            location|
    +--------------------+--------------------+
    |stackoverflow.com...|stackoverflow.com...|
    +--------------------+--------------------+
    
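    The same join can also be expressed in plain Spark SQL over temporary views - a minimal equivalent sketch, where the view names urls and patterns are made up for the example:

    df1.createOrReplaceTempView("patterns")
    df2.createOrReplaceTempView("urls")

    // RLIKE in SQL is the same operator used by expr("url rlike location") above
    spark.sql("SELECT url, location FROM urls JOIN patterns ON url RLIKE location").show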

    However, there are some caveats:

    • Patterns have to be proper regular expressions, anchored in the case of leading / trailing wildcards (a sketch for converting wildcard-style patterns into anchored regexes follows after the plan below).
    • The join is executed as a Cartesian-style nested loop join, i.e. every url is checked against every location pattern (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):

      == Physical Plan ==
      BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
      :- *Project [value#215 AS url#217]
      :  +- *Filter isnotnull(value#215)
      :     +- LocalTableScan [value#215]
      +- BroadcastExchange IdentityBroadcastMode
         +- *Project [value#209 AS location#211]
            +- *Filter isnotnull(value#209)
               +- LocalTableScan [value#209]
      

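    For the first caveat, here is a minimal sketch of turning wildcard-style patterns (like those in the question's data) into proper anchored regexes before the join; wildcardToRegex is a hypothetical helper, not part of Spark:

    import java.util.regex.Pattern

    // Quote the whole pattern so dots etc. stay literal, then re-open the quote
    // around each "*" so it becomes ".*", and finally anchor both ends.
    def wildcardToRegex(pattern: String): String =
      "^" + Pattern.quote(pattern).replace("*", "\\E.*\\Q") + "$"

    val anchoredPatterns = Seq("*.cnn.com", "nn.com/*/politics")
      .map(wildcardToRegex)
      .toDF("location")
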
    It is possible to pre-filter the candidates using the method I described in Efficient string matching in Apache Spark.
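
    As a simpler illustration (not the approach from that answer), assuming both dataframes can expose a literal equality key such as a hypothetical domain column, the candidate pairs can be narrowed with an equi-join before applying rlike, so Spark avoids the full nested loop:

    // Hypothetical: df1 and df2 both carry a literal "domain" column.
    // The equi-join can use a hash / sort-merge join; rlike then only runs
    // on row pairs that already share the same domain.
    val candidates = df2.join(df1, Seq("domain"))
    val matched    = candidates.filter(expr("url rlike location"))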
