How to join 2 dataframes in Spark based on a wildcard/regex condition?

Asked by 南方客 · 2021-01-24 23:38

I have two dataframes, df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard, and I want to join df2 against it based on that wildcard/regex condition.

1 Answer

  • Answered 2021-01-25 00:16

    There is a "magic" matching operator for this - it is called rlike:

    import org.apache.spark.sql.functions.expr
    import spark.implicits._  // spark is the active SparkSession (e.g. in spark-shell)

    // location holds the regex patterns, url holds the strings to match against them
    val df1 = Seq("stackoverflow.com/questions/.*$","^*.cnn.com$", "nn.com/*/politics").toDF("location")
    val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

    df2.join(df1, expr("url rlike location")).show
    +--------------------+--------------------+
    |                 url|            location|
    +--------------------+--------------------+
    |stackoverflow.com...|stackoverflow.com...|
    +--------------------+--------------------+
    
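    The same join can also be expressed in plain Spark SQL over temporary views - a minimal equivalent sketch, where the view names urls and patterns are made up for the example:

    df1.createOrReplaceTempView("patterns")
    df2.createOrReplaceTempView("urls")

    // RLIKE in SQL is the same operator used by expr("url rlike location") above
    spark.sql("SELECT url, location FROM urls JOIN patterns ON url RLIKE location").show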

    However, there are some caveats:

    • Patterns have to be proper regular expressions, anchored in the case of leading / trailing wildcards (a sketch for converting wildcard-style patterns into anchored regexes follows after the plan below).
    • The join is executed as a Cartesian-style nested loop join, i.e. every url is checked against every location pattern (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):

      == Physical Plan ==
      BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
      :- *Project [value#215 AS url#217]
      :  +- *Filter isnotnull(value#215)
      :     +- LocalTableScan [value#215]
      +- BroadcastExchange IdentityBroadcastMode
         +- *Project [value#209 AS location#211]
            +- *Filter isnotnull(value#209)
               +- LocalTableScan [value#209]
      

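    For the first caveat, here is a minimal sketch of turning wildcard-style patterns (like those in the question's data) into proper anchored regexes before the join; wildcardToRegex is a hypothetical helper, not part of Spark:

    import java.util.regex.Pattern

    // Quote the whole pattern so dots etc. stay literal, then re-open the quote
    // around each "*" so it becomes ".*", and finally anchor both ends.
    def wildcardToRegex(pattern: String): String =
      "^" + Pattern.quote(pattern).replace("*", "\\E.*\\Q") + "$"

    val anchoredPatterns = Seq("*.cnn.com", "nn.com/*/politics")
      .map(wildcardToRegex)
      .toDF("location")
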
    It is possible to pre-filter the candidates using the method I described in Efficient string matching in Apache Spark.
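
    As a simpler illustration (not the approach from that answer), assuming both dataframes can expose a literal equality key such as a hypothetical domain column, the candidate pairs can be narrowed with an equi-join before applying rlike, so Spark avoids the full nested loop:

    // Hypothetical: df1 and df2 both carry a literal "domain" column.
    // The equi-join can use a hash / sort-merge join; rlike then only runs
    // on row pairs that already share the same domain.
    val candidates = df2.join(df1, Seq("domain"))
    val matched    = candidates.filter(expr("url rlike location"))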
