I have two dataframes, df1 and df2. Suppose there is a location column in df1 which may contain a regular URL or a URL with a wildcard.
There exists a "magic" matching operator for this: it is called rlike.
import org.apache.spark.sql.functions.expr

val df1 = Seq("stackoverflow.com/questions/.*$", "^*.cnn.com$", "nn.com/*/politics").toDF("location")
val df2 = Seq("stackoverflow.com/questions/47272330").toDF("url")

df2.join(df1, expr("url rlike location")).show
+--------------------+--------------------+
| url| location|
+--------------------+--------------------+
|stackoverflow.com...|stackoverflow.com...|
+--------------------+--------------------+
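Note that RLIKE uses Java regex find semantics: the pattern only has to match somewhere inside the string, not the whole string. A minimal pure-Scala sketch of this behavior (the rlike helper below is just an illustration, not Spark's implementation):

```scala
import java.util.regex.Pattern

// Illustrative helper: true if the regex matches anywhere in s,
// mirroring the find (substring) semantics of Spark SQL's RLIKE.
def rlike(s: String, regex: String): Boolean =
  Pattern.compile(regex).matcher(s).find()

val url = "stackoverflow.com/questions/47272330"
rlike(url, "stackoverflow.com/questions/.*$")  // true: pattern found inside the string
rlike(url, "nn.com/*/politics")                // false: no match anywhere
```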
However, there are some caveats:
It is executed with a Cartesian product (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?):
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, url#217 RLIKE location#211
:- *Project [value#215 AS url#217]
: +- *Filter isnotnull(value#215)
: +- LocalTableScan [value#215]
+- BroadcastExchange IdentityBroadcastMode
+- *Project [value#209 AS location#211]
+- *Filter isnotnull(value#209)
+- LocalTableScan [value#209]
It is possible to pre-filter candidates using the method I described in Efficient string matching in Apache Spark.
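One way to sketch that pre-filtering idea (this is a simplified illustration of the general approach, not the exact method from that answer; literalPrefix and the other names below are hypothetical): extract the longest literal prefix of each pattern and use it as a cheap substring check before the expensive regex evaluation.

```scala
import java.util.regex.Pattern

// Hypothetical helper: characters of the pattern up to the first
// regex metacharacter, after dropping a leading anchor.
def literalPrefix(regex: String): String =
  regex.stripPrefix("^").takeWhile(c => !"\\.*+?[](){}|^$".contains(c))

val patterns = Seq("stackoverflow.com/questions/.*$", "nn.com/*/politics")
val url      = "stackoverflow.com/questions/47272330"

// Cheap pre-filter: keep only patterns whose literal prefix occurs in the URL.
val candidates = patterns.filter(p => url.contains(literalPrefix(p)))
// Expensive regex check only on the surviving candidates.
val matches = candidates.filter(p => Pattern.compile(p).matcher(url).find())
```

In Spark the same idea can be expressed as an equi-join on such a coarse key (or a tokenized variant of it) to shrink the candidate pairs, with rlike applied only afterwards.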