Filter dataframe by value NOT present in column of other dataframe [duplicate]

暗喜

2 Answers

失恋的感觉

2021-01-12 18:32

I've discovered a simpler way to solve this: it seems that an anti-join can be requested by passing "leftanti" as the join type parameter of the join method, although the Spark Scaladoc does not describe it:

// Assumes a SparkSession named spark (as in spark-shell); its implicits provide toDF.
import spark.implicits._

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

df1.join(df2, df1("city") === df2("cities"), "leftanti").show

Results in:

+----------+-------+ 
|  location|   city| 
+----------+-------+ 
|Chittagong|Chennai| 
+----------+-------+

P.S. Thanks for the pointer to the duplicate; I have marked the question as such.

轮回少年

2021-01-12 18:41

If you are trying to filter a DataFrame using another DataFrame, you should use join (or one of its variants). If you instead need to filter by a List or any other data structure small enough to fit in memory on the driver and the executors, you can broadcast it and then reference it inside the filter or where method.

For instance, I would do something like:

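import org.apache.spark.sql.functions._
// Assumes a SparkSession named spark (as in spark-shell); its implicits provide toDF and $.
import spark.implicits._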

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

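// A full outer join keeps unmatched rows from both sides, filling the missing side with nulls.
df2.join(df1, joinExprs=df1("city") === df2("cities"), joinType="full_outer")
   .select("city", "cities")
   // Keep only rows with no match in df2, i.e. cities from df1 that are absent from df2.
   .where(isnull($"cities"))
   .drop("cities").show()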
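
The broadcast option mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under some assumptions: the values to exclude are held in a small in-memory Set rather than in a second DataFrame, and the names BroadcastFilterSketch, keepNotIn and excludedBc are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object BroadcastFilterSketch {
  // Hypothetical helper: keeps the rows whose `column` value is absent from `excluded`.
  // Assumption: `excluded` is small enough to fit in memory on the driver and every executor.
  def keepNotIn(spark: SparkSession, df: DataFrame, column: String, excluded: Set[String]): DataFrame = {
    // Ship the set to each executor once instead of serializing it with every task.
    val excludedBc = spark.sparkContext.broadcast(excluded)
    // Explicit Row type selects the Scala-function overload of filter.
    df.filter((row: Row) => !excludedBc.value.contains(row.getAs[String](column)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(("Hampstead", "London"),
                  ("Spui", "Amsterdam"),
                  ("Chittagong", "Chennai")).toDF("location", "city")

    // Same values as df2 above, but held as a plain collection.
    keepNotIn(spark, df1, "city", Set("London", "Amsterdam", "New York")).show()

    spark.stop()
  }
}
```

With the sample data this should leave only the Chittagong/Chennai row, and unlike the full outer join above it keeps the location column. When the values to exclude already live in a DataFrame, a join is usually the more natural choice.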
