val df1 = sc.parallelize(Seq(
  ("a1", 10, "ACTIVE", "ds1"),
  ("a1", 20, "ACTIVE", "ds1"),
  ("a2", 50, "ACTIVE", "ds1"),
  ("a3", 60, "ACTIVE", "ds1"))).toDF("c1", "c2", "c3", "c4")
First, a small thing. I use different names for the columns in df2:
val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
No big deal, but this made things easier for me to reason about.
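The ... above stands in for df2's rows from the question, which aren't repeated here. Purely to make the walkthrough concrete, here is a sketch of what they would be, inferred from the outputs shown further down in this thread (an assumption, not the question's exact definition):

val df2 = sc.parallelize(Seq(
  ("a1", 10, "ACTIVE", "ds2"),  // assumed rows, inferred from the df3 output in the next answer
  ("a1", 20, "ACTIVE", "ds2"),
  ("a1", 30, "ACTIVE", "ds2"),
  ("a1", 40, "ACTIVE", "ds2"))).toDF("d1", "d2", "d3", "d4")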
Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:
import org.apache.spark.sql.functions.lit

val join = df1
  .join(df2, df1("c1") === df2("d1"), "inner")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
  .dropDuplicates
Here I do the following:
- join df1 and df2 on the c1 and d1 columns
- select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2

This basically just filters out everything in df2 that does not have a corresponding key in c1 in df1.
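With the sample data (using the df2 sketch above), join at this point would contain the four df2 rows, re-tagged with ds1; shown here only to make the trace concrete:

+---+---+------+---+
| d1| d2|    d3| d4|
+---+---+------+---+
| a1| 10|ACTIVE|ds1|
| a1| 20|ACTIVE|ds1|
| a1| 30|ACTIVE|ds1|
| a1| 40|ACTIVE|ds1|
+---+---+------+---+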
Next I diff:
val diff = join
  .except(df1)
  .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
This is a basic set operation that finds everything in join that is not in df1. (Like union, except matches columns by position, not by name, which is why the d-named columns line up against df1's c-named ones.) These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
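With the same sample data, diff would hold exactly the two rows to deactivate, which reappear at the bottom of the final result:

+---+---+--------+---+
| d1| d2|      d3| d4|
+---+---+--------+---+
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+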
All that's left is to put them all together:
df1.union(diff)
This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+
And again, you don't need all these intermediate values; I was just verbose to help trace through the process.
Here is a dirty solution:
from pyspark.sql import functions as F
# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2, df1.c1 == df2.c1)\
         .select(df2.c1, df2.c2, df2.c3, df2.c4)\
         .dropDuplicates()
df3.show()
+---+---+------+---+
| c1| c2|    c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+
# union df3 with df1, then recompute c3 and c4: any row whose c4 is 'ds2'
# came only from df2, so it becomes INACTIVE and is re-tagged 'ds1'
# (note: dropDuplicates keeps an arbitrary row per (c1, c2) key, which is
# part of why this is dirty -- a key present in both frames could surface
# with c4 == 'ds2' and be wrongly marked INACTIVE)
df1.union(df3).dropDuplicates(['c1', 'c2'])\
   .select('c1', 'c2',
           F.when(F.col('c4') == 'ds2', 'INACTIVE').otherwise('ACTIVE').alias('c3'),
           F.lit('ds1').alias('c4')
   )\
   .orderBy('c1', 'c2')\
   .show()
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
+---+---+--------+---+
I enjoyed the challenge, and here is my solution.
val c1keys = df1.select("c1").distinct
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
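For tracing, df2inactive at this point would contain just the two rows to flag. Note they keep df2's original ds2 tag, which is why it shows up in the final output:

+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+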
scala> df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+