How to join two DataFrames and change column for missing values?

余生分开走 2021-01-26 08:46
val df1 = sc.parallelize(Seq(
   ("a1",10,"ACTIVE","ds1"),
   ("a1",20,"ACTIVE","ds1"),
   ("a2",50,"ACTIVE","ds1"),
   ("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")


        
3 Answers
  •  南笙 (original poster)
    2021-01-26 09:02

    First, a small thing. I use different names for the columns in df2:

    val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
    

    No big deal, but this made things easier for me to reason about.

    Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:

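    import org.apache.spark.sql.functions.lit
    // $"..." needs spark.implicits._, which spark-shell preloads

    val join = df1
      .join(df2, df1("c1") === df2("d1"), "inner")       // keep only keys present on both sides
      .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))  // df2 columns, with ds2 replaced by ds1
      .dropDuplicates()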
    

    Here I do the following:

    • Inner join between df1 and df2 on the c1 and d1 columns
    • Select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2
    • Drop duplicates

    This effectively filters out every row of df2 whose d1 key has no match in the c1 column of df1.
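
    The join step can be tried on its own with a minimal sketch like the one below. The contents of df2 are an assumption (the post elides them); they were chosen only to be consistent with the final result shown further down. The local SparkSession is also an assumption; in spark-shell, spark already exists and that part can be skipped.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    // Assumption: a local session for experimenting; spark-shell provides `spark` already.
    val spark = SparkSession.builder().master("local[*]").appName("join-step").getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("a1", 10, "ACTIVE", "ds1"),
      ("a1", 20, "ACTIVE", "ds1"),
      ("a2", 50, "ACTIVE", "ds1"),
      ("a3", 60, "ACTIVE", "ds1")
    ).toDF("c1", "c2", "c3", "c4")

    // Hypothetical df2 rows (not given in the post).
    val df2 = Seq(
      ("a1", 10, "ACTIVE", "ds2"),
      ("a1", 20, "ACTIVE", "ds2"),
      ("a1", 30, "ACTIVE", "ds2"),
      ("a1", 40, "ACTIVE", "ds2"),
      ("a4", 20, "ACTIVE", "ds2")
    ).toDF("d1", "d2", "d3", "d4")

    val join = df1
      .join(df2, df1("c1") === df2("d1"), "inner")       // a4 has no match in df1, so it is dropped
      .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))  // keep df2's columns, relabel the source
      .dropDuplicates()                                  // each df2 row matched two a1 rows in df1

    join.orderBy($"d1", $"d2").show()
    ```

    With this sample data, join holds four rows, all for key a1 (values 10, 20, 30, 40). Without dropDuplicates there would be eight, because each of the four a1 rows in df2 pairs with both a1 rows in df1.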

    Next I diff:

    val diff = join
      .except(df1)                                            // rows in join that are not in df1
      .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")  // mark them as deactivated
    

    This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
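
    It may help to see why except works here even though join uses the names d1..d4 and df1 uses c1..c4. As far as I know, except compares whole rows column by column in positional order, ignores the column names, and returns distinct rows (like EXCEPT DISTINCT in SQL). A small sketch, where left and right are hypothetical stand-ins for join and df1:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: a local session for experimenting; spark-shell provides `spark` already.
    val spark = SparkSession.builder().master("local[*]").appName("except-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical miniature tables with deliberately different column names.
    val left  = Seq(("a1", 10), ("a1", 30), ("a1", 30)).toDF("d1", "d2")
    val right = Seq(("a1", 10), ("a2", 50)).toDF("c1", "c2")

    // Rows are matched by position; the result keeps the left side's column names
    // and contains each surviving row only once.
    val onlyInLeft = left.except(right)

    onlyInLeft.show()
    ```

    Here onlyInLeft contains the single row (a1, 30) under the columns d1 and d2: (a1, 10) is removed because it also appears in right, and the repeated (a1, 30) is collapsed into one row.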

    All that's left is to put them all together:

    df1.union(diff)
    

    This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:

    +---+---+--------+---+
    | c1| c2|      c3| c4|
    +---+---+--------+---+
    | a1| 10|  ACTIVE|ds1|
    | a1| 20|  ACTIVE|ds1|
    | a2| 50|  ACTIVE|ds1|
    | a3| 60|  ACTIVE|ds1|
    | a1| 30|INACTIVE|ds1|
    | a1| 40|INACTIVE|ds1|
    +---+---+--------+---+
    

    And again, you don't need all these intermediate values. I was just being verbose to make the process easier to trace.
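
    For reference, the same three steps can be chained into one expression. This is only a sketch: withDeactivated is a hypothetical helper name, the df2 rows are assumed sample data (the post elides them), and the local SparkSession is only needed outside spark-shell.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.lit

    // Assumption: a local session for experimenting; spark-shell provides `spark` already.
    val spark = SparkSession.builder().master("local[*]").appName("deactivate").getOrCreate()
    import spark.implicits._

    // Hypothetical helper: join, diff and union from above as a single chain.
    // Expects df1 with columns c1..c4 and df2 with columns d1..d4.
    def withDeactivated(df1: DataFrame, df2: DataFrame): DataFrame =
      df1.union(
        df1.join(df2, df1("c1") === df2("d1"), "inner")
          .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
          .dropDuplicates()
          .except(df1)
          .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
      )

    val df1 = Seq(
      ("a1", 10, "ACTIVE", "ds1"),
      ("a1", 20, "ACTIVE", "ds1"),
      ("a2", 50, "ACTIVE", "ds1"),
      ("a3", 60, "ACTIVE", "ds1")
    ).toDF("c1", "c2", "c3", "c4")

    // Hypothetical df2 rows, chosen to be consistent with the result table above.
    val df2 = Seq(
      ("a1", 10, "ACTIVE", "ds2"),
      ("a1", 20, "ACTIVE", "ds2"),
      ("a1", 30, "ACTIVE", "ds2"),
      ("a1", 40, "ACTIVE", "ds2"),
      ("a4", 20, "ACTIVE", "ds2")
    ).toDF("d1", "d2", "d3", "d4")

    val result = withDeactivated(df1, df2)
    result.show()
    ```

    Note that union, like except, lines columns up by position, so the result takes its column names (c1..c4) from df1. The row order printed by show is not guaranteed.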
