How to join two DataFrames and change column for missing values?

余生分开走 2021-01-26 08:46
val df1 = sc.parallelize(Seq(
   ("a1",10,"ACTIVE","ds1"),
   ("a1",20,"ACTIVE","ds1"),
   ("a2",50,"ACTIVE","ds1"),
   ("a3",60,"ACTIVE","ds1")
)).toDF("c1","c2","c3","c4")


        
3 Answers
  • 2021-01-26 09:02

    First, a small thing. I use different names for the columns in df2:

    val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
    

    No big deal, but this made things easier for me to reason about.

    Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:

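    import org.apache.spark.sql.functions.lit
    import spark.implicits._  // for the $"col" syntax; assumes a SparkSession named spark

    val join = df1
      .join(df2, df1("c1") === df2("d1"), "inner")
      .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
      .dropDuplicates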
    

    Here I do the following:

    • Inner join between df1 and df2 on the c1 and d1 columns
    • Select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2
    • Drop duplicates

    This keeps only the df2 rows whose d1 key also appears in the c1 column of df1, and relabels their source as ds1 so they can be compared row by row with df1.
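To check the logic of this step without a Spark session, here is a minimal plain-Python sketch in which tuples stand in for DataFrame rows. It is a model of the semantics, not Spark code. The function name `join_and_relabel` is hypothetical, and the df2 contents are an assumption: the a1 rows follow the df3 output shown later in this thread, while the a4 row is invented to show a key with no match.

```python
# Plain-Python model of the join step; tuples stand in for DataFrame rows.
df1 = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a2", 50, "ACTIVE", "ds1"),
    ("a3", 60, "ACTIVE", "ds1"),
]
# Assumed df2 contents (the a4 row is hypothetical, see the note above).
df2 = [
    ("a1", 10, "ACTIVE", "ds2"),
    ("a1", 20, "ACTIVE", "ds2"),
    ("a1", 30, "ACTIVE", "ds2"),
    ("a1", 40, "ACTIVE", "ds2"),
    ("a4", 20, "ACTIVE", "ds2"),
]


def join_and_relabel(left, right):
    """Keep rows of `right` whose key (column 0) occurs in `left`,
    overwrite the last column with "ds1", and drop duplicates."""
    left_keys = {row[0] for row in left}
    # A set comprehension dedupes, which matches inner join + dropDuplicates.
    relabeled = {
        (key, value, status, "ds1")
        for (key, value, status, _source) in right
        if key in left_keys
    }
    return sorted(relabeled)


joined = join_and_relabel(df1, df2)
for row in joined:
    print(row)
```

Under these assumptions only the four a1 rows survive, and the a4 row is dropped because df1 has no a4 key.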

    Next I diff:

    val diff = join
      .except(df1)
      .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
    

    This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
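The set difference can be modelled the same way. This is a plain-Python sketch rather than Spark code; `except_rows` is a hypothetical helper, and `joined` holds the rows the join step would produce if df2 contained a1 rows with values 10, 20, 30 and 40 (an assumption based on the outputs shown in this thread).

```python
# Plain-Python model of `except` followed by the INACTIVE relabeling.
df1 = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a2", 50, "ACTIVE", "ds1"),
    ("a3", 60, "ACTIVE", "ds1"),
]
# Assumed result of the join step (df2 rows with a key in df1, relabeled ds1).
joined = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a1", 30, "ACTIVE", "ds1"),
    ("a1", 40, "ACTIVE", "ds1"),
]


def except_rows(left, right):
    """Distinct rows of `left` that do not appear in `right`.
    Rows are compared as whole tuples, column by column."""
    return sorted(set(left) - set(right))


# Replace the third column with INACTIVE for every row that is left over.
diff = [
    (key, value, "INACTIVE", source)
    for (key, value, _status, source) in except_rows(joined, df1)
]
for row in diff:
    print(row)
```

The comparison is on whole rows, which is why the join step rewrote the last column to ds1. With ds2 left in place, no row of the join result would equal a row of df1, and nothing would be subtracted.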

    All that's left is to put them all together:

    df1.union(diff)
    

    This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:

    +---+---+--------+---+
    | c1| c2|      c3| c4|
    +---+---+--------+---+
    | a1| 10|  ACTIVE|ds1|
    | a1| 20|  ACTIVE|ds1|
    | a2| 50|  ACTIVE|ds1|
    | a3| 60|  ACTIVE|ds1|
    | a1| 30|INACTIVE|ds1|
    | a1| 40|INACTIVE|ds1|
    +---+---+--------+---+
    

    And again, you don't need all these intermediate values. I was just being verbose to make the process easier to trace.
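For illustration, the three steps collapse into a single function once the intermediate values are dropped. This is again a plain-Python model of the logic, not Spark code. The name `merge_with_inactive` is hypothetical, and the df2 rows are assumed sample data (the a4 row in particular is invented to show a non-matching key).

```python
# Plain-Python model of the whole pipeline: join, except, union.
df1 = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a2", 50, "ACTIVE", "ds1"),
    ("a3", 60, "ACTIVE", "ds1"),
]
# Assumed df2 contents; the a4 row is hypothetical.
df2 = [
    ("a1", 10, "ACTIVE", "ds2"),
    ("a1", 20, "ACTIVE", "ds2"),
    ("a1", 30, "ACTIVE", "ds2"),
    ("a1", 40, "ACTIVE", "ds2"),
    ("a4", 20, "ACTIVE", "ds2"),
]


def merge_with_inactive(left, right):
    """Return `left` plus the rows of `right` that share a key with `left`
    but are missing from it, marked INACTIVE with source "ds1"."""
    keys = {row[0] for row in left}
    existing = set(left)
    # Join step: keep matching keys and relabel the source column.
    candidates = {(k, v, s, "ds1") for (k, v, s, _src) in right if k in keys}
    # Except step, then overwrite the status column.
    inactive = sorted(
        (k, v, "INACTIVE", src) for (k, v, _s, src) in candidates - existing
    )
    # Union step: original rows first, deactivated rows after.
    return list(left) + inactive


result = merge_with_inactive(df1, df2)
for row in result:
    print(row)
```

With the assumed df2, the result has the same six rows as the table above: the four original rows followed by the two INACTIVE a1 rows.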

  • 2021-01-26 09:10

    Here is a quick and dirty solution:

    from pyspark.sql import functions as F
    
    
    # find the rows from df2 that have a matching key c1 in df1
    df3 = df1.join(df2, df1.c1 == df2.c1)\
    .select(df2.c1, df2.c2, df2.c3, df2.c5.alias('c4'))\
    .dropDuplicates()
    
    df3.show()
    

    Output:

    +---+---+------+---+
    | c1| c2|    c3| c4|
    +---+---+------+---+
    | a1| 10|ACTIVE|ds2|
    | a1| 20|ACTIVE|ds2|
    | a1| 30|ACTIVE|ds2|
    | a1| 40|ACTIVE|ds2|
    +---+---+------+---+
    

    Code:

    # Union df3 with df1, keep one row per (c1, c2), and mark rows that came from ds2 as INACTIVE.
    # Caveat: dropDuplicates on a subset of columns does not guarantee which duplicate survives.
    df1.union(df3).dropDuplicates(['c1', 'c2'])\
    .select('c1', 'c2',
            F.when(F.col('c4') == 'ds2', 'INACTIVE').otherwise('ACTIVE').alias('c3'),
            F.lit('ds1').alias('c4'))\
    .orderBy('c1', 'c2')\
    .show()
    

    Output:

    +---+---+--------+---+
    | c1| c2|      c3| c4|
    +---+---+--------+---+
    | a1| 10|  ACTIVE|ds1|
    | a1| 20|  ACTIVE|ds1|
    | a1| 30|INACTIVE|ds1|
    | a1| 40|INACTIVE|ds1|
    | a2| 50|  ACTIVE|ds1|
    | a3| 60|  ACTIVE|ds1|
    +---+---+--------+---+
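One thing to watch in this approach: `dropDuplicates(['c1','c2'])` keeps a single row per key, and as far as I know Spark does not promise which one. If the ds2 copy of (a1, 10) survived instead of the ds1 copy, that row would be marked INACTIVE. The plain-Python sketch below models a variant where the preference is explicit, so rows from df1 always win. It is not Spark code; `union_prefer_first` is a hypothetical name, and df3 is taken from the output shown above.

```python
# Plain-Python model of "union, then dedupe on (c1, c2), preferring df1".
df1 = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a2", 50, "ACTIVE", "ds1"),
    ("a3", 60, "ACTIVE", "ds1"),
]
# df3 as printed above: df2 rows whose key c1 also exists in df1.
df3 = [
    ("a1", 10, "ACTIVE", "ds2"),
    ("a1", 20, "ACTIVE", "ds2"),
    ("a1", 30, "ACTIVE", "ds2"),
    ("a1", 40, "ACTIVE", "ds2"),
]


def union_prefer_first(first, second):
    """Union keyed on (c1, c2). Rows from `first` are kept unchanged;
    rows found only in `second` become INACTIVE with source "ds1"."""
    by_key = {}
    for key, value, _status, _source in second:
        by_key[(key, value)] = (key, value, "INACTIVE", "ds1")
    # Written last, so a row from `first` overwrites any duplicate key.
    for key, value, status, source in first:
        by_key[(key, value)] = (key, value, status, source)
    return sorted(by_key.values())


result = union_prefer_first(df1, df3)
for row in result:
    print(row)
```

In Spark, one way to get the same determinism would be to subtract the df1 keys from df3 before the union (for example with a left anti join), so that no duplicate keys reach the dedupe step at all.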
    
  • 2021-01-26 09:12

    Enjoyed the challenge and here is my solution.

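    // distinct keys present in df1
    val c1keys = df1.select("c1").distinct
    // df2 rows whose key also exists in df1
    val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
    // of those, the rows with no (c1, c2) match in df1, marked INACTIVE (needs functions.lit)
    val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
    scala> df1.union(df2inactive).show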
    +---+---+--------+---+
    | c1| c2|      c3| c4|
    +---+---+--------+---+
    | a1| 10|  ACTIVE|ds1|
    | a1| 20|  ACTIVE|ds1|
    | a2| 50|  ACTIVE|ds1|
    | a3| 60|  ACTIVE|ds1|
    | a1| 30|INACTIVE|ds2|
    | a1| 40|INACTIVE|ds2|
    +---+---+--------+---+
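The key idea here is the left anti join: it returns the rows of the left side that have no match on the right. A minimal plain-Python sketch of that logic follows. It is a model, not Spark code; `left_anti` is a hypothetical helper, and the df2 rows are assumed sample data (the a4 row is invented to show a non-matching key).

```python
# Plain-Python model of the semi-join plus left anti join approach.
df1 = [
    ("a1", 10, "ACTIVE", "ds1"),
    ("a1", 20, "ACTIVE", "ds1"),
    ("a2", 50, "ACTIVE", "ds1"),
    ("a3", 60, "ACTIVE", "ds1"),
]
# Assumed df2 contents; the a4 row is hypothetical.
df2 = [
    ("a1", 10, "ACTIVE", "ds2"),
    ("a1", 20, "ACTIVE", "ds2"),
    ("a1", 30, "ACTIVE", "ds2"),
    ("a1", 40, "ACTIVE", "ds2"),
    ("a4", 20, "ACTIVE", "ds2"),
]


def left_anti(left, right, key):
    """Rows of `left` whose key has no match among the keys of `right`."""
    right_keys = {key(row) for row in right}
    return [row for row in left if key(row) not in right_keys]


# Step 1: df2 rows whose c1 exists in df1.
df1_keys = {row[0] for row in df1}
df2_in_df1 = [row for row in df2 if row[0] in df1_keys]
# Step 2: of those, rows with no (c1, c2) match in df1.
unmatched = left_anti(df2_in_df1, df1, key=lambda row: (row[0], row[1]))
# Step 3: overwrite the status column only; the source column is untouched.
df2_inactive = [(k, v, "INACTIVE", src) for (k, v, _status, src) in unmatched]

result = df1 + df2_inactive
for row in result:
    print(row)
```

As in the output above, the deactivated rows keep ds2 in the last column, because only c3 is overwritten. Overwriting the last column as well would give ds1 there.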
    