val df1 = sc.parallelize(Seq(
  ("a1", 10, "ACTIVE", "ds1"),
  ("a1", 20, "ACTIVE", "ds1"),
  ("a2", 50, "ACTIVE", "ds1"),
  ("a3", 60, "ACTIVE", "ds1"))).toDF("c1", "c2", "c3", "c4")
First, a small thing. I use different names for the columns in df2:
val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
No big deal, but this made things easier for me to reason about.
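The ... above stands in for df2's rows from the question, which aren't repeated here. Purely to make the walkthrough concrete, here is a sketch of what they would be, inferred from the outputs shown further down in this thread (an assumption, not the question's exact definition):

val df2 = sc.parallelize(Seq(
  ("a1", 10, "ACTIVE", "ds2"),  // assumed rows, inferred from the df3 output in the next answer
  ("a1", 20, "ACTIVE", "ds2"),
  ("a1", 30, "ACTIVE", "ds2"),
  ("a1", 40, "ACTIVE", "ds2"))).toDF("d1", "d2", "d3", "d4")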
Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:
import org.apache.spark.sql.functions.lit

val join = df1
  .join(df2, df1("c1") === df2("d1"), "inner")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
  .dropDuplicates
Here I do the following:
- join df1 and df2 on the c1 and d1 columns
- select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2

This basically just filters out everything in df2 that does not have a corresponding key in c1 in df1.
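With the sample data (using the df2 sketch above), join at this point would contain the four df2 rows, re-tagged with ds1; shown here only to make the trace concrete:

+---+---+------+---+
| d1| d2|    d3| d4|
+---+---+------+---+
| a1| 10|ACTIVE|ds1|
| a1| 20|ACTIVE|ds1|
| a1| 30|ACTIVE|ds1|
| a1| 40|ACTIVE|ds1|
+---+---+------+---+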
Next I diff:
val diff = join
  .except(df1)
  .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
This is a basic set operation that finds everything in join that is not in df1. (Like union, except matches columns by position, not by name, which is why the d-named columns line up against df1's c-named ones.) These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
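With the same sample data, diff would hold exactly the two rows to deactivate, which reappear at the bottom of the final result:

+---+---+--------+---+
| d1| d2|      d3| d4|
+---+---+--------+---+
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+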
All that's left is to put them all together:
df1.union(diff)
This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+
And again, you don't need all these intermediate values; I was just verbose to help trace through the process.
Here is a dirty solution:
from pyspark.sql import functions as F
# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2, df1.c1 == df2.c1)\
         .select(df2.c1, df2.c2, df2.c3, df2.c4)\
         .dropDuplicates()
df3.show()
+---+---+------+---+
| c1| c2|    c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+
# union df3 with df1, then recompute c3 and c4: any row whose c4 is 'ds2'
# came only from df2, so it becomes INACTIVE and is re-tagged 'ds1'
# (note: dropDuplicates keeps an arbitrary row per (c1, c2) key, which is
# part of why this is dirty -- a key present in both frames could surface
# with c4 == 'ds2' and be wrongly marked INACTIVE)
df1.union(df3).dropDuplicates(['c1', 'c2'])\
   .select('c1', 'c2',
           F.when(F.col('c4') == 'ds2', 'INACTIVE').otherwise('ACTIVE').alias('c3'),
           F.lit('ds1').alias('c4')
   )\
   .orderBy('c1', 'c2')\
   .show()
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
+---+---+--------+---+
I enjoyed the challenge, and here is my solution.
val c1keys = df1.select("c1").distinct
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
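For tracing, df2inactive at this point would contain just the two rows to flag. Note they keep df2's original ds2 tag, which is why it shows up in the final output:

+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+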
scala> df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+