PySpark - Explode columns into rows based on the type of the column


情歌与酒 2021-01-27 06:12

Given a Dataframe:

+---+-----------+---------+-------+------------+
| id|      score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+---------


      
      
        
2 Answers

不知归路 (OP) 2021-01-27 06:41

Sample DataFrame:

df.show()
df.printSchema()

+---+-----------+---------+-------+------------+
| id|model_score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+------------+
|  1|        0.2|    23.78|   true| hello_world|
|  2|        0.6|    12.41|  false|byebye_world|
+---+-----------+---------+-------+------------+

root
 |-- id: integer (nullable = true)
 |-- model_score: double (nullable = true)
 |-- tx_amount: double (nullable = true)
 |-- isValid: boolean (nullable = true)
 |-- greeting: string (nullable = true)


I tried to keep this dynamic so that it works for any set of input columns. The column types are read from df.dtypes[1:]; the slice skips id, which stays as a key and is not unpivoted into col_value. A Spark array can only hold elements of a single type, so every column is cast to string before the arrays are built. I think this should work for your use case, and you can build your Y/N columns from the result.

from pyspark.sql import functions as F

# Cast every column to string, pack each non-id column into a
# [value, type, name] array, then explode so each array becomes its own row.
df.select([F.col(c).cast("string") for c in df.columns])\
        .withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(x[0],F.lit(x[1]),F.lit(x[0]))\
                                                    for x in df.dtypes[1:]]))))\
        .select("id", F.col("cols.*")).withColumn("col_value", F.element_at("0",1))\
                                      .withColumn("col_type", F.element_at("0",2))\
                                      .withColumn("col_name", F.element_at("0",3)).drop("0").show()

+---+------------+--------+-----------+
| id|   col_value|col_type|   col_name|
+---+------------+--------+-----------+
|  1|         0.2|  double|model_score|
|  1|       23.78|  double|  tx_amount|
|  1|        true| boolean|    isValid|
|  1| hello_world|  string|   greeting|
|  2|         0.6|  double|model_score|
|  2|       12.41|  double|  tx_amount|
|  2|       false| boolean|    isValid|
|  2|byebye_world|  string|   greeting|
+---+------------+--------+-----------+
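
To make the reshaping easier to follow, here is a minimal plain-Python sketch of the same idea, with no Spark involved. The names `melt_with_types`, `rows` and `dtypes` are hypothetical stand-ins chosen for illustration (`rows` plays the role of the DataFrame's rows and `dtypes` the role of df.dtypes); this is not part of the answer's code. It skips id by name rather than by position, and note that Python's str() renders booleans as "True"/"False", whereas Spark's cast to string gives "true"/"false".

```python
# Plain-Python sketch of the unpivot-with-types logic (illustration only).
# `rows` and `dtypes` are hypothetical stand-ins for the DataFrame rows and df.dtypes.

def melt_with_types(rows, dtypes, id_col="id"):
    """Return one record per (row, non-id column) with value, type and name."""
    out = []
    for row in rows:
        for name, dtype in dtypes:
            if name == id_col:
                continue  # the id column is kept as a key, not unpivoted
            out.append({
                id_col: row[id_col],
                # mixed-type values are stringified, mirroring cast("string")
                "col_value": str(row[name]),
                "col_type": dtype,
                "col_name": name,
            })
    return out


rows = [
    {"id": 1, "model_score": 0.2, "tx_amount": 23.78,
     "isValid": True, "greeting": "hello_world"},
    {"id": 2, "model_score": 0.6, "tx_amount": 12.41,
     "isValid": False, "greeting": "byebye_world"},
]
dtypes = [("id", "int"), ("model_score", "double"), ("tx_amount", "double"),
          ("isValid", "boolean"), ("greeting", "string")]

result = melt_with_types(rows, dtypes)
for rec in result:
    print(rec)
```

Each input row yields one output record per non-id column, which is why the two sample rows become eight rows in the table above.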

    
             
                                                        
            
            
              
                