pyspark/dataframe - creating a nested structure


I'm using PySpark with DataFrames and would like to create a nested structure as below.

Before:

Column 1 | Column 2 | Column 3 
--------------------------


                      
2 Answers

囚心锁ツ, 2021-01-22 06:46

First, a reproducible example of your DataFrame:

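from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()  # reuse the shell's context if one already exists
js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)  # the schema is inferred from the records
jsdf.show()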
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   B|   1|
|   A|   B|   2|
|   A|   C|   1|
+----+----+----+


Now, lists are not stored as key-value pairs. You can either use a dictionary or a simple collect_list() after a groupby on col1 and col2.

from pyspark.sql import functions as F
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
|   A|   C|               [1]|
|   A|   B|            [1, 2]|
+----+----+------------------+
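The "dictionary" alternative mentioned above is not shown, so here is a minimal sketch of how it might look, assuming a Spark map column is what is meant. The function name `nest_as_map`, its parameters, and the temporary `_values` column are hypothetical names chosen for illustration; `collect_list` and `create_map` are regular functions in `pyspark.sql.functions`.

```python
def nest_as_map(df, key_cols=("col1", "col2"), value_col="col3", out_col="col4"):
    """Group df by key_cols and wrap the collected values in a map.

    The map has a single entry per row: the key is the value of the last
    grouping column, the value is the list collected from value_col.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import functions as F

    grouped = df.groupBy(*key_cols).agg(
        F.collect_list(value_col).alias("_values")  # hypothetical helper column
    )
    # create_map builds a MapType column, so the key can come from the data
    # itself and does not have to be declared in the schema up front.
    return grouped.select(
        *key_cols,
        F.create_map(F.col(key_cols[-1]), F.col("_values")).alias(out_col),
    )
```

With the DataFrame from the example, `nest_as_map(jsdf).toJSON().collect()` should give one JSON string per (col1, col2) group, each with a `col4` object keyed by that row's col2 value. The order of the rows, and of the elements inside each list, is not guaranteed.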

                                                                        
                                                        
            
            
              
                
醉话见心, 2021-01-22 06:49

I don't think you can get that exact output, but you can come close. The problem is the key names inside Column 4: in Spark, a struct must have a fixed set of fields that is known in advance. But let's leave that for later. First, the aggregation:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session if there is one, otherwise start a new one.
spark = SparkSession.builder.getOrCreate()

data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']

data = spark.createDataFrame(data, columns)

data.createOrReplaceTempView("data")
data.show()

# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
|      A|      B|      1|
|      A|      B|      2|
|      A|      C|      1|
+-------+-------+-------+

nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()

# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
 '{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']


Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as the key of a struct unless you use a UDF (maybe with a PIVOT?):

datatype = 'struct<B:array<bigint>,C:array<bigint>>'  # Add any other potential keys here.
@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    return {column2_value: column4_value['data']}

# Overwrite Column4 (rather than adding a Column5) so the result matches the output shown.
nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()

# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
 '{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']


This of course has the drawback that the set of keys must be finite and known in advance; any Column2 value that is not listed in the struct type is silently dropped.
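If the keys can be enumerated before the UDF is defined, the type string does not have to be maintained by hand. The following is only a sketch under that assumption: `build_struct_ddl` is a hypothetical helper, not part of PySpark, and it assumes every key is a plain identifier (names with spaces or punctuation would need extra quoting that is not handled here).

```python
def build_struct_ddl(keys, element_type="bigint"):
    """Return a DDL type string with one array field per distinct key."""
    unique_keys = sorted(set(keys))  # drop duplicates, keep a stable order
    if not unique_keys:
        raise ValueError("at least one key is required")
    fields = ",".join(f"{key}:array<{element_type}>" for key in unique_keys)
    return f"struct<{fields}>"


# Same string as the hand-written datatype used for the UDF above.
datatype = build_struct_ddl(["B", "C", "B"])
print(datatype)  # struct<B:array<bigint>,C:array<bigint>>
```

The keys themselves could be gathered with something like `[r[0] for r in data.select("Column2").distinct().collect()]`, at the cost of an extra pass over the data before the UDF can be declared.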
                                                                        
                                                        
            
            
              
                