I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
    Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
    Row(id=2, title=[Row(value=u'horse bus', max_dist=50),
                     Row(value=u'normal bus', max_dist=100)]),
    Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
    Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),
                     Row(value=u'Motorbikes', max_dist=80)]),
    Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])

How can I explode the title column so that each element of the list gets its own row?
Ok, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects, because I couldn't find a way to append to a Row object. That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
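For what it's worth, a Row itself is immutable, but one way to get a new Row with an extra field is to round-trip through a dict with asDict(). A minimal sketch, assuming the field order doesn't matter (keyword-based Rows are sorted alphabetically in Spark 1.x):

from pyspark.sql import Row

old = Row(value=u'cars', max_dist=1000)
d = old.asDict()   # plain dict of the Row's fields
d['id'] = 1        # "append" the new column
new = Row(**d)     # fresh Row that now includes the id field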
def add_id(row):
    # row[0] is the id, row[1] is the list of inner Rows.
    it_list = []
    for i in range(0, len(row[1])):
        sm_list = []
        # Copy the fields of the inner Row into a plain list...
        for j in row[1][i]:
            sm_list.append(j)
        # ...then append the id so every element carries its own id.
        sm_list.append(row[0])
        it_list.append(sm_list)
    return it_list
with_id = documents.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
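On Spark 2.x, where DataFrame no longer exposes flatMap directly, the same pipeline would go through the underlying RDD; a sketch of the equivalent:

with_id = documents.rdd.flatMap(add_id)
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()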
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+
I am using the Spark Dataset API (Java), and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID","explode(finished_chunk) as chunks");
Note: the explode method of the Dataset API is deprecated (since Spark 2.0), and the documentation suggests using select with the explode function (shown above) or flatMap instead.
Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title"))
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
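If you also want the struct's fields as top-level columns, one more step works as a sketch (the flat output shape is an assumption, not part of the question):

from pyspark.sql.functions import explode

exploded = documents.withColumn("title", explode("title"))
# Dotted paths pull individual fields out of the struct column,
# giving the columns id, value, max_dist.
flat = exploded.select("id", "title.value", "title.max_dist")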