Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

忘掉有多难 2020-12-08 12:46

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/
3 Answers
  • 2020-12-08 13:10

    A solution using union

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']
    
    for i, file in enumerate(files):
        act_df = spark.read.parquet(file)   
        if i == 0:
            df = act_df
        else:
            df = df.union(act_df)
    

An advantage is that it works regardless of any naming pattern in the file paths.
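As written, the loop above still fails if one of the files does not exist. A minimal sketch of skipping unreadable paths before unioning (the `read` parameter is a hypothetical stand-in for `spark.read.parquet`, so the skip-and-union logic can be shown without a live Spark session; with Spark you would catch `pyspark.sql.utils.AnalysisException` rather than a bare `Exception`):

```python
from functools import reduce

def union_existing(read, paths):
    """Union dataframes read from `paths`, skipping any path that fails to read.

    `read` is injected (e.g. spark.read.parquet) rather than assumed,
    so this sketch stays independent of a running Spark session.
    """
    frames = []
    for path in paths:
        try:
            frames.append(read(path))
        except Exception:  # assumption: narrow this to AnalysisException with Spark
            continue
    if not frames:
        raise ValueError("none of the paths could be read")
    return reduce(lambda a, b: a.union(b), frames)
```

Using `functools.reduce` also avoids the `if i == 0` special case from the loop above.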

  • 2020-12-08 13:14

Yes, it's possible if you change the method of specifying the input to a Hadoop glob pattern, for example:

    files = 's3a://dev/2017/01/{02,03}/data.parquet'
    df = session.read.parquet(files)
    

You can read more about these patterns in the Hadoop javadoc.

But in my opinion this isn't an elegant way of working with data partitioned by time (by day, in your case). If you are able to rename the directories like this:

    • s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
    • s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet

then you can take advantage of Spark's partitioning scheme and read the data with:

from pyspark.sql.functions import col

session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))
    

This way also omits empty/non-existent directories. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record came from.
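The rename itself is a mechanical path rewrite from the old layout to the partitioned one; a minimal sketch of that mapping (the helper name is my own, and actually moving the S3 objects is left to your tooling):

```python
import re

def to_day_partition(path):
    """Rewrite .../YYYY/MM/DD/file into the .../day=YYYY-MM-DD/file layout.

    Hypothetical helper: computes the new key only; it does not touch S3.
    """
    return re.sub(r'/(\d{4})/(\d{2})/(\d{2})/', r'/day=\1-\2-\3/', path)
```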

  • 2020-12-08 13:20

Can I observe that, as glob-pattern matching involves a full recursive tree walk and pattern match over the paths, it is an absolute performance killer against object stores, especially S3. There's a special shortcut in Spark to recognise when your path doesn't contain any glob characters, in which case it makes a more efficient choice.
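That fast path amounts to checking whether the path contains any glob metacharacters at all; a hypothetical sketch of such a test (the function and constant names are my own):

```python
# Hadoop glob metacharacters; a path containing none of these can be
# opened directly, skipping the recursive pattern-matching tree walk.
GLOB_CHARS = set('{}[]*?')

def needs_glob_expansion(path):
    """Hypothetical check mirroring the no-glob fast path described above."""
    return any(c in GLOB_CHARS for c in path)
```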

Similarly, a very deep partitioning tree, as in that year/month/day layout, means many directories are scanned, at a cost of hundreds of milliseconds (or worse) per directory.

The layout suggested by Mariusz should be much more efficient, as it is a flatter directory tree; switching to it should have a bigger impact on performance on object stores than on real filesystems.
