How to read multiple gzipped files from S3 into a single RDD?

时光说笑 2020-12-09 17:38

I have many gzipped files stored on S3, organized by project and by hour per day. The file paths follow this pattern:

s3:///proj         


        
3 Answers
  • 2020-12-09 17:57

    Note: Under Spark 1.2, the proper format would be as follows:

    val rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")
    

    That's s3n://, not s3://

    You'll also want to put your credentials in conf/spark-env.sh as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
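
    If you'd prefer not to edit conf/spark-env.sh, one rough alternative (a sketch only; the key values below are placeholders) is to set the same s3n credentials on the SparkContext's Hadoop configuration before reading:

    // Programmatic alternative to spark-env.sh: s3n credentials via the Hadoop configuration (placeholder values)
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
    val rdd = sc.textFile("s3n://<bucket>/<foo>/bar.*.gz")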

  • 2020-12-09 18:00

    The underlying Hadoop API that Spark uses to access S3 lets you specify input files using glob expressions.

    From the Spark docs:

    All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

    So in your case you should be able to open all those files as a single RDD using something like this:

    rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
    

    Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.

    For example:

    rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
    

    Briefly, what this does is:

    • The * matches any string, so in this case all of the gz files in all of the folders under 201412?? will be loaded.
    • The ? matches a single character, so 201412?? covers every day in December 2014: 20141201, 20141202, and so on.
    • The , lets you load separately specified files into the same RDD at the same time, like random-file.txt in this case.
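
    Putting those pieces together for a layout like the one in the question, a rough spark-shell sketch might look like this (the bucket, project, and log-type names are made-up placeholders, and the URL scheme depends on your setup; see the notes on that below):

    // Hypothetical example: load every gzipped log for December 2014 from two projects into one RDD
    val rdd = sc.textFile(
      "s3://bucket/project1/201412*/logtype1/*.gz," +
      "s3://bucket/project2/201412*/logtype1/*.gz")
    println(rdd.count())  // total number of lines across all matched files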

    Some notes about the appropriate URL scheme for S3 paths:

    • If you're running Spark on EMR, the correct URL scheme is s3://.
    • If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, s3a:// is the way to go.
    • s3n:// has been deprecated on the open source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
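
    For the open-source case, a minimal sketch of how that might look (assuming a Spark build on Hadoop 2.7+ with the hadoop-aws module and its AWS SDK dependency on the classpath; the key values are placeholders):

    // s3a credentials can be supplied through the Hadoop configuration (placeholder values)
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val rdd = sc.textFile("s3a://bucket/project1/20141201/logtype1/*.gz")
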
  • 2020-12-09 18:08

    Using AWS EMR with Spark 2.0.0 and SparkR in RStudio, I managed to read the gzip-compressed Wikipedia pagecount files stored in S3 with the following command:

    df <- read.text("s3://<bucket>/pagecounts-20110101-000000.gz")
    

    Similarly, to read all the files for January 2011 at once, you can use a wildcard pattern:

    df <- read.text("s3://<bucket>/pagecounts-201101??-*.gz")
    

    See the SparkR API docs for more options: https://spark.apache.org/docs/latest/api/R/read.text.html
