I'm trying to write a Parquet file out to Amazon S3 using Spark 1.6.1. The small Parquet file that I'm generating is
One of the immediate approaches to speed up Spark writes to S3 is to use the EMRFS S3-optimized committer. However, if you use s3a, this committer cannot be used:
When the EMRFS S3-optimized Committer is Not Used
The committer is not used under the following circumstances:
- When writing to HDFS
- When using the S3A file system
- When using an output format other than Parquet, such as ORC or text
- When using MapReduce or Spark's RDD API
I've tested this difference on AWS EMR 5.26, and using s3:// was 15%-30% faster than s3a:// (but still slow).
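For reference, the comparison above amounts to nothing more than changing the URI scheme of the output path. A minimal sketch, assuming an existing DataFrame df and a placeholder bucket name:

// EMRFS path: on EMR this goes through EMRFS and (on a recent EMR release) is eligible for the S3-optimized committer
df.write.mode("overwrite").parquet("s3://my-bucket/benchmark/parquet/")

// S3A path: the same data written through the Hadoop S3A client, so the EMRFS committer does not apply
df.write.mode("overwrite").parquet("s3a://my-bucket/benchmark/parquet/")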
The fastest way I've managed to accomplish such a copy/write was to write Parquet to the cluster's local HDFS and then use s3distcp to copy to S3; in one specific scenario (a few hundred small files) this was about 5x faster than writing a DataFrame as Parquet directly to S3.
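A rough sketch of that two-step approach, assuming an EMR cluster (the HDFS staging path and the bucket are placeholders; s3-dist-cp is the S3DistCp command that ships with EMR):

// Step 1: write the Parquet files to the cluster's local HDFS first
df.write.mode("overwrite").parquet("hdfs:///tmp/parquet-staging/")

// Step 2: copy the finished files to S3 in bulk, e.g. from the master node:
//   s3-dist-cp --src hdfs:///tmp/parquet-staging/ --dest s3://my-bucket/output/parquet/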
Spark defaults cause a large amount of (probably) unnecessary overhead during I/O operations, especially when writing to S3. This article discusses it more thoroughly, but there are two settings you'll want to consider changing.
Using the DirectParquetOutputCommitter. By default, Spark will save all of the data to a temporary folder and then move those files afterwards. Using the DirectParquetOutputCommitter saves time by writing directly to the S3 output path.
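If memory serves, on Spark 1.6.x this is configured roughly as follows; the property key is spark.sql.parquet.output.committer.class, but the committer's package moved between 1.x releases and the class was removed entirely in Spark 2.0, so treat this as a sketch and check the source for your exact version:

// Spark 1.6.x only: route Parquet writes through the direct committer so output
// goes straight to the final S3 path instead of a temporary folder plus rename.
// (In older 1.x releases the class lived under org.apache.spark.sql.parquet.)
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")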
- Switch your code to using s3a and Hadoop 2.7.2+; it's better all round, gets better in Hadoop 2.8, and is the basis for S3Guard
- Use the Hadoop FileOutputCommitter and set mapreduce.fileoutputcommitter.algorithm.version to 2
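In Scala that combination looks roughly like the following; it's the same algorithm-version switch shown in PySpark at the end of this thread, and the bucket path is a placeholder:

// v2 commits each task's output directly into the final output directory, avoiding
// the slow, single-threaded job-level rename that v1 performs on an object store.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// then write through the S3A client (Hadoop 2.7.2+ on the classpath)
df.write.parquet("s3a://my-bucket/output/parquet/")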
Turn off schema merging. (Note that schema merging is turned off by default as of Spark 1.5.) If schema merging is on, the driver node will scan all of the files to ensure a consistent schema. This is especially costly because it is not a distributed operation. Make sure it is turned off by doing:
val file = sqlContext.read.option("mergeSchema", "false").parquet(path)
The direct output committer is gone from the Spark codebase; you would have to write your own or resurrect the deleted code in your own JAR. If you do so, turn speculation off in your work, and know that other failures can cause problems too, where "problem" means "invalid data".
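Turning speculation off is just an ordinary Spark configuration flag; a minimal sketch (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Speculative task attempts can leave duplicate or partial files behind when a
// direct committer writes straight to the final S3 location, so disable them.
val conf = new SparkConf()
  .setAppName("parquet-to-s3")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)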
On a brighter note, Hadoop 2.8 is going to add some S3A speedups specifically for reading optimised binary formats (ORC, Parquet) off S3; see HADOOP-11694 for details. And some people are working on using Amazon DynamoDB for the consistent metadata store, which should be able to do a robust O(1) commit at the end of work.
I also had this issue. In addition to what the others have said, here is a complete explanation from AWS: https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
During my experiments, just changing to FileOutputCommitter v2 (from v1) improved the write 3-4x.
self.sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")