How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

前端未结

关注

 2  1353

I have two problems in my intended solution:

1. My S3 store structure is as following:

mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  闹比i        
                
              
                            
                2021-01-24 18:39
              
            
            
                                                                       
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.

You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤街浪徒        
                
              
                            
                2021-01-24 18:47
              
            
            
                                                                       
Crawler will not take compressed and uncompressed data together , so it will not work out of box. 
It is better to write spark job in glue and use spark.read() 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复