I am using Apache Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node, I have a directory data
Just to clarify for others that may come across this post.
I believe your confusion is due to not providing a protocol in the file location. When you run the following line:
# Create an RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data")  # Read data directory
Spark assumes the file path /root/data is in HDFS; in other words, it looks for the files at hdfs:///root/data.
You only need the files in one location: either locally on every node (not the most efficient in terms of storage) or in HDFS, which is distributed across the nodes.
If you wish to read files from the local filesystem, use file:///path/to/local/file. If you wish to use HDFS, use hdfs:///path/to/hdfs/file.
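For example, using the wholeTextFiles call from the question (a minimal PySpark sketch; it assumes /root/data actually exists in whichever location you point to):

# Read from the local filesystem -- /root/data must exist on every node that reads it
local_files = sc.wholeTextFiles("file:///root/data")

# Read from HDFS -- the path is resolved inside the cluster's distributed filesystem
hdfs_files = sc.wholeTextFiles("hdfs:///root/data")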
Hope this helps.
One quick suggestion is to load the CSV from S3 instead of keeping it in local storage.
Here is a sample Scala snippet that can be used to load a CSV file from an S3 bucket:
// S3 path with embedded credentials; replace the placeholders with your own values
val csvs3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"

// Read the CSV into a DataFrame using the spark-csv package
val dataframe = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csvs3Path)
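Since the question itself uses PySpark, a roughly equivalent load in Python would look like the following (a sketch assuming the same spark-csv package is available to the cluster and the same placeholder credentials and bucket name):

csv_s3_path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"

# Read the CSV into a DataFrame with the spark-csv package
dataframe = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load(csv_s3_path)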