I have a CSV file with the following structure:
Name  | Val1 | Val2 | Val3 | Val4 | Val5
John    1      2
Joe     1      2
David   1      2      10     11
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):
David,1,2,10,,11
The problem is that your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
you try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
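For completeness, here is a minimal sketch of the full flow, assuming fileRDD already holds the split lines and a sqlContext is available; the schema below is an assumption based on the sample header, reading every column as a string:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Assumed 6-column schema matching the sample header; all fields kept as strings
val schema = StructType(
  Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
    .map(name => StructField(name, StringType, nullable = true)))

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
val df = sqlContext.createDataFrame(rowRDD, schema)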
This is not a direct answer to your question, but it may help you solve your problem.
From the question I see that you are trying to create a DataFrame from a CSV.
Creating a DataFrame from a CSV can be done easily using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csvFilePath)
For your sample data, I got the following result:
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use the inferSchema option with the latest version. See this answer.
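For example (a sketch assuming a spark-csv version where the inferSchema option is available):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true") // let spark-csv infer the column types
  .load(csvFilePath)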
A possible solution to that problem is to replace the missing values with Double.NaN. Suppose I have a file example.csv with the following row in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",", -1) // -1 keeps empty fields
  // keep the name column as a string; replace empty value fields with Double.NaN
  y.head +: y.tail.map(k => if (k == "") Double.NaN else k.toDouble)
}
And then you can use your code to create a DataFrame from it.
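As a sketch, assuming fileRDD from the snippet above and a hand-written schema (an assumption, not part of the original answer) in which the name stays a string and the value columns are doubles:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType}

// Assumed schema: Name as a string, Val1..Val5 as doubles (NaN marks missing values)
val schema = StructType(
  StructField("Name", StringType, nullable = true) +:
    (1 to 5).map(i => StructField(s"Val$i", DoubleType, nullable = true)))

val df = sqlContext.createDataFrame(fileRDD.map(Row.fromSeq(_)), schema)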
You can do it as follows.
val rowRDD = sc.textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
Split using the delimiter of your file. When you set -1 as the limit, the split keeps all the empty fields. You can then turn rowRDD into a DataFrame with sqlContext.createDataFrame and a matching schema.
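A quick illustration of the difference, using a made-up row with trailing empty fields:
"John,1,2,,,".split(",")     // Array(John, 1, 2) -- trailing empty fields dropped
"John,1,2,,,".split(",", -1) // Array(John, 1, 2, "", "", "") -- all 6 fields kept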