Why does reading a CSV file with empty values lead to an IndexOutOfBoundsException?

無奈伤痛 2021-01-19 19:10

I have a CSV file with the following structure:

Name  | Val1 | Val2 | Val3 | Val4 | Val5
John  | 1    | 2    |      |      |
Joe   | 1    | 2    |      |      |
David | 1    | 2    |      | 10   | 11


        
4 Answers
  • 2021-01-19 19:49

    Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):

    David,1,2,,10,11
    

    The problem is that your CSV file contains 6 columns, yet with:

    val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
    

    You try to read 7 columns, and p(6) is out of bounds. Just change your mapping to:

    val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
    

    And Spark will take care of the rest.
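
    For completeness, a minimal end-to-end sketch of that approach (assuming a Spark 1.x SparkContext sc and SQLContext sqlContext, a hypothetical file path "people.csv", and column names taken from your sample data):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructField, StructType, StringType}

    // Split with limit -1 so trailing empty fields are kept and
    // every row really has 6 elements.
    val fileRDD = sc.textFile("people.csv").map(_.split(",", -1))
    val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))

    // All columns as nullable strings; adjust the types as needed.
    val schema = StructType(
      Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
        .map(name => StructField(name, StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rowRDD, schema)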

  • 2021-01-19 19:51

    This is not an answer to your question, but it may help to solve your problem.

    From the question I see that you are trying to create a DataFrame from a CSV.

    Creating a DataFrame from a CSV can be done easily using the spark-csv package.

    With spark-csv, the below Scala code can be used to read a CSV file:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load(csvFilePath)

    For your sample data I got the following result:

    +-----+----+----+----+----+----+
    | Name|Val1|Val2|Val3|Val4|Val5|
    +-----+----+----+----+----+----+
    | John|   1|   2|    |    |    |
    |  Joe|   1|   2|    |    |    |
    |David|   1|   2|    |  10|  11|
    +-----+----+----+----+----+----+
    

    You can also use inferSchema with the latest version. See this answer.
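
    A minimal sketch of what that looks like (assuming a spark-csv version that supports the inferSchema option):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true") // infer numeric types instead of all strings
      .load(csvFilePath)

    df.printSchema()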

  • 2021-01-19 19:54

    A possible solution to that problem is to replace the missing values with Double.NaN. Suppose I have a file example.csv with this row in it:

    David,1,2,,10,11

    You can read the CSV file as a text file as follows:

    val fileRDD = sc.textFile("example.csv").map { line =>
      val fields = line.split(",", -1) // -1 keeps trailing empty fields
      // Name stays a string (toDouble would fail on "David"); empty numeric fields become NaN
      fields.head +: fields.tail.map(k => if (k == "") Double.NaN else k.toDouble)
    }
    

    And then you can use your code to create a DataFrame from it.
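
    A minimal sketch of that last step (assuming the fileRDD from the snippet above, the question's 6-column layout, and the sqlContext from the other answers; Name is kept as a string and the value columns as doubles):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType}

    val rowRDD = fileRDD.map(fields => Row.fromSeq(fields))
    val schema = StructType(
      StructField("Name", StringType, nullable = true) +:
        Seq("Val1", "Val2", "Val3", "Val4", "Val5")
          .map(n => StructField(n, DoubleType, nullable = true)))
    val df = sqlContext.createDataFrame(rowRDD, schema)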

  • 2021-01-19 20:01

    You can do it as follows.

    val rowRDD = sc
      .textFile(csvFilePath)
      .map(_.split(delimiter_of_file, -1))
      .map(p =>
        Row(
          p(0),
          p(1),
          p(2),
          p(3),
          p(4),
          p(5)))
    

    Split using the delimiter of your file. When you pass -1 as the limit, split keeps trailing empty fields instead of dropping them, so every row yields the full number of columns.
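
    A quick illustration of the difference in a plain Scala REPL (a hypothetical row from the sample data, with Val3 to Val5 empty):

    "John,1,2,,,".split(",")     // Array(John, 1, 2): trailing empty fields dropped,
                                 // so p(5) throws ArrayIndexOutOfBoundsException
    "John,1,2,,,".split(",", -1) // Array(John, 1, 2, "", "", ""): all 6 fields kept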
