inferSchema in spark-csv package

陌清茗 2021-02-05 21:32

When CSV is read as dataframe in spark, all the columns are read as string. Is there any way to get the actual type of column?

I have the following csv file

1 Answer
  • 2021-02-05 22:31

    2015-07-30:

    The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.

    2015-08-17:

    The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:

    scala> df.printSchema
    root
     |-- Name: string (nullable = true)
     |-- Department: string (nullable = true)
     |-- years_of_experience: integer (nullable = true)
     |-- DOB: string (nullable = true)
    
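    For reference, the DataFrame above could be loaded with the spark-csv package along these lines; the file name dates.csv is an assumption, and the option names are those of spark-csv 1.2.0:

    // spark-csv 1.2.0 syntax (Spark 1.4+ DataFrameReader); "dates.csv" is hypothetical
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // first line holds the column names
      .option("inferSchema", "true")   // scan the data to infer column types
      .load("dates.csv")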

    Regarding automatic date parsing I doubt it will ever happen, or at least not without providing additional metadata.

    Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is either no automatic date inference or a spreadsheet-like mess, not to mention issues with timezones, for example.

    Finally, you can easily parse date strings manually:

    // the DataFrame must be registered as a temporary table first
    df.registerTempTable("df")

    sqlContext
      .sql("SELECT *, DATE(dob) AS dob_d FROM df")
      .drop("DOB")
      .printSchema
    
    root
     |-- Name: string (nullable = true)
     |-- Department: string (nullable = true)
     |-- years_of_experience: integer (nullable = true)
     |-- dob_d: date (nullable = true)
    

    So it is really not a serious issue.
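    If you prefer to stay in the DataFrame API rather than SQL, a roughly equivalent sketch (Spark 1.5+, assuming the same df as above) is:

    import org.apache.spark.sql.functions.to_date

    // cast the string column to DateType, then drop the original string column
    val withDate = df.withColumn("dob_d", to_date(df("DOB"))).drop("DOB")
    withDate.printSchema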

    2017-12-20:

    The built-in CSV parser, available since Spark 2.0, supports schema inference for dates and timestamps. It uses two options:

    • timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
    • dateFormat with default yyyy-MM-dd
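
    As a sketch of those options with the built-in reader (Spark 2.0+; the SparkSession value spark and the file name are assumptions):

    // Spark 2.0+ built-in CSV source; "dates.csv" is a hypothetical path
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dateFormat", "yyyy-MM-dd")  // explicit here, though this is the default
      .csv("dates.csv")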

    See also How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
