可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

When CSV is read as dataframe in spark, all the columns are read as string. Is there any way to get the actual type of column?

I have the following csv file

Name,Department,years_of_experience,DOB Sam,Software,5,1990-10-10 Alex,Data Analytics,3,1992-10-10

I've read the CSV using the below code

val df = sqlContext.                   read.                   format("com.databricks.spark.csv").                   option("header", "true").                   option("inferSchema", "true").                   load(sampleAdDataS3Location) df.schema

All the columns are read as string. I expect the column years_of_experience to be read as int and DOB to be read as date

Please note that I've set the option inferSchema to true.

I am using the latest version (1.0.3) of spark-csv package

Am I missing something here?

回答1:

2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06) and schema inference works as expected:

scala> df.printSchema root  |-- Name: string (nullable = true)  |-- Department: string (nullable = true)  |-- years_of_experience: integer (nullable = true)  |-- DOB: string (nullable = true)

Regarding automatic date parsing I doubt it will ever happen, or at least not without providing additional metadata.

Even if all fields follow some date-like format it is impossible to say if a given field should be interpreted as a date. So it is either lack of out automatic date inference or spreadsheet like mess. Not to mention issues with timezones for example.

Finally you can easily parse date string manually:

sqlContext   .sql("SELECT *, DATE(dob) as dob_d  FROM df")   .drop("DOB")   .printSchema  root  |-- Name: string (nullable = true)  |-- Department: string (nullable = true)  |-- years_of_experience: integer (nullable = true)  |-- dob_d: date (nullable = true)

so it is really not a serious issue.

2017-12-20:

Built-in csv parser available since Spark 2.0 supports schema inference for dates and timestamp - it uses two options:

timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
dateFormat with default yyyy-MM-dd

文章来源: inferSchema in spark-csv package

标签

dob