It's CDH with Spark 1.6.
I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,201
It's not really elegant, but you can convert from timestamp to date like this (check the last line):
import org.apache.spark.sql.functions.expr

val textData = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd")
  .option("inferSchema", "true")
  .option("nullValue", "null")
  .load("test.csv")
  .withColumn("C4", expr("""to_date(C4)"""))  // cast the inferred timestamp down to a date
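You can verify that the cast took effect by printing the resulting schema:

// C4 should now show up as a date type.
textData.printSchema()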
With the inferSchema option, non-trivial cases will probably not return the expected result. As you can see in InferSchema.scala:
if (field == null || field.isEmpty || field == nullValue) {
  typeSoFar
} else {
  typeSoFar match {
    case NullType => tryParseInteger(field)
    case IntegerType => tryParseInteger(field)
    case LongType => tryParseLong(field)
    case DoubleType => tryParseDouble(field)
    case TimestampType => tryParseTimestamp(field)
    case BooleanType => tryParseBoolean(field)
    case StringType => StringType
    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }
}
It only tries to match each column against a timestamp type, never a date type, so an "out of the box" solution for this case is not possible. In my experience, the easier solution is to define the schema directly with the needed types (see the sketch below); it also prevents the infer option from settling on a type that only fits the rows evaluated during inference rather than the entire dataset. Your final schema is an efficient solution.
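For reference, a minimal sketch of such an explicit schema, assuming the five-column layout implied by C4 above (the column names and types here are illustrative; adapt them to your actual file):

import org.apache.spark.sql.types._

// Hypothetical schema: adjust names and types to match your data.
val customSchema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true),
  StructField("C2", StringType, nullable = true),
  StructField("C3", StringType, nullable = true),
  StructField("C4", DateType, nullable = true)))

val textData = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .option("dateFormat", "yyyy-MM-dd")
  .option("nullValue", "null")
  .schema(customSchema)  // explicit schema, so no inference pass over the data
  .load("test.csv")

With an explicit schema, spark-csv applies the dateFormat option when parsing DateType columns, so the to_date cast after loading is no longer needed.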