I am trying to load a Parquet file in Spark as a DataFrame:
val df = spark.read.parquet(path)
I am getting:
org.apache.spark.SparkException: …
Take 1
SPARK-12854 (Vectorize Parquet reader) indicates that "ColumnarBatch supports structs and arrays" (cf. GitHub pull request 10820), starting with Spark 2.0.0.
And SPARK-13518 (Enable vectorized parquet reader by default), also starting with Spark 2.0.0, deals with the property spark.sql.parquet.enableVectorizedReader (cf. GitHub commit e809074).
My 2 cents: disable that "VectorizedReader" optimization and see what happens.
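A minimal sketch of that test, assuming Spark 2.x and an existing SparkSession named spark (path is the same placeholder as in the question):

    // Turn the vectorized Parquet reader off for this session only;
    // Spark then falls back to its row-based Parquet reader.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    val df = spark.read.parquet(path)

If the read succeeds with the flag off, the vectorized reader is the likely culprit.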
Take 2
Since the problem has been narrowed down to some empty files that do not have the same schema as the "real" files, my 3 cents: experiment with spark.sql.parquet.mergeSchema to see whether the schema from the real files takes precedence after merging.
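A hedged sketch of both ways to turn it on, via the per-read "mergeSchema" option or the session-wide property (path is again a placeholder):

    // Per read: merge the schemas found across all Parquet part files.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet(path)

    // Or session-wide:
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")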
Other than that, you might try to avoid producing the empty files at write time, with some kind of re-partitioning, e.g. coalesce(1) (OK, 1 is a bit extreme, but you see the point).
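A minimal sketch of that write-time workaround (df and outPath are placeholders, and the partition count is illustrative):

    // Collapse to a handful of partitions before writing, so that no
    // task ends up writing an empty part file with a divergent schema.
    df.coalesce(4)
      .write
      .mode("overwrite")
      .parquet(outPath)

coalesce is used here rather than repartition because it narrows the existing partitions without a full shuffle.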