Recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:
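A minimal sketch of the setup (assuming `mappedRdd` is an `RDD[String]` where each element is one JSON document; `mappedRdd` and `data` are the names the answer below refers to):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-dataset-test")
  .getOrCreate()

// An RDD of raw JSON strings, one document per element
val mappedRdd = spark.sparkContext.parallelize(Seq(
  """{"field": "value1"}""",
  """{"field": "value2"}"""
))

// Creating the Dataset already runs a Spark job, and show runs
// another -- the RDD ends up being scanned twice
val data = spark.read.json(mappedRdd)
data.show
```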
It happens because you don't provide a schema for the `DataFrameReader`. As a result, Spark has to eagerly scan the data set to infer the output schema.
Since `mappedRdd` is not cached, it will be evaluated twice:

- once for schema inference
- once when you call `data.show`
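You can see the double evaluation directly with an accumulator (a quick sketch, reusing the `mappedRdd` from the question):

```scala
// Count how many times elements of the RDD are actually computed
val evals = spark.sparkContext.longAccumulator("evals")
val counted = mappedRdd.map { s => evals.add(1); s }

spark.read.json(counted).show()

// evals.value exceeds the number of input records: the schema-inference
// pass already scanned the whole RDD before show ran its own job.
println(evals.value)
```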
If you want to prevent this, you should provide a schema for the reader (Scala syntax):
```scala
val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
```
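For example (a minimal sketch; the field name `field` and its type are assumptions that should match your actual JSON):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema for JSON documents like {"field": "value1"}
val schema = StructType(Seq(
  StructField("field", StringType, nullable = true)
))

// With an explicit schema there is no inference pass, so mappedRdd
// is evaluated only once, when show runs its job.
val data = spark.read.schema(schema).json(mappedRdd)
data.show
```

Caching `mappedRdd` would also avoid the second scan, but an explicit schema skips the inference job entirely.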