Why does SparkSession execute twice for one action?

南笙 2020-12-04 00:02

I recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:

1 Answer
  • 2020-12-04 00:33

    It happens because you don't provide a schema to the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.

    Since mappedRdd is not cached, it will be evaluated twice:

    • once for schema inference
    • once when you call data.show
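
A minimal sketch of how the two evaluations could be observed, assuming a local SparkSession. The original test case is not available here, so this `mappedRdd`, its sample records, and the `evaluations` accumulator are hypothetical stand-ins rather than the asker's code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("double-evaluation-sketch")
  .getOrCreate()

// Hypothetical counter: incremented every time the map function runs.
val evaluations = spark.sparkContext.longAccumulator("evaluations")

// Hypothetical stand-in for the question's mappedRdd (not cached).
val mappedRdd = spark.sparkContext
  .parallelize(Seq("""{"id": 1}""", """{"id": 2}"""), numSlices = 1)
  .map { json => evaluations.add(1); json }

// No schema is supplied, so the reader scans mappedRdd to infer one.
val data = spark.read.json(mappedRdd)
val afterInference = evaluations.sum
println(s"map calls after read.json: $afterInference")

// The action evaluates mappedRdd again.
data.show()
val afterShow = evaluations.sum
println(s"map calls after show: $afterShow")
```

Accumulator updates inside transformations are not guaranteed to be applied exactly once, so the counts are only indicative. Since the second pass is attributed to `mappedRdd` not being cached, calling `mappedRdd.cache()` before the read is another option that should avoid recomputing it, at the cost of storage.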

    If you want to prevent this, you should provide a schema to the reader (Scala syntax):

    val schema: org.apache.spark.sql.types.StructType = ???
    spark.read.schema(schema).json(mappedRdd)
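
As an illustration of what could replace the `???` placeholder, here is a sketch with an explicit schema. The field names (`id`, `name`), their types, and the sample records are assumptions made for the example; the real schema has to match the actual JSON.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("explicit-schema-sketch")
  .getOrCreate()

// Hypothetical records standing in for the real mappedRdd.
val mappedRdd = spark.sparkContext.parallelize(Seq(
  """{"id": 1, "name": "a"}""",
  """{"id": 2, "name": "b"}"""
))

// Hypothetical schema matching the records above.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With the schema supplied, the reader has no need for an inference pass,
// so mappedRdd is evaluated only when an action such as show runs.
val data = spark.read.schema(schema).json(mappedRdd)
data.show()
```

Fields that are missing from a record come back as null, so a schema that does not match the data tends to fail quietly rather than with an error.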
    