Question
Why does Spark show nullable = true when no schema is specified and its inference is left to Spark?
// shows nullable = true even for fields which are present in all JSON records
spark.read.json("s3://s3path").printSchema()
Going through the class JsonInferSchema, I can see that for StructType, nullable is explicitly set to true, but I am unable to understand the reason behind it.
PS: My aim is to infer the schema for a large JSON data set (< 100GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job, as highlighted in the paper Schema Inference for Massive JSON Datasets. A major requirement is knowing which fields are optional and which are mandatory (with respect to the data set); one way to check this is sketched below.
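A minimal sketch (not from the original post; the path and app name are placeholders): after letting Spark infer the schema, count missing values per top-level field. Since a field that is absent from a JSON record surfaces as null in the DataFrame, a column with zero nulls is mandatory with respect to this data set, and anything else is optional.

// Sketch: classify top-level fields as mandatory/optional by null counts.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

object FieldPresenceCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("field-presence").getOrCreate()

    // Spark's inference marks every field nullable = true regardless of the data.
    val df = spark.read.json("s3://s3path")

    // Count nulls per top-level column; missing JSON fields also show up as null.
    val nullCounts = df.columns.map { c =>
      count(when(col(c).isNull, 1)).alias(c)
    }
    df.agg(nullCounts.head, nullCounts.tail: _*).show(truncate = false)

    spark.stop()
  }
}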
Answer 1:
Because Spark may only look at a sample of the data when inferring the schema, it cannot guarantee that a field is never null: the checking scope and sample size are limited. It is therefore safer to mark every field as nullable = true. That simple.
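A minimal sketch illustrating this point (paths, option values, and field names are assumptions, not from the answer): the JSON reader's samplingRatio option controls how much of the input is scanned for inference, which is one reason nullability cannot be guaranteed; supplying an explicit schema skips inference entirely, though Spark may still relax fields to nullable when reading from files.

// Sketch: sampled inference vs. an explicit, user-supplied schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-options").getOrCreate()

    // Infer from roughly 10% of the input JSON objects (faster, less complete).
    spark.read
      .option("samplingRatio", "0.1")
      .json("s3://s3path")
      .printSchema()

    // Skip inference by declaring the schema up front (hypothetical fields).
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))
    spark.read.schema(schema).json("s3://s3path").printSchema()

    spark.stop()
  }
}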
Source: https://stackoverflow.com/questions/61425977/why-spark-outputs-nullable-true-when-schema-inference-left-to-spark-in-case