Question
Why does Spark show nullable = true when no schema is specified and its inference is left to Spark?
// shows nullable = true even for fields which are present in all JSON records
spark.read.json("s3://s3path").printSchema()
Going through the class JsonInferSchema, I can see that for StructType, nullable is explicitly set to true, but I am unable to understand the reason behind it.
PS: My aim is to infer the schema for a large JSON data set (< 100GB), and I wanted to see whether Spark provides this ability or whether I would have to write a custom map-reduce job, as highlighted in the paper Schema Inference for Massive JSON Datasets. A major requirement is knowing which fields are optional and which are mandatory (with respect to the data set); one way to check this is sketched below.
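A minimal sketch (not from the original post; the path and app name are placeholders): after letting Spark infer the schema, count missing values per top-level field. Since a field that is absent from a JSON record surfaces as null in the DataFrame, a column with zero nulls is mandatory with respect to this data set, and anything else is optional.

// Sketch: classify top-level fields as mandatory/optional by null counts.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

object FieldPresenceCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("field-presence").getOrCreate()

    // Spark's inference marks every field nullable = true regardless of the data.
    val df = spark.read.json("s3://s3path")

    // Count nulls per top-level column; missing JSON fields also show up as null.
    val nullCounts = df.columns.map { c =>
      count(when(col(c).isNull, 1)).alias(c)
    }
    df.agg(nullCounts.head, nullCounts.tail: _*).show(truncate = false)

    spark.stop()
  }
}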
Answer 1:
Because Spark may only look at a sample of the data when inferring the schema, it cannot guarantee that a field is never null: the checking scope and sample size are limited. It is therefore safer to mark every field as nullable = true. That simple.
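A minimal sketch illustrating this point (paths, option values, and field names are assumptions, not from the answer): the JSON reader's samplingRatio option controls how much of the input is scanned for inference, which is one reason nullability cannot be guaranteed; supplying an explicit schema skips inference entirely, though Spark may still relax fields to nullable when reading from files.

// Sketch: sampled inference vs. an explicit, user-supplied schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-options").getOrCreate()

    // Infer from roughly 10% of the input JSON objects (faster, less complete).
    spark.read
      .option("samplingRatio", "0.1")
      .json("s3://s3path")
      .printSchema()

    // Skip inference by declaring the schema up front (hypothetical fields).
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))
    spark.read.schema(schema).json("s3://s3path").printSchema()

    spark.stop()
  }
}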
Source: https://stackoverflow.com/questions/61425977/why-spark-outputs-nullable-true-when-schema-inference-left-to-spark-in-case