How to infer schema of JSON files?

Submitted by 不问归期 on 2019-12-19 12:08:45

Question


I have the following String in Java

{
    "header": {
        "gtfs_realtime_version": "1.0",
        "incrementality": 0,
        "timestamp": 1528460625,
        "user-data": "metra"
    },
    "entity": [{
            "id": "8424",
            "vehicle": {
                "trip": {
                    "trip_id": "UP-N_UN314_V1_D",
                    "route_id": "UP-N",
                    "start_time": "06:17:00",
                    "start_date": "20180608",
                    "schedule_relationship": 0
                },
                "vehicle": {
                    "id": "8424",
                    "label": "314"
                },
                "position": {
                    "latitude": 42.10085,
                    "longitude": -87.72896
                },
                "current_status": 2,
                "timestamp": 1528460601
            }
        }
    ]
}

that represents a JSON document. I want to infer its schema for a Spark DataFrame in a streaming application.

How can I split the String into its fields, the way I can with a CSV document (where I can call .split(""))?


Answer 1:


Quoting the official documentation Schema inference and partition of streaming DataFrames/Datasets:

By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.

You could then use the spark.sql.streaming.schemaInference configuration property to enable schema inference, although I'm not sure whether that works for JSON files.
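
For completeness, re-enabling schema inference for a file-based streaming source would look roughly like this (a minimal sketch; the input directory is only a placeholder, not something from the question):

// Sketch only: re-enable schema inference for file-based streaming sources
spark.conf.set("spark.sql.streaming.schemaInference", "true")

val inferred = spark
  .readStream
  .option("multiLine", true)
  .json("path/to/json-dir") // placeholder directory; schema is inferred from the files present at start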

What I usually do is load a single file in a batch query, before starting the streaming query, to infer the schema. That should work in your case. Just do the following.

// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick
  .json("sample.json")
  .schema
scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)

The trick is to use the multiLine option so that the entire file is read as a single row, which is then used to infer the schema.
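
With the schema inferred, you can pass it to the streaming reader. A sketch of how that might look (the input directory is a placeholder):

// Sketch: reuse the batch-inferred schema for the streaming query
val vehicles = spark
  .readStream
  .schema(jsonSchema)          // schema inferred above from sample.json
  .option("multiLine", true)   // each file holds one multi-line JSON document
  .json("path/to/json-dir")    // placeholder input directory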



Source: https://stackoverflow.com/questions/50760682/how-to-infer-schema-of-json-files
