Question
I have the following String in Java:
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "incrementality": 0,
    "timestamp": 1528460625,
    "user-data": "metra"
  },
  "entity": [{
      "id": "8424",
      "vehicle": {
        "trip": {
          "trip_id": "UP-N_UN314_V1_D",
          "route_id": "UP-N",
          "start_time": "06:17:00",
          "start_date": "20180608",
          "schedule_relationship": 0
        },
        "vehicle": {
          "id": "8424",
          "label": "314"
        },
        "position": {
          "latitude": 42.10085,
          "longitude": -87.72896
        },
        "current_status": 2,
        "timestamp": 1528460601
      }
    }
  ]
}
that represents a JSON document. I want to infer its schema into a Spark DataFrame for a streaming application.
How can I split the fields of the String the way I would with a CSV document (where I can call .split(""))?
Answer 1:
Quoting the official documentation, section "Schema inference and partition of streaming DataFrames/Datasets":
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
So you can use the spark.sql.streaming.schemaInference configuration property to enable schema inference. I'm not sure whether that is going to work for JSON files.
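For reference, setting that property could look like the sketch below. It is untested against JSON input, so treat it as an assumption that the JSON source honours the setting the same way other file sources do. The input directory is a hypothetical placeholder, and with inference enabled the directory presumably has to contain at least one file already so that Spark has something to infer from.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSchemaInference {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("streaming-schema-inference")
      // Re-enable schema inference for file-based streaming sources
      .config("spark.sql.streaming.schemaInference", "true")
      .getOrCreate()

    // Hypothetical input directory; pass the real one as the first argument
    val inputDir = args.headOption.getOrElse("/tmp/json-input")

    // No .schema(...) call here: Spark is asked to infer the schema itself
    val stream = spark.readStream
      .option("multiLine", true) // each file holds one multi-line JSON document
      .json(inputDir)

    stream.printSchema()
    spark.stop()
  }
}
```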
What I usually do is load a single file in a batch query, before starting the streaming query, to infer the schema. That should work in your case. Just do the following:
// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick
  .json("sample.json")
  .schema
scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)
The trick is the multiLine option: with it, the entire file is read as a single record (one row), and the schema is inferred from that row. Without it, Spark expects one complete JSON document per line.
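Putting the pieces together, here is a minimal end-to-end sketch of the approach: infer the schema once in a batch read, hand it to the streaming reader, and then pull out nested fields with column paths instead of splitting the String by hand. The temporary directory, the trimmed sample document, the object name and the feed_timestamp alias are all my own placeholders, not something from the question, and the console sink is only there to make the example observable.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object InferSchemaThenStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("infer-schema-then-stream")
      .getOrCreate()

    // Hypothetical input directory, seeded with a trimmed copy of the question's document
    val inputDir = Files.createTempDirectory("json-input")
    val sample =
      """{
        |  "header": { "timestamp": 1528460625, "user-data": "metra" },
        |  "entity": [{
        |    "id": "8424",
        |    "vehicle": { "position": { "latitude": 42.10085, "longitude": -87.72896 } }
        |  }]
        |}""".stripMargin
    Files.write(inputDir.resolve("sample.json"), sample.getBytes(StandardCharsets.UTF_8))

    // Batch query: infer the schema once, before any streaming query starts
    val jsonSchema = spark.read
      .option("multiLine", true)
      .json(inputDir.toString)
      .schema

    // Streaming query: reuse the inferred schema instead of asking Spark to infer it
    val vehicles = spark.readStream
      .schema(jsonSchema)
      .option("multiLine", true)
      .json(inputDir.toString)
      // One output row per element of the entity array
      .select(col("header.timestamp").as("feed_timestamp"), explode(col("entity")).as("e"))
      // Nested fields are addressed by dotted paths
      .select(
        col("feed_timestamp"),
        col("e.id"),
        col("e.vehicle.position.latitude"),
        col("e.vehicle.position.longitude"))

    val query = vehicles.writeStream.format("console").start()
    query.processAllAvailable() // blocks until the files present so far are processed
    query.stop()
    spark.stop()
  }
}
```

In a real application the sample file would normally live outside the streamed directory, and the query would keep running rather than stop after the first batch.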
Source: https://stackoverflow.com/questions/50760682/how-to-infer-schema-of-json-files