I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON can be very large, with more than 1000 keys per row, and one batch could …
If you have data as you mentioned in the question, such as
val data = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)
You don't need to create a schema for the JSON data; Spark SQL can infer the schema from the JSON strings. You just have to use SQLContext.read.json, as below:
val df = sqlContext.read.json(data)
which will give you the following schema for the RDD used above (printed with df.printSchema()):
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as:
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
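Since the nested field is addressable by its dotted path, you can also flatten it into a top-level column alongside the others. A minimal sketch, assuming the same df as above in a Spark shell (the alias key3_k1 is my own choice, not anything Spark requires):
import org.apache.spark.sql.functions.col
// select the two top-level strings and pull the nested field up with an alias
val flat = df.select(col("key1"), col("key2"), col("key3.key3_k1").alias("key3_k1"))
flat.show(false)
//+----+-------+-------+
//|key1|key2   |key3_k1|
//+----+-------+-------+
//|val1|source1|key3_v1|
//+----+-------+-------+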
You can manipulate the DataFrame however you wish from there.
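Schema inference also copes with the variable structure you mention: rows that are missing a key simply come back with null for that column. A minimal sketch, adding a hypothetical second row that omits key3:
val mixed = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: """{"key1":"val2","key2":"source2"}"""
    :: Nil)
// the inferred schema is the union of the keys seen across rows
sqlContext.read.json(mixed).select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//|null   |
//+-------+
I hope the answer is helpful.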