I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON can be very large, with more than 1000 keys per row, and one batch could …
If you have data as you mentioned in the question, such as
val data = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)
You don't need to create a schema for the JSON data; Spark SQL can infer the schema from the JSON strings. You just have to use SQLContext.read.json, as below:
val df = sqlContext.read.json(data)
which will give you the following schema for the RDD used above (printed with df.printSchema()):
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as:
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
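Since the nested field is addressable by its dotted path, you can also flatten it into a top-level column alongside the others. A minimal sketch, assuming the same df as above in a Spark shell (the alias key3_k1 is my own choice, not anything Spark requires):
import org.apache.spark.sql.functions.col
// select the two top-level strings and pull the nested field up with an alias
val flat = df.select(col("key1"), col("key2"), col("key3.key3_k1").alias("key3_k1"))
flat.show(false)
//+----+-------+-------+
//|key1|key2   |key3_k1|
//+----+-------+-------+
//|val1|source1|key3_v1|
//+----+-------+-------+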
You can manipulate the DataFrame however you wish from there.
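Schema inference also copes with the variable structure you mention: rows that are missing a key simply come back with null for that column. A minimal sketch, adding a hypothetical second row that omits key3:
val mixed = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: """{"key1":"val2","key2":"source2"}"""
    :: Nil)
// the inferred schema is the union of the keys seen across rows
sqlContext.read.json(mixed).select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//|null   |
//+-------+
I hope the answer is helpful.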