Spark from_json with dynamic schema

无人及你 2021-02-08 02:37

I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON could be very large, with more than 1000 keys per row, and one batch cou

3 Answers
  •  渐次进展
    2021-02-08 03:32

    If you have the data you mentioned in the question:

    val data = sc.parallelize(
        """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
        :: Nil)
    

    You don't need to create a schema for the JSON data; Spark SQL can infer the schema from the JSON string. You just have to use SQLContext.read.json as below:

    val df = sqlContext.read.json(data)
    

    which will give you the schema below for the RDD data used above:

    root
     |-- key1: string (nullable = true)
     |-- key2: string (nullable = true)
     |-- key3: struct (nullable = true)
     |    |-- key3_k1: string (nullable = true)
    

    And you can just select key3_k1 as

    df.select("key3.key3_k1").show(false)
    //+-------+
    //|key3_k1|
    //+-------+
    //|key3_v1|
    //+-------+
    

    You can manipulate the dataframe as you wish. I hope the answer is helpful.
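
    Since the question title asks specifically about from_json with a dynamic schema: one common approach (a sketch, not the only way) is to infer the schema from a sample JSON string at runtime with schema_of_json (available since Spark 2.4) and pass it to from_json. The column and variable names below are illustrative, and inferring from a single sample assumes that row contains all the keys you care about; for heterogeneous batches, spark.read.json over the whole dataset (as in the answer above) gives a more complete schema.

    ```scala
    import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}
    import spark.implicits._  // assumes a SparkSession named `spark` is in scope

    // A DataFrame with the JSON payload held as a plain string column
    val jsonDf = Seq(
      """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    ).toDF("json_str")

    // Sample one row and let Spark infer its schema dynamically
    val sampleJson = jsonDf.head().getString(0)

    // Parse the string column into a struct using the inferred schema
    val parsed = jsonDf.withColumn(
      "parsed",
      from_json(col("json_str"), schema_of_json(lit(sampleJson)))
    )

    parsed.select("parsed.key3.key3_k1").show(false)
    ```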
