SPARK: How to parse an array of JSON objects using Spark

灰色年华 2021-01-14 12:29

I have a file with normal columns and a column that contains a JSON string, as shown below (picture also attached). Each row actually belongs to a column named Demo(not Vis

2 Answers
  • 2021-01-14 13:10

    If your column with JSON looks like this:

        import spark.implicits._

        val inputDF = Seq(
          ("""[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""),
          ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}]"""),
          ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}]""")
        ).toDF("Demographics")

        inputDF.show(false)
    +-------------------------------------------------------------------------------------------------------------------------+
    |Demographics                                                                                                             |
    +-------------------------------------------------------------------------------------------------------------------------+
    |[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]|
    |[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}] |
    |[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}] |
    +-------------------------------------------------------------------------------------------------------------------------+
    

    you can try to parse the column in the following way:

        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.functions.col

        // Parse the JSON string into an array<struct<key,value>> column
        val parsedJson: DataFrame = inputDF.selectExpr(
          "Demographics",
          "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json"
        )

        // Pick each struct out of the parsed array by position
        val splitted = parsedJson.select(
          col("parsed_json").as("Demographics"),
          col("parsed_json").getItem(0).as("device_kind_json"),
          col("parsed_json").getItem(1).as("country_code_json"),
          col("parsed_json").getItem(2).as("device_platform_json")
        )

        // Keep only the value field of each struct
        val result = splitted.select(
          col("Demographics"),
          col("device_kind_json.value").as("device_kind"),
          col("country_code_json.value").as("country_code"),
          col("device_platform_json.value").as("device_platform")
        )

        result.show(false)
    

    You will get the output:

    +------------------------------------------------------------------------+-----------+------------+---------------+
    |Demographics                                                            |device_kind|country_code|device_platform|
    +------------------------------------------------------------------------+-----------+------------+---------------+
    |[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
    |[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
    |[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
    +------------------------------------------------------------------------+-----------+------------+---------------+
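
    Note that getItem(0/1/2) assumes the key/value pairs always appear in the same order inside the array. If that ordering is not guaranteed, a minimal sketch of an order-independent variant (assuming Spark 2.4+, where map_from_entries is available) converts the array into a map and looks the values up by key:

        import org.apache.spark.sql.functions.{col, expr}

        // Turn array<struct<key,value>> into map<string,string> and fetch each value
        // by key, so the result does not depend on the position of the entries.
        val byKey = parsedJson
          .withColumn("demo_map", expr("map_from_entries(parsed_json)"))
          .select(
            col("Demographics"),
            col("demo_map").getItem("device_kind").as("device_kind"),
            col("demo_map").getItem("country_code").as("country_code"),
            col("demo_map").getItem("device_platform").as("device_platform")
          )

        byKey.show(false)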
    
  • 2021-01-14 13:22

    Aleh, thank you for the answer. It works fine. I did the solution in a slightly different way because I am using Spark 2.3.3.

        import org.apache.spark.sql.functions.{col, expr, from_json}
        import org.apache.spark.sql.types._

        // Explicit schema for the JSON array of {key, value} objects
        val sch = ArrayType(StructType(Array(
          StructField("key", StringType, true),
          StructField("value", StringType, true)
        )))

        // mdf is the input DataFrame; jsonString is the column holding the JSON text
        val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

        // Pull each value out of the parsed array by position
        val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
          .withColumn("country_code", expr("Demographics[1].value"))
          .withColumn("device_platform", expr("Demographics[2].value"))
    