Spark from_json with dynamic schema

无人及你 2021-02-08 02:37

I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON data can be very large, with more than 1000 keys per row, and one batch cou

3 Answers
  •  日久生厌
    2021-02-08 03:23

    This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.

    I found this method lurking in DataFrameReader. It parses JSON strings from a Dataset[String] into a DataFrame and applies the same schema inference that spark.read.json("filepath") performs when reading directly from a JSON file. The schema of each row can be completely different: Spark merges them into a single schema, and keys missing from a given row come back as null.

    def json(jsonDataset: Dataset[String]): DataFrame
    

    Example usage:

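    // Provides the Encoder[String] that createDataset needs (already in scope in spark-shell)
    import spark.implicits._

    val jsonStringDs = spark.createDataset[String](
      Seq(
          """{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}""",
          """{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}"""))
    
    // show(false) disables truncation so the full JSON strings are printed
    jsonStringDs.show(false)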
    
    jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
    +----------------------------------------------------------------------------------------------------------------------+
    |value                                                                                                                 |
    +----------------------------------------------------------------------------------------------------------------------+
    |{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
    |{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}  |
    +----------------------------------------------------------------------------------------------------------------------+
    
    
    val df = spark.read.json(jsonStringDs)
    df.show(false)
    
    df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    |CEO       |address           |employeeCount|firstname|lastname|marketCap   |name  |revenue     |
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    |null      |[London,Baker,121]|null         |Sherlock |Holmes  |null        |null  |null        |
    |Jeff Bezos|null              |500000       |null     |null    |817117000000|Amazon|177900000000|
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    

    The method is available from Spark 2.2.0: http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
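
Since the question title asks about from_json, here is a minimal sketch of how the schema inferred by json(Dataset[String]) could be fed back into from_json when the JSON lives in a string column next to other columns. The helper name parseWithInferredSchema, the column names id, payload and parsed, and the local SparkSession settings are all hypothetical and chosen only for illustration. Note that inferring the schema triggers an extra pass over the data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

object FromJsonInferredSchema {

  // Hypothetical helper: infers one merged schema from the JSON strings in
  // `jsonCol`, then parses that column into a struct column named "parsed".
  def parseWithInferredSchema(spark: SparkSession, raw: DataFrame, jsonCol: String): DataFrame = {
    import spark.implicits._
    // Schema inference reads the data once before the actual parsing happens.
    val inferredSchema = spark.read.json(raw.select(col(jsonCol)).as[String]).schema
    // Keys that are absent in a given row become null in the parsed struct.
    raw.withColumn("parsed", from_json(col(jsonCol), inferredSchema))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-inferred-schema")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: an id column next to a raw JSON string column.
    val raw = Seq(
      (1, """{"firstname": "Sherlock", "address": {"city": "London"}}"""),
      (2, """{"name": "Amazon", "employeeCount": 500000}""")
    ).toDF("id", "payload")

    val parsed = parseWithInferredSchema(spark, raw, "payload")

    // Nested fields of the parsed struct are reachable with dot notation.
    parsed.select(col("id"), col("parsed.address.city"), col("parsed.name")).show(false)

    spark.stop()
  }
}
```

Unlike calling spark.read.json on the strings alone, this keeps the other columns (here id) attached to each parsed row.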
