Spark from_json with dynamic schema

无人及你 2021-02-08 02:37

I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON data could be very large, with more than 1000 keys per row, and one batch cou

3 Answers
  • 2021-02-08 03:11

    I am not sure if my suggestion will help you, but I had a similar case and I solved it as follows:

    1) The idea is to use Rapture JSON (or some other JSON library) to load the JSON schema dynamically. For instance, you could read the first row of the JSON file to discover the schema, similar to what I do here with jsonSchema (see the small sketch after this list).

    2) Generate the schema dynamically: iterate through the dynamic fields (notice that I project the values of key3 as Map[String, String]) and add a StructField to the schema for each of them.

    3) Apply the generated schema to your dataframe.
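
    As a small illustration of step 1, here is a sketch (my addition, not the author's original code) that reads only the first line of the file as the schema sample instead of hard-coding it; the path is a placeholder:

    import scala.io.Source

    // read just the first JSON record to use as a schema sample
    val source = Source.fromFile("your_path/data.json")
    val jsonSchema = try source.getLines().next() finally source.close()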

    import rapture.json._
    import jsonBackends.jackson._
    
    val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
    val json = Json.parse(jsonSchema)
    
    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.sql.types.StructField
    import org.apache.spark.sql.types.{StringType, StructType}
    
    val schema = ArrayBuffer[StructField]()
    //we could do this dynamic as well with json rapture
    schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
    
    val items = ArrayBuffer[StructField]()
    json.key3.as[Map[String, String]].foreach {
      case (k, _) => items.append(StructField(k, StringType))
    }
    val complexColumn = new StructType(items.toArray)
    schema.append(StructField("key3", complexColumn))
    
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
    
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    
    val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
    
    jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
    

    I used the next data as input:

    {"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
    {"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
    {"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
    

    And the output:

    +----+-------+--------+--------+--------+
    |key1|   key2| key3_k1| key3_k2| key3_k3|
    +----+-------+--------+--------+--------+
    |val1|source1|key3_v11|key3_v21|key3_v31|
    |val2|source2|key3_v12|key3_v22|key3_v32|
    |val3|source3|key3_v13|key3_v23|key3_v33|
    +----+-------+--------+--------+--------+
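
    Since the fields under key3 are generated at runtime, you can also flatten the struct without naming each key. A small variation (my addition, not part of the original answer) using Spark's star expansion on a struct column:

    jsonDF.select("key1", "key2", "key3.*").show()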
    

    An advanced alternative, which I haven't tested yet, would be to generate a case class (e.g. called JsonRow) from the JSON schema in order to get a strongly typed Dataset, which gives better serialization performance and also makes your code more maintainable. To make this work, you first need to create a JsonRow.scala file and then implement an sbt pre-build script that modifies the content of JsonRow.scala (you might have more than one, of course) dynamically, based on your source files. To generate the JsonRow class dynamically you can use the next code:

    def generateClass(members: Map[String, String], name: String): String = {
      // render each member as "memberName: String"
      val classMembers = for (m <- members) yield s"${m._1}: String"

      s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
    }
    

    The generateClass method accepts a map of strings for the class members and the class name itself. The members of the generated class can again be populated from your JSON schema:

    import org.codehaus.jackson.node.{ObjectNode, TextNode}
    import collection.JavaConversions._
    
    val mapping = collection.mutable.Map[String, String]()
    val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
    
    for (f <- fields) {
      (f.getKey, f.getValue) match {
        case (k: String, v: TextNode) => mapping(k) = v.asText
        case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
        case _ => None
      }
    }
    
    val dynClass = generateClass(mapping.toMap, "JsonRow")
    println(dynClass)
    

    This prints out:

    case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
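
    The generated definition is only source text; to turn it into a runtime class you still have to compile it somewhere. If, instead of the sbt pre-build approach, you want to evaluate it on the fly, here is a minimal sketch using the Scala reflection toolbox (my addition, not the author's suggested approach; it assumes scala-compiler is on the classpath):

    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    val toolbox = currentMirror.mkToolBox()
    // parse and evaluate "case class JsonRow(...);scala.reflect.classTag[JsonRow].runtimeClass"
    val runtimeClass = toolbox.eval(toolbox.parse(dynClass)).asInstanceOf[Class[_]]
    println(runtimeClass.getName)  // JsonRow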
    

    Good luck

  • 2021-02-08 03:23

    This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.

    I found this method lurking in DataFrameReader; it allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and to take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.

    def json(jsonDataset: Dataset[String]): DataFrame
    

    Example usage:

    import spark.implicits._  // provides the Encoder[String] needed by createDataset

    val jsonStringDs = spark.createDataset[String](
      Seq(
        """{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}""",
        """{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}"""))
    
    jsonStringDs.show
    
    jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
    +----------------------------------------------------------------------------------------------------------------------+
    |value                                                                                                                 |
    +----------------------------------------------------------------------------------------------------------------------+
    |{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
    |{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}  |
    +----------------------------------------------------------------------------------------------------------------------+
    
    
    val df = spark.read.json(jsonStringDs)
    df.show(false)
    
    df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    |CEO       |address           |employeeCount|firstname|lastname|marketCap   |name  |revenue     |
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    |null      |[London,Baker,121]|null         |Sherlock |Holmes  |null        |null  |null        |
    |Jeff Bezos|null              |500000       |null     |null    |817117000000|Amazon|177900000000|
    +----------+------------------+-------------+---------+--------+------------+------+------------+
    

    The method is available from Spark 2.2.0: http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
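
    Since the question title mentions from_json: once you have inferred a schema from a sample Dataset[String] like this, you can reuse it with from_json to parse a JSON string column inside a larger DataFrame. A sketch (my addition, not part of the original answer; the column name "raw" is made up):

    import org.apache.spark.sql.functions.from_json
    import spark.implicits._  // if not already imported

    // infer the schema once from the sample, then apply it to a string column
    val inferredSchema = spark.read.json(jsonStringDs).schema

    val parsed = jsonStringDs.toDF("raw")
      .select(from_json($"raw", inferredSchema).as("parsed"))

    parsed.select("parsed.*").show(false)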

  • 2021-02-08 03:32

    If you have data as mentioned in the question, for example:

    val data = sc.parallelize(
        """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
        :: Nil)
    

    You don't need to create a schema for the JSON data. Spark SQL can infer the schema from the JSON string; you just have to use sqlContext.read.json as below:

    val df = sqlContext.read.json(data)
    

    which will give you the schema below for the RDD data used above:

    root
     |-- key1: string (nullable = true)
     |-- key2: string (nullable = true)
     |-- key3: struct (nullable = true)
     |    |-- key3_k1: string (nullable = true)
    

    And you can just select key3_k1 as

    df.select("key3.key3_k1").show(false)
    //+-------+
    //|key3_k1|
    //+-------+
    //|key3_v1|
    //+-------+
    

    You can manipulate the dataframe as you wish; for example, you could flatten the nested field as sketched below. I hope the answer is helpful.
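
    A small example (my addition, not from the original answer): pull the nested field up to a top-level column and drop the struct.

    import org.apache.spark.sql.functions.col

    // flatten the nested field into a top-level column, then drop the struct
    val flatDf = df.withColumn("key3_k1", col("key3.key3_k1")).drop("key3")
    flatDf.show(false)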
