Spark Row to JSON

灰色年华 2020-11-30 03:19

I would like to create JSON from a Spark 1.6 DataFrame (using Scala). I know there is the simple option of calling df.toJSON.

However, my problem is that df.toJSON turns every column into a top-level JSON field, whereas I want only a subset of the columns (C1, C2 and C3 in the examples below) combined into a single JSON field, with the remaining columns (A and B) kept as they are.
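
For reference, here is a minimal sketch of the input and of what plain df.toJSON produces (column names and values assumed to match the answers below):

    import sqlContext.implicits._  // assumes a Spark 1.6 SQLContext in scope

    val df = Seq(
      (1, "test", "ab", 22, true),
      (2, "mytest", "gh", 17, false)
    ).toDF("A", "B", "C1", "C2", "C3")

    df.toJSON.collect()
    // Array({"A":1,"B":"test","C1":"ab","C2":22,"C3":true},
    //       {"A":2,"B":"mytest","C1":"gh","C2":17,"C3":false})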

4 answers
  • 2020-11-30 04:02

    I use this command (PySpark; to_json requires Spark 2.1+) to solve the to_json problem:

    from pyspark.sql.functions import col, struct, to_json

    output_df = df.select(to_json(struct(col("*"))).alias("content"))
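
    For the Scala API, the equivalent one-liner would be (a sketch; to_json likewise requires Spark 2.1+):

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Pack all columns into one struct, then serialize that struct to JSON.
    val outputDf = df.select(to_json(struct(col("*"))).alias("content"))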
    
  • 2020-11-30 04:11

    Here is an approach that uses no JSON parser and adapts to your schema:

    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    df.select(
      // Keep the first two columns (A and B) as-is.
      col(df.columns(0)),
      col(df.columns(1)),
      // Hand-build a JSON object from every remaining column,
      // quoting values only for string columns.
      concat(
        lit("{"),
        concat_ws(",", df.dtypes.slice(2, df.dtypes.length).map { dt =>
          val (c, t) = dt
          concat(
            lit("\"" + c + "\":" + (if (t == "StringType") "\"" else "")),
            col(c),
            lit(if (t == "StringType") "\"" else "")
          )
        }: _*),
        lit("}")
      ) as "C"
    ).collect()
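
    With the toy DataFrame from the question, this returns rows like the following (a sketch; Row's default rendering may differ slightly):

    // Row values: A, B, and the hand-built JSON string C
    // [1,test,{"C1":"ab","C2":22,"C3":true}]
    // [2,mytest,{"C1":"gh","C2":17,"C3":false}]

    One caveat: string values are quoted but not escaped, so this breaks if a value itself contains a double quote.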
    
  • 2020-11-30 04:14

    First, let's convert the C columns to a struct:

    val dfStruct = df.select($"A", $"B", struct($"C1", $"C2", $"C3").alias("C"))
    

    This structure can be converted to JSONL using toJSON as before:

    dfStruct.toJSON.collect
    // Array[String] = Array(
    //   {"A":1,"B":"test","C":{"C1":"ab","C2":22,"C3":true}}, 
    //   {"A":2,"B":"mytest","C":{"C1":"gh","C2":17,"C3":false}})
    

    I am not aware of any built-in method that can convert a single column to a JSON string, but you can either convert it individually and join, or use your favorite JSON parser in a UDF:

    case class C(C1: String, C2: Int, C3: Boolean)
    
    object CJsonizer {
      import org.json4s._
      import org.json4s.jackson.Serialization
      import org.json4s.jackson.Serialization.write
    
      implicit val formats = Serialization.formats(org.json4s.NoTypeHints)
    
      // Serialize the case class to a JSON string via json4s/Jackson.
      def toJSON(c1: String, c2: Int, c3: Boolean) = write(C(c1, c2, c3))
    }
    
    
    import org.apache.spark.sql.functions.udf

    val cToJSON = udf((c1: String, c2: Int, c3: Boolean) =>
      CJsonizer.toJSON(c1, c2, c3))
    
    df.withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))
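
    The new c_json column then holds strings like {"C1":"ab","C2":22,"C3":true}. A sketch of also dropping the now-redundant source columns (Spark 1.6's drop takes one column name at a time):

    val result = df
      .withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))
      .drop("C1").drop("C2").drop("C3")  // keep only A, B and c_json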
    
  • 2020-11-30 04:15

    Spark 2.1 should have native support for this use case (see #15354).

    import org.apache.spark.sql.functions.{struct, to_json}

    df.select(to_json(struct($"c1", $"c2", $"c3")))
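
    To keep the other columns and name the result, a sketch using the toy schema from the earlier answers (the alias "C" is my choice):

    df.select($"A", $"B", to_json(struct($"C1", $"C2", $"C3")).alias("C"))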
    