Spark DataFrame serialized as invalid json


Question


TL;DR: When I dump a Spark DataFrame as json, I always end up with something like

{"key1": "v11", "key2": "v21"}
{"key1": "v12", "key2": "v22"}
{"key1": "v13", "key2": "v23"}

which is invalid json. I can manually edit the dumped file to get something I can parse:

[
  {"key1": "v11", "key2": "v21"},
  {"key1": "v12", "key2": "v22"},
  {"key1": "v13", "key2": "v23"}
]

but I'm pretty sure I'm missing something that would let me avoid this manual edit. I just don't know what.

More details:

I have a org.apache.spark.sql.DataFrame and I try dumping it to json using the following code:

myDataFrame.write.json("file.json")

I also tried with:

myDataFrame.toJSON.saveAsTextFile("file.json")

In both cases each row is dumped correctly, but the rows are missing separating commas, as well as the enclosing square brackets. Consequently, when I subsequently try to parse the file, the parser I use insults me and then fails.

I would be grateful to learn how I can dump valid JSON. (Reading the documentation of the DataFrameWriter didn't provide me with any interesting hints.)


Answer 1:


This is the expected output. Spark uses a JSON Lines-like format for a number of reasons:

  • It can be parsed and loaded in parallel (see the read-back sketch after this list).
  • Parsing can be done without loading the full file into memory.
  • It can be written in parallel.
  • It can be written without keeping a complete partition in memory.
  • It is valid input even if the file is empty.
  • Finally, a Row in Spark is a struct, which maps to a JSON object, not an array.
  • ...
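
In particular, the first two points mean Spark's own reader consumes this output directly. A minimal read-back sketch, assuming a SparkSession named spark (on 1.6, use sqlContext.read.json instead):

// No manual editing needed: the JSON Lines directory written by
// myDataFrame.write.json("file.json") round-trips as-is.
val restored = spark.read.json("file.json")
restored.show()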

You can create the desired output in a few ways, but it will always conflict with at least one of the points above.

For example, you can write a single JSON document for each partition:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax (sqlContext.implicits._ on 1.6)

df
  // one group per physical Spark partition
  .groupBy(spark_partition_id())
  // pack every column of each row into a struct, then collect the structs into one array
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)  // output_path is a placeholder for your target directory
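
With the columns from the question, each line of the resulting output is then one self-contained JSON document per non-empty partition, roughly of this shape (values illustrative; how rows split across lines depends on the partitioning):

{"data":[{"key1":"v11","key2":"v21"},{"key1":"v12","key2":"v22"}]}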

You could prepend this with repartition(1) to get a single output file, but it is not something you want to do unless the data is very small.
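
A minimal sketch of that single-file variant, assuming the same placeholder output_path as above:

df
  .repartition(1)  // force everything into one partition: one part file, one JSON document
  .groupBy(spark_partition_id())
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)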

In Spark 1.6, an alternative would be glom:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// wrap the whole DataFrame schema in a single array-typed "data" column
val newSchema = StructType(Seq(StructField("data", ArrayType(df.schema))))

sqlContext.createDataFrame(
  // glom turns each partition into an Array[Row]; empty partitions are dropped
  df.rdd.glom.flatMap(a => if (a.isEmpty) Seq() else Seq(Row(a))),
  newSchema
)
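
As before, writing the result with .write.json(output_path) yields one JSON document per non-empty partition; the isEmpty check simply drops empty partitions so they do not contribute empty arrays to the output.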


Source: https://stackoverflow.com/questions/48503419/spark-dataframe-serialized-as-invalid-json
