问题
For example, I have one full set of nested JSON, I need to update this JSON with the latest values from another nested JSON.
Can anyone help me with this?
I want to implement this in Pyspark.
Full Set Json look like this:
{
"email": "abctest@xxx.com",
"firstName": "name01",
"id": 6304,
"surname": "Optional",
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
LatestJson look like this:
{
"email": "test@xxx.com",
"firstName": "name01",
"surname": "Optional",
"id": 6304,
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1_changedData",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
In above for id=6304
we have received updates for the layer01.layer02.key1
and emailaddress
fileds.
So I need to update these values to full JSON, Kindly help me with this.
回答1:
You can load the 2 JSON files into Spark data frames and do a left_join
to get updates from the latest JSON data :
from pyspark.sql import functions as F
full_json_df = spark.read.json(full_json_path, multiLine=True)
latest_json_df = spark.read.json(latest_json_path, multiLine=True)
updated_df = full_json_df.alias("full").join(
latest_json_df.alias("latest"),
F.col("full.id") == F.col("latest.id"),
"left"
).select(
F.col("full.id"),
*[
F.when(F.col("latest.id").isNotNull(), F.col(f"latest.{c}")).otherwise(F.col(f"full.{c}")).alias(c)
for c in full_json_df.columns if c != 'id'
]
)
updated_df.show(truncate=False)
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|id |email |firstName|layer01 |surname |
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|6304|test@xxx.com|name01 |[value1, value2, value3, value4, [value1_changedData, value2], [[inner value01,], [, inner_value02]]]|Optional|
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
Update:
If the schema changes between full and latest JSONs, you can load the 2 files into the same data frame (this way the schemas are being merged) and then deduplicate per id
:
from pyspark.sql import Window
from pyspark.sql import functions as F
merged_json_df = spark.read.json("/path/to/{full_json.json,latest_json.json}", multiLine=True)
# order priority: latest file then full
w = Window.partitionBy(F.col("id")).orderBy(F.when(F.input_file_name().like('%latest%'), 0).otherwise(1))
updated_df = merged_json_df.withColumn("rn", F.row_number().over(w))\
.filter("rn = 1")\
.drop("rn")
updated_df.show(truncate=False)
来源:https://stackoverflow.com/questions/65913892/update-the-nested-json-with-another-nested-json-using-python