Question
I have a Kafka producer sending large amounts of data in this format:
{
  '1000': {
    '3': {
      'seq': '1',
      'state': '2',
      'CMD': 'XOR'
    }
  },
  '1001': {
    '5': {
      'seq': '2',
      'state': '2',
      'CMD': 'OR'
    }
  },
  '1003': {
    '5': {
      'seq': '3',
      'state': '4',
      'CMD': 'XOR'
    }
  }
}
....
The data I want is in the innermost object: {'seq': '1', 'state': '2', 'CMD': 'XOR'}. The keys at the levels above it ('1000' and '3') are variable. Note that the values above are only an example; the real dataset is huge, with many variable keys. Only the keys of the innermost object ({'seq', 'state', 'CMD'}) are constant.
I have tried reading the data with generic formats, but I get incorrect results because the outer levels have variable keys, and I am not sure how to define a schema that parses data in this shape.
The output I am trying to achieve is a DataFrame in this format:
seq  state  CMD
----------------
1    2      XOR
2    2      OR
3    4      XOR
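Since the outer keys vary, one way to parse the value without enumerating those keys is a schema built from nested MapType. A minimal sketch, assuming the Kafka message value is the JSON string shown above; the bootstrap servers, topic name, and spark session are placeholders, not details from the question:

import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

# The variable outer keys cannot be listed in a StructType, but a
# map<string, map<string, map<string, string>>> absorbs all of them.
value_schema = MapType(
    StringType(),
    MapType(StringType(), MapType(StringType(), StringType()))
)

raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
       .option("subscribe", "some_topic")                     # placeholder
       .load())

parsed = raw.select(F.from_json(F.col("value").cast("string"),
                                value_schema).alias("data"))

# Explode both map levels, then pull out the three constant keys.
result = (parsed
          .select(F.explode("data").alias("k1", "v1"))   # '1000', '1001', ...
          .select(F.explode("v1").alias("k2", "v2"))     # '3', '5', ...
          .select(F.col("v2")["seq"].alias("seq"),
                  F.col("v2")["state"].alias("state"),
                  F.col("v2")["CMD"].alias("CMD")))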
Answer 1:
This can be a working solution for you: use explode() and getItem() as below.
Load the JSON into a DataFrame:
import pyspark.sql.functions as F

a_json = {
    '1000': {
        '3': {
            'seq': '1',
            'state': '2',
            'CMD': 'XOR'
        }
    }
}
# Each top-level key becomes a column whose value is inferred as a nested map.
df = spark.createDataFrame([a_json])
df.show(truncate=False)
+-----------------------------------------+
|1000 |
+-----------------------------------------+
|[3 -> [CMD -> XOR, state -> 2, seq -> 1]]|
+-----------------------------------------+
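For reference, the inferred type of the 1000 column is a nested map, which is what makes explode() applicable here; printSchema() should report roughly the following (a sketch of the expected inference, not output from the original answer):

df.printSchema()
# root
#  |-- 1000: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: map (valueContainsNull = true)
#  |    |    |-- key: string
#  |    |    |-- value: string (valueContainsNull = true)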
Logic:
df = df.select("*", F.explode("1000").alias("x", "y"))  # x = inner key, y = inner map
df = (df.withColumn("seq", df.y.getItem("seq"))
        .withColumn("state", df.y.getItem("state"))
        .withColumn("CMD", df.y.getItem("CMD")))
df.show(truncate=False)
+-----------------------------------------+---+----------------------------------+---+-----+---+
|1000 |x |y |seq|state|CMD|
+-----------------------------------------+---+----------------------------------+---+-----+---+
|[3 -> [CMD -> XOR, state -> 2, seq -> 1]]|3 |[CMD -> XOR, state -> 2, seq -> 1]|1 |2 |XOR|
+-----------------------------------------+---+----------------------------------+---+-----+---+
Updating the Code based on Further Inputs
# Assuming all the JSON objects sit in separate top-level columns,
# collect them into a single array column first.
df = df.withColumn("array_col", F.array("1000", "1001", "1003"))
# Then explode twice (array, then map) and use getItem()
df = df.withColumn("exploded_col", F.explode("array_col"))
df = df.select("*", F.explode("exploded_col").alias("x", "y"))
df_final = (df.withColumn("seq", df.y.getItem("seq"))
              .withColumn("state", df.y.getItem("state"))
              .withColumn("CMD", df.y.getItem("CMD")))
df_final.select("seq", "state", "CMD").show()
+---+-----+---+
|seq|state|CMD|
+---+-----+---+
|  1|    2|XOR|
|  2|    2| OR|
|  3|    4|XOR|
+---+-----+---+
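The column list above is hardcoded, which sits awkwardly with the question's variable outer keys. If every top-level column holds one of these nested maps, the same pipeline can be built from whatever columns the DataFrame actually has. A small variation on the answer's code (an assumption on my part, not part of the original answer), starting from the freshly loaded df with one column per outer key:

# Build the array from every top-level column, so new outer keys
# ('1004', '1005', ...) require no code change.
df = df.withColumn("array_col", F.array(*[F.col(c) for c in df.columns]))
df = df.select(F.explode("array_col").alias("exploded_col"))
df = df.select(F.explode("exploded_col").alias("x", "y"))
df_final = df.select(df.y.getItem("seq").alias("seq"),
                     df.y.getItem("state").alias("state"),
                     df.y.getItem("CMD").alias("CMD"))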
Source: https://stackoverflow.com/questions/64640565/get-data-from-nested-json-in-kafka-stream-pyspark