Get data from nested JSON in Kafka stream PySpark

Posted by 孤者浪人 on 2020-11-29 23:56:39

Question


I have a Kafka producer sending large amounts of data in the following format:

{
  '1000':
    {
      '3':
        {
          'seq': '1',
          'state': '2',
          'CMD': 'XOR'
        }
    },
  '1001':
    {
      '5':
        {
          'seq': '2',
          'state': '2',
          'CMD': 'OR'
        }
    },
  '1003':
    {
      '5':
        {
          'seq': '3',
          'state': '4',
          'CMD': 'XOR'
        }
    }
}

....

The data I want is at the innermost level, e.g. {'seq': '1', 'state': '2', 'CMD': 'XOR'}. The keys at the levels above it ('1000' and '3') are variable. Please note that the values above are only an example: the original dataset is huge, with many variable keys. Only the keys at the innermost level ('seq', 'state', 'CMD') are constant.

I have tried the generic formats for reading the data, but I get incorrect results because the outer levels have variable keys, and I am not sure how to define a schema that parses data in this format.

The output I am trying to achieve is a DataFrame of the following form:

seq    state     CMD
----------------------
 1       2       XOR
 2       2        OR
 3       4       XOR

Answer 1:


This can be a working solution for you: use explode() and getItem() as below.

Load the JSON into a DataFrame:

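from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One sample message; each top-level key becomes a map-typed column.
a_json = {
  '1000':
    {
      '3':
        {
          'seq': '1',
          'state': '2',
          'CMD': 'XOR'
        }
    }
}
df = spark.createDataFrame([a_json])
df.show(truncate=False)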

+-----------------------------------------+
|1000                                     |
+-----------------------------------------+
|[3 -> [CMD -> XOR, state -> 2, seq -> 1]]|
+-----------------------------------------+

The logic:

df = df.select("*", F.explode("1000").alias("x", "y"))
df = (
    df.withColumn("seq", df.y.getItem("seq"))
      .withColumn("state", df.y.getItem("state"))
      .withColumn("CMD", df.y.getItem("CMD"))
)
df.show(truncate=False)


+-----------------------------------------+---+----------------------------------+---+-----+---+
|1000                                     |x  |y                                 |seq|state|CMD|
+-----------------------------------------+---+----------------------------------+---+-----+---+
|[3 -> [CMD -> XOR, state -> 2, seq -> 1]]|3  |[CMD -> XOR, state -> 2, seq -> 1]|1  |2    |XOR|
+-----------------------------------------+---+----------------------------------+---+-----+---+
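
To see what explode() and getItem() are doing here without starting a Spark session, the same unpacking can be sketched in plain Python. This is only an illustration: the helper name `innermost_records` is hypothetical, and the sample message uses double quotes so that it is valid JSON for `json.loads`.

```python
import json


def innermost_records(payload: dict) -> list:
    """Collect (seq, state, CMD) tuples, ignoring both levels of variable keys."""
    rows = []
    for inner in payload.values():      # level 1 keys: '1000', '1001', ...
        for record in inner.values():   # level 2 keys: '3', '5', ...
            rows.append((record["seq"], record["state"], record["CMD"]))
    return rows


# Hypothetical sample message with the same shape as the data above.
message = '{"1000": {"3": {"seq": "1", "state": "2", "CMD": "XOR"}}}'
print(innermost_records(json.loads(message)))
```

Iterating over the values of each level is the plain-Python counterpart of exploding a map column and keeping only the value side.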

Updated code, based on further input:

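# Assumes df was freshly loaded from the full example, with one map column per
# top-level key ("1000", "1001", "1003"). Combine them into a single array column first.
# The names are hard-coded here; with variable keys, F.array(*df.columns) on the
# freshly loaded DataFrame should pick them up instead.
df = df.withColumn("array_col", F.array("1000", "1001", "1003"))
# Explode the array (one row per map), then explode each map into key (x) and value (y).
df = df.withColumn("explod_col", F.explode("array_col"))
df = df.select("*", F.explode("explod_col").alias("x", "y"))
# Pull the three constant fields out of the innermost map.
df_final = df.withColumn("seq", df.y.getItem("seq")).withColumn("state", df.y.getItem("state")).withColumn("CMD", df.y.getItem("CMD"))
df_final.select("seq", "state", "CMD").show()

+---+-----+---+
|seq|state|CMD|
+---+-----+---+
|  1|    2|XOR|
|  2|    2| OR|
|  3|    4|XOR|
+---+-----+---+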


Source: https://stackoverflow.com/questions/64640565/get-data-from-nested-json-in-kafka-stream-pyspark
