How can you parse a string that is json from an existing temp table using PySpark?

Asked 2021-01-05 22:05

I have an existing Spark dataframe that has columns as such:

--------------------
pid | response
--------------------
 12 | {"status":"200"}
1 Answer
  • 2021-01-05 22:24

    From pyspark.sql.functions, you can use any of from_json, get_json_object, or json_tuple to extract fields from a JSON string, as shown below:

    >>> from pyspark.sql.functions import json_tuple, from_json, get_json_object
    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> l = [(12, '{"status":"200"}'),(13,'{"status":"200","somecol":"300"}')]
    >>> df = spark.createDataFrame(l,['pid','response'])
    >>> df.show()
    +---+--------------------+
    |pid|            response|
    +---+--------------------+
    | 12|    {"status":"200"}|
    | 13|{"status":"200","...|
    +---+--------------------+
    
    >>> df.printSchema()
    root
     |-- pid: long (nullable = true)
     |-- response: string (nullable = true)
    
    Using json_tuple:
    >>> df.select('pid',json_tuple(df.response,'status','somecol')).show()
    +---+---+----+
    |pid| c0|  c1|
    +---+---+----+
    | 12|200|null|
    | 13|200| 300|
    +---+---+----+
    
    Using from_json:
    >>> from pyspark.sql.types import StructType, StructField, StringType
    >>> schema = StructType([StructField("status", StringType()), StructField("somecol", StringType())])
    >>> df.select('pid',from_json(df.response, schema).alias("json")).show()
    +---+----------+
    |pid|      json|
    +---+----------+
    | 12|[200,null]|
    | 13| [200,300]|
    +---+----------+
    
    Using get_json_object:
    >>> df.select('pid',get_json_object(df.response,'$.status').alias('status'),get_json_object(df.response,'$.somecol').alias('somecol')).show()
    +---+------+-------+
    |pid|status|somecol|
    +---+------+-------+
    | 12|   200|   null|
    | 13|   200|    300|
    +---+------+-------+
    