How to cast string to ArrayType of dictionary (JSON) in PySpark

Asked 2021-01-13 13:52

Trying to cast StringType to ArrayType of JSON for a dataframe generated from a CSV file.

Using PySpark on Spark 2.

The CSV file I am dealing with has the columns date, attribute2, count, and attribute3, where attribute3 is a string holding a JSON array of objects.
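
For reference, a minimal sketch that constructs an equivalent dataframe directly (the column names and sample values are inferred from the answers below, not taken from the original CSV):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # attribute3 arrives as a plain string containing a JSON array
    df = spark.createDataFrame(
        [("2017-09-03", "attribute1", 2,
          '[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]')],
        ["date", "attribute2", "count", "attribute3"]
    )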

2 Answers
  • 2021-01-13 14:32

    The answer by @Psidom does not work for me because I am using Spark 2.1, where from_json only accepts a StructType schema (support for a top-level ArrayType was added in later versions).

    In my case, I had to slightly modify your attribute3 string to wrap it in a dictionary:

    import pyspark.sql.functions as f

    # Wrap the JSON array in an object so it can be parsed with a StructType schema
    df2 = df.withColumn("attribute3", f.concat(f.lit('{"data": '), "attribute3", f.lit("}")))
    df2.select("attribute3").show(truncate=False)
    #+--------------------------------------------------------------------------------------+
    #|attribute3                                                                            |
    #+--------------------------------------------------------------------------------------+
    #|{"data": [{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]}|
    #+--------------------------------------------------------------------------------------+
    

    Now I can define the schema as follows:

    from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType

    schema = StructType(
        [
            StructField(
                "data",
                ArrayType(
                    StructType(
                        [
                            StructField("key", StringType()),
                            StructField("key2", IntegerType())
                        ]
                    )
                )
            )
        ]
    )
    

    Now use from_json followed by getItem():

    df3 = df2.withColumn("attribute3", f.from_json("attribute3", schema).getItem("data"))
    df3.show(truncate=False)
    #+----------+----------+-----+---------------------------------+
    #|date      |attribute2|count|attribute3                       |
    #+----------+----------+-----+---------------------------------+
    #|2017-09-03|attribute1|2    |[[value,2], [value,2], [value,2]]|
    #+----------+----------+-----+---------------------------------+
    

    And the schema:

    df3.printSchema()
    # root
    # |-- attribute3: array (nullable = true)
    # |    |-- element: struct (containsNull = true)
    # |    |    |-- key: string (nullable = true)
    # |    |    |-- key2: integer (nullable = true)
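
    Once attribute3 is an actual array of structs, its elements can be accessed by position or exploded into rows. A minimal sketch, reusing the df3 from above:

    from pyspark.sql.functions import explode

    # Field of the first array element
    df3.select(df3.attribute3[0]["key2"].alias("first_key2")).show()

    # One row per struct in the array
    df3.select(explode("attribute3").alias("item")).select("item.key", "item.key2").show()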
    
  • 2021-01-13 14:44

    Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to an ArrayType:

    Original data frame:

    df.printSchema()
    #root
    # |-- date: string (nullable = true)
    # |-- attribute2: string (nullable = true)
    # |-- count: long (nullable = true)
    # |-- attribute3: string (nullable = true)
    
    from pyspark.sql.functions import from_json
    from pyspark.sql.types import *
    

    Create the schema:

    schema = ArrayType(
        StructType([StructField("key", StringType()), 
                    StructField("key2", IntegerType())]))
    

    Use from_json:

    df = df.withColumn("attribute3", from_json(df.attribute3, schema))
    
    df.printSchema()
    #root
    # |-- date: string (nullable = true)
    # |-- attribute2: string (nullable = true)
    # |-- count: long (nullable = true)
    # |-- attribute3: array (nullable = true)
    # |    |-- element: struct (containsNull = true)
    # |    |    |-- key: string (nullable = true)
    # |    |    |-- key2: integer (nullable = true)
    
    df.show(1, False)
    #+----------+----------+-----+------------------------------------+
    #|date      |attribute2|count|attribute3                          |
    #+----------+----------+-----+------------------------------------+
    #|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
    #+----------+----------+-----+------------------------------------+
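
    A related usage note: with attribute3 as an array of structs, selecting a struct field straight through the array collects that field's values into a new array column. A small sketch based on the df above:

    # Field access across an array of structs yields an array of the field values
    df.select("attribute3.key2").show(truncate=False)
    #+---------+
    #|key2     |
    #+---------+
    #|[2, 2, 2]|
    #+---------+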
    