How to cast string to ArrayType of dictionary (JSON) in PySpark

失恋的感觉 2021-01-13 13:52

I am trying to cast a StringType column to an ArrayType of JSON objects for a DataFrame generated from a CSV file.

I am using PySpark on Spark 2.

The CSV file I am dealin

2 answers
  •  小鲜肉 (original poster)
    2021-01-13 14:32

    The answer by @Psidom does not work for me because I am using Spark 2.1.

    In my case, I had to slightly modify your attribute3 string to wrap it in a dictionary, so that the top-level JSON value is an object with a single "data" field rather than a bare array:

    import pyspark.sql.functions as f
    df2 = df.withColumn("attribute3", f.concat(f.lit('{"data": '), "attribute3", f.lit("}")))
    df2.select("attribute3").show(truncate=False)
    #+--------------------------------------------------------------------------------------+
    #|attribute3                                                                            |
    #+--------------------------------------------------------------------------------------+
    #|{"data": [{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]}|
    #+--------------------------------------------------------------------------------------+
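
    As a quick sanity check outside Spark, the wrapping step can be reproduced with Python's standard json module. This is only an illustrative sketch: the sample string is copied from the output above, and the variable names (raw, wrapped, top_before, top_after) are made up for the example. It shows that the original value is a top-level JSON array, while the wrapped value is a JSON object with a single "data" field, which is a shape a StructType schema can describe.

```python
import json

# Sample value copied from the attribute3 column shown above.
raw = '[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'

# Same concatenation as f.concat(f.lit('{"data": '), "attribute3", f.lit("}")).
wrapped = '{"data": ' + raw + '}'

top_before = json.loads(raw)      # top-level JSON array -> Python list
top_after = json.loads(wrapped)   # top-level JSON object -> Python dict

print(type(top_before).__name__, type(top_after).__name__)
print(wrapped)
```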
    

    Now I can define the schema as follows:

    from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

    schema = StructType(
        [
            StructField(
                "data",
                ArrayType(
                    StructType(
                        [
                            StructField("key", StringType()),
                            StructField("key2", IntegerType())
                        ]
                    )
                )
            )
        ]
    )
    

    Now parse the string with from_json and pull the array back out of the resulting struct with getItem("data"):

    df3 = df2.withColumn("attribute3", f.from_json("attribute3", schema).getItem("data"))
    df3.show(truncate=False)
    #+----------+----------+-----+---------------------------------+
    #|date      |attribute2|count|attribute3                       |
    #+----------+----------+-----+---------------------------------+
    #|2017-09-03|attribute1|2    |[[value,2], [value,2], [value,2]]|
    #+----------+----------+-----+---------------------------------+
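
    For intuition, here is a rough plain-Python analogue of what the from_json(...).getItem("data") expression does to a single row. It is not how Spark implements it; FIELDS and parse_attribute3 are hypothetical names used only for this sketch, and casting values to the schema's types is left out.

```python
import json

# Hypothetical stand-in for the schema: field names in the order declared above.
FIELDS = ("key", "key2")

def parse_attribute3(wrapped):
    """Parse the wrapped string and return the "data" array as a list of rows."""
    parsed = json.loads(wrapped)   # plays the role of from_json
    data = parsed.get("data")      # plays the role of getItem("data")
    if data is None:
        return None
    # Keep only the declared fields; a field missing from an element becomes None.
    return [tuple(item.get(name) for name in FIELDS) for item in data]

wrapped = '{"data": [{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]}'
rows = parse_attribute3(wrapped)
print(rows)
```

    Each element of the array becomes one row of (key, key2) values, which corresponds to the nested [value,2] entries in the attribute3 column above.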
    

    And the schema of the parsed column:

    df3.select("attribute3").printSchema()
    # root
    # |-- attribute3: array (nullable = true)
    # |    |-- element: struct (containsNull = true)
    # |    |    |-- key: string (nullable = true)
    # |    |    |-- key2: integer (nullable = true)
    
