How to cast string to ArrayType of dictionary (JSON) in PySpark

失恋的感觉 2021-01-13 13:52

I am trying to cast a StringType column to an ArrayType of JSON objects in a DataFrame generated from a CSV file.

Using PySpark on Spark 2.

The CSV file I am dealin

2 answers
  •  醉梦人生
    2021-01-13 14:44

    Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to an ArrayType:

    Original data frame:

    df.printSchema()
    #root
    # |-- date: string (nullable = true)
    # |-- attribute2: string (nullable = true)
    # |-- count: long (nullable = true)
    # |-- attribute3: string (nullable = true)
    
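    from pyspark.sql.functions import from_json
    # Import only the types the schema needs instead of a wildcard import
    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)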
    

    Create the schema:

    schema = ArrayType(
        StructType([StructField("key", StringType()), 
                    StructField("key2", IntegerType())]))
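
Because the schema has to match what attribute3 really contains, it can help to look at one raw value before writing it. The following is a minimal sketch using only the standard library. The sample string is an assumption inferred from the df.show() output in this answer, and describe_shape is a hypothetical helper, not part of PySpark.

```python
import json

# Assumed raw attribute3 value, inferred from the df.show() output in this answer
sample = '[{"key": "value", "key2": 2}, {"key": "value", "key2": 2}, {"key": "value", "key2": 2}]'


def describe_shape(raw):
    """Return the top-level JSON type name and, for an array of objects,
    a mapping of field name -> sorted list of Python type names seen."""
    parsed = json.loads(raw)
    if not isinstance(parsed, list):
        return type(parsed).__name__, {}
    fields = {}
    for item in parsed:
        if isinstance(item, dict):
            for name, value in item.items():
                fields.setdefault(name, set()).add(type(value).__name__)
    return "list", {name: sorted(types) for name, types in fields.items()}


top_level, fields = describe_shape(sample)
print(top_level, fields)
```

A top-level list of objects corresponds to ArrayType(StructType(...)), and each field's observed type suggests the matching Spark type (str to StringType, int to IntegerType).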
    

    Use from_json:

    df = df.withColumn("attribute3", from_json(df.attribute3, schema))
    
    df.printSchema()
    #root
    # |-- date: string (nullable = true)
    # |-- attribute2: string (nullable = true)
    # |-- count: long (nullable = true)
    # |-- attribute3: array (nullable = true)
    # |    |-- element: struct (containsNull = true)
    # |    |    |-- key: string (nullable = true)
    # |    |    |-- key2: integer (nullable = true)
    
    df.show(1, False)
    #+----------+----------+-----+------------------------------------+
    #|date      |attribute2|count|attribute3                          |
    #+----------+----------+-----+------------------------------------+
    #|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
    #+----------+----------+-----+------------------------------------+
    
