I am trying to cast a StringType column to an ArrayType of JSON for a DataFrame generated from a CSV file.
I am using PySpark on Spark 2.
The CSV file I am dealin
Use from_json with a schema that matches the actual data in the attribute3
column to convert the JSON string to an ArrayType.
Original data frame:
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
Create the schema:
schema = ArrayType(
    StructType([
        StructField("key", StringType()),
        StructField("key2", IntegerType()),
    ])
)
Apply from_json to the attribute3 column:
df = df.withColumn("attribute3", from_json(df.attribute3, schema))
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)
#|date      |attribute2|count|attribute3                          |
#|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
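If attribute3 comes back as null after from_json, the schema probably does not match the strings in the column. One way to check is to take a single raw value and compare it with the expected fields using the standard json module, outside Spark. This is only a rough sketch: the sample string is a made-up value consistent with the output above (the real CSV contents are not shown), and matches_schema is a hypothetical helper, not part of PySpark.

```python
import json

# Hypothetical sample of one attribute3 cell; the real CSV contents are not shown in the post.
sample = '[{"key": "value", "key2": 2}, {"key": "value", "key2": 2}, {"key": "value", "key2": 2}]'

# Python types expected per field, mirroring StringType and IntegerType in the schema above.
expected_fields = {"key": str, "key2": int}


def matches_schema(raw, fields):
    """Return True if raw is a JSON array of objects whose listed fields have the expected types."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        # Not valid JSON at all.
        return False
    if not isinstance(parsed, list):
        # The schema is an ArrayType, so the top level must be a JSON array.
        return False
    for item in parsed:
        if not isinstance(item, dict):
            return False
        for name, expected_type in fields.items():
            value = item.get(name)
            # A missing or null field is accepted because the struct fields are nullable.
            if value is not None and type(value) is not expected_type:
                return False
    return True


print(matches_schema(sample, expected_fields))  # prints True
```

A value that fails this check (for example a single JSON object instead of an array, or "key2" stored as a quoted string) suggests the schema should be adjusted before calling from_json.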