I'm looking for a generic solution to extract all the JSON fields as columns from a JSON string column.
df = spark.read.load(path)
df.show()
Assuming json_data is of type map (if it's still a JSON string, you can always convert it to a map first; see the from_json sketch after the example), you can use getItem:
df = spark.createDataFrame([
    [1, {"name": "abc", "depts": ["dep01", "dep02"]}],
    [2, {"name": "xyz", "depts": ["dep03"], "sal": 100}]
], ['id', 'json_data'])
df.select(
    df.id,
    df.json_data.getItem('name').alias('name'),
    df.json_data.getItem('depts').alias('depts'),
    df.json_data.getItem('sal').alias('sal')
).show()
+---+----+--------------+----+
| id|name| depts| sal|
+---+----+--------------+----+
| 1| abc|[dep01, dep02]|null|
| 2| xyz| [dep03]| 100|
+---+----+--------------+----+
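If your json_data column is still a JSON string rather than a map, here is a minimal sketch of the conversion using from_json. The map<string,string> schema is an assumption: it works when you just want everything as strings and are willing to cast afterwards.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# Hypothetical input: a JSON string column instead of a map
df_str = spark.createDataFrame(
    [[1, '{"name": "abc", "depts": ["dep01", "dep02"]}']],
    ['id', 'json_data']
)

# Parse the JSON string into a map<string,string>;
# every value is read back as a string, so cast afterwards as needed
df_map = df_str.withColumn(
    'json_data',
    F.from_json('json_data', MapType(StringType(), StringType()))
)

From there, the getItem calls above apply unchanged.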
A more dynamic way to extract columns:
cols = ['name', 'depts', 'sal']
df.select(df.id, *(df.json_data.getItem(col).alias(col) for col in cols)).show()
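If you'd rather not hard-code the key list either, one option is to collect the distinct map keys first and reuse them. This is a sketch assuming Spark 2.3+ (where map_keys is available); note that collecting the keys runs an extra job over the data.

from pyspark.sql import functions as F

# Gather the union of all keys that appear in json_data across rows
keys_df = df.select(F.explode(F.map_keys('json_data')).alias('key')).distinct()
cols = sorted(r['key'] for r in keys_df.collect())

df.select(df.id, *(df.json_data.getItem(c).alias(c) for c in cols)).show()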