I have a complex DataFrame structure and would like to null a column easily. I have created implicit classes that wire up functionality and make it easy to address the 2D DataFrame structure.
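For illustration, a minimal sketch of what such an implicit enrichment might look like for a top-level column (the RichDataFrame and nullify names are hypothetical); nulling a field nested inside a struct or array is the harder case the answers below address:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object DataFrameOps {
  implicit class RichDataFrame(df: DataFrame) {
    // Replace a top-level column with typed nulls, keeping its position in the schema
    def nullify(name: String): DataFrame =
      df.withColumn(name, lit(null).cast(df.schema(name).dataType))
  }
}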
I ran into the same issue and, assuming you don't need the result to have any new fields or fields with different types, here is a solution that can do this without having to redefine the whole struct: Change value of nested column in DataFrame.
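If you are on Spark 3.1 or later, a sketch along the same lines can null a field nested in an array of structs without redefining the struct, using functions.transform together with Column.withField (the data and value names below are assumed from the schema shown further down):

import org.apache.spark.sql.functions.{col, lit, transform}

// Rebuild each element of the `data` array, overwriting its `value` field
// with a typed null; all other fields are left untouched.
val nulled = df.withColumn(
  "data",
  transform(col("data"), elem =>
    elem.withField("value", lit(null).cast("map<string,string>")))
)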
Since Spark 1.6, you can use case classes to map your DataFrames (typed as Datasets). You can then map your data and transform it to the new schema you want. For example:
// Requires the SparkSession's implicits for .as[...] and .toDF()
import spark.implicits._

case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])

case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

// Map each row into the nullable shape, setting `value` to null
val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()
The resulting schema of nullableDF will be:
root
|-- name: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
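A minimal end-to-end sketch to verify this, assuming a local SparkSession and hypothetical sample data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-nested-field")
  .getOrCreate()
import spark.implicits._

// Hypothetical input rows matching the case classes above
val df = Seq(Root("r1", Seq(Data("d1", Map("k" -> "v"))))).toDF()

val nullableDF = df.as[Root].map { root =>
  NullableRoot(root.name, root.data.map(d => NullableData(d.name, null, d.values)))
}.toDF()

nullableDF.printSchema() // prints the schema shown above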