Create new Dataframe with empty/null field values

I am creating a new DataFrame from an existing DataFrame, but need to add a new column ("field1" in the code below) to this new DataFrame. How do I do so? Working sample code example w

2 answers

北恋

2020-11-29 02:24

It is possible to use lit(null):

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))

One problem here is that the column type is null:

scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

and it is not retained by the csv writer. If it is a hard requirement, you can cast the column to a specific type (let's say String), with either a DataType

import org.apache.spark.sql.types.StringType

df.withColumn("foobar", lit(null).cast(StringType))

or a string description

df.withColumn("foobar", lit(null).cast("string"))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)
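
Putting the cast approach together for the question's "field1" column, a minimal sketch might look like the following. It assumes a local SparkSession, that "field1" should be a string, and a second column "field2" that is purely hypothetical and only there to show the string-description form of the cast.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{IntegerType, StringType}

// Assumption: a local session for experimentation; in spark-shell the
// already running session is returned by getOrCreate().
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("null-column-sketch")
  .getOrCreate()
import spark.implicits._

// Existing DataFrame, mirroring the example above.
val df = Seq((1, "foo"), (2, "bar")).toDF("foo", "bar")

// New DataFrame with an all-null column; StringType is an assumed type for "field1".
val newDf = df.withColumn("field1", lit(null).cast(StringType))

// Hypothetical second column, typed through a string description instead of a DataType.
val newDf2 = newDf.withColumn("field2", lit(null).cast("int"))

newDf2.printSchema()
```

Because both columns carry a concrete type rather than null, the resulting schema should be usable by writers that reject the null type.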

A Python equivalent can be found here: Add an empty column to spark DataFrame

无人及你

2020-11-29 02:46
Just to extend the perfect answer provided by @zero323, here's a solution which can be used starting from Spark 2.2.0.
```
import org.apache.spark.sql.functions.typedLit

df.withColumn("foobar", typedLit[Option[String]](None)).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)
```
It's similar to the 3rd solution, but without using any UDF.
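
The same idea extends to other types by changing the type parameter. The sketch below assumes Spark 2.2.0 or later and a local SparkSession; the column names "note" and "score" are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

// Assumption: a local session for experimentation; in spark-shell the
// already running session is returned by getOrCreate().
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("typedlit-null-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "foo"), (2, "bar")).toDF("foo", "bar")

// Each Option[T] type parameter determines the column type; None becomes null.
val extended = df
  .withColumn("note", typedLit[Option[String]](None))
  .withColumn("score", typedLit[Option[Double]](None))

extended.printSchema()
```

Here the column type comes from the Scala type parameter, so no separate cast is needed.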