I am creating a new Dataframe from an existing dataframe, but need to add new column ("field1" in below code) in this new DF. How do I do so? Working sample code example will be appreciated.
val edwDf = omniDataFrame .withColumn("field1", callUDF((value: String) => None)) .withColumn("field2", callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df")))) edwDf .select("field1", "field2") .save("odsoutdatafldr", "com.databricks.spark.csv");
It is possible to use lit(null)
:
import org.apache.spark.sql.functions.{lit, udf} case class Record(foo: Int, bar: String) val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF val dfWithFoobar = df.withColumn("foobar", lit(null: String))
One problem here is that the column type is null
:
scala> dfWithFoobar.printSchema root |-- foo: integer (nullable = false) |-- bar: string (nullable = true) |-- foobar: null (nullable = true)
and it is not retained by the csv
writer. If it is a hard requirement you can cast column to the specific type (lets say String), with either DataType
import org.apache.spark.sql.types.StringType df.withColumn("foobar", lit(null).cast(StringType))
or string description
df.withColumn("foobar", lit(null).cast("string"))
or use an UDF like this:
val getNull = udf(() => None: Option[String]) // Or some other type df.withColumn("foobar", getNull()).printSchema root |-- foo: integer (nullable = false) |-- bar: string (nullable = true) |-- foobar: string (nullable = true)