Question
Can we put a default value in a field of a DataFrame while creating it? I am creating a Spark DataFrame from a List<Object[]> rows as follows:
List<org.apache.spark.sql.Row> sparkRows = rows.stream().map(RowFactory::create).collect(Collectors.toList());
Dataset<org.apache.spark.sql.Row> dataset = session.createDataFrame(sparkRows, schema);
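
For context, a minimal self-contained version of that setup looks roughly like this (the schema, column names, and sample data here are placeholders, not my actual ones):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession session = SparkSession.builder()
        .appName("default-values-question")
        .master("local[*]")
        .getOrCreate();

// Placeholder input rows; some cells are null and should ideally get a default.
List<Object[]> rows = Arrays.asList(
        new Object[]{"alice", 30},
        new Object[]{"bob", null});

// Placeholder schema matching the Object[] layout.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.IntegerType, true)));

List<Row> sparkRows = rows.stream()
        .map(RowFactory::create)
        .collect(Collectors.toList());
Dataset<Row> dataset = session.createDataFrame(sparkRows, schema);
dataset.show();
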
While looking for a way to do this, I found that org.apache.spark.sql.types.DataTypes can take an org.apache.spark.sql.types.Metadata object (for example in DataTypes.createStructField). The documentation does not specify the exact purpose of this class:
/**
* Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
* Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
* Array[Metadata]. JSON is used for serialization.
*
* The default constructor is private. User should use either [[MetadataBuilder]] or
* `Metadata.fromJson()` to create Metadata instances.
*
* @param map an immutable map that stores the data
*
* @since 1.3.0
*/
This class supports only a very limited set of data types, and there is no out-of-the-box API for using it to insert a default value during dataset creation.
Where does one use this metadata? Can someone share a real-life use case?
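
For reference, attaching metadata to a column goes through its StructField, roughly like below; the "default" key is one I made up, and as far as I understand Spark just carries custom keys along with the schema without acting on them (ML attributes and column comments are stored this way):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;

// Arbitrary key/value pairs attached to a column via its StructField.
// A custom key such as "default" travels with the schema but is never acted on by Spark.
Metadata meta = new MetadataBuilder()
        .putLong("default", 0L)                     // hypothetical key of my own
        .putString("comment", "age in years")
        .build();

StructField ageField =
        DataTypes.createStructField("age", DataTypes.IntegerType, true, meta);

// The metadata can be read back later from the field:
long defaultAge = ageField.metadata().getLong("default");
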
I know we can write our own map function around rows.stream().map(RowFactory::create) and put in default values there (see the sketch below). But is there any way to do this using Spark APIs?
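
The kind of manual workaround I mean would look roughly like this (the defaults array and its ordering are placeholders that must match the schema):

// Hypothetical per-column defaults, aligned with the column order of the schema.
Object[] defaults = {"unknown", 0};

List<Row> sparkRows = rows.stream()
        .map(values -> {
            Object[] filled = new Object[defaults.length];
            for (int i = 0; i < defaults.length; i++) {
                // keep the original value when present, otherwise fall back to the default
                filled[i] = (i < values.length && values[i] != null) ? values[i] : defaults[i];
            }
            return RowFactory.create(filled);
        })
        .collect(Collectors.toList());
Dataset<Row> dataset = session.createDataFrame(sparkRows, schema);
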
Edit: I am expecting something similar to Oracle's DEFAULT functionality: we define a default value for each column according to its data type, and while creating the DataFrame, if a value is missing or null, this default value is used.
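
For comparison, the closest Spark API I am aware of is Dataset.na().fill(), but it replaces nulls per column after the DataFrame already exists rather than applying a DEFAULT at creation time (column names below are placeholders):

import java.util.HashMap;
import java.util.Map;

// Per-column replacement values for nulls; applied after the Dataset exists,
// so this is a post-hoc fill rather than a creation-time DEFAULT.
Map<String, Object> defaults = new HashMap<>();
defaults.put("name", "unknown");
defaults.put("age", 0);

Dataset<Row> withDefaults = dataset.na().fill(defaults);
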
Source: https://stackoverflow.com/questions/57358381/spark-create-dataframe-with-default-values