I am using spark-sql-2.4.1v with Java 8. I have a scenario where I need to dynamically add a column from a lookup table.
I have a data frame with columns A, B, C, ..., X.
In Scala, I would do it like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

val substituteMapping: Map[String, String] = ??? // your substitute map; it is small, as it only maps columns to their null substitutes
val df: DataFrame = ??? // your main dataframe
val substitutedDf = substituteMapping.keys.foldLeft(df)((acc, k) => {
  // apply the appropriate casting here, as you did in your post
  acc.withColumn(k, when(col(k).isNull, col(substituteMapping(k))).otherwise(col(k)))
})
I think foldLeft is not there in Java 8, but you can emulate it by reassigning a variable repeatedly while iterating over substituteMapping.
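To make the "reassign a variable in a loop" idea concrete, here is a minimal sketch in plain Java. It runs without Spark, so a Map<String, String> stands in for a single row of the dataframe, and the names FoldLeftSketch, foldLeft and replaceIfNull are made up for illustration. With a real Dataset<Row>, the step function would call df.withColumn(...) instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper class; none of these names come from Spark or from the post.
final class FoldLeftSketch {

    // Emulates Scala's foldLeft: start from `zero` and feed every item through `step`,
    // reassigning the accumulator on each iteration.
    static <T, R> R foldLeft(Iterable<T> items, R zero, BiFunction<R, T, R> step) {
        R acc = zero;
        for (T item : items) {
            acc = step.apply(acc, item);
        }
        return acc;
    }

    // Stand-in (assumption) for
    //   df.withColumn(column, when(col(column).isNull(), col(substitute)).otherwise(col(column)))
    // Returns a copy of the row; the input row is left untouched, like an immutable dataframe.
    static Map<String, String> replaceIfNull(Map<String, String> row, String column, String substitute) {
        Map<String, String> copy = new LinkedHashMap<>(row);
        if (copy.get(column) == null) {
            copy.put(column, copy.get(substitute));
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("A", null);
        row.put("B", "b");
        row.put("X", "x");

        Map<String, String> substituteMapping = new LinkedHashMap<>();
        substituteMapping.put("A", "X"); // if A is null, take the value of X

        Map<String, String> result = foldLeft(
                substituteMapping.entrySet(),
                row,
                (acc, e) -> replaceIfNull(acc, e.getKey(), e.getValue()));

        System.out.println(result); // prints {A=x, B=b, X=x}
    }
}
```

The loop inside foldLeft is all that is needed; wrapping it in a generic method just keeps the call site close to the Scala version.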
With Java 8, you can use this Stream.reduce() overload:
// assumes static imports of org.apache.spark.sql.functions.col and org.apache.spark.sql.functions.when
final Dataset<Row> dataframe = ...;
final Map<String, String> codeSubstitutes = ...;
final Dataset<Row> afterSubstitutions = codeSubstitutes.entrySet().stream()
    .reduce(dataframe, (df, entry) ->
            // replace the column with col(entry.getValue()) when it is null, otherwise keep it
            df.withColumn(entry.getKey(), when(col(entry.getKey()).isNull(), col(entry.getValue()))
                .otherwise(col(entry.getKey()))),
        (left, right) -> { throw new IllegalStateException("Can't merge two dataframes. This stream should not be a parallel one!"); }
    );
The combiner (last argument) is supposed to merge two dataframes processed in parallel (if the stream was a parallel() stream), but we'll simply not allow that, as we're only invoking this logic on a sequential() stream.
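If you want to convince yourself that the throwing combiner is harmless on a sequential stream, here is a small sketch that runs without Spark. A String recording the order of the calls stands in for the Dataset<Row>, and the names SequentialReduceSketch and applyAll are hypothetical. As far as I know, the OpenJDK implementation does not invoke the combiner for a sequential reduction, though the Javadoc does not spell that out as a guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; the String accumulator is a stand-in for Dataset<Row>.
final class SequentialReduceSketch {

    // Folds every substitution into the accumulator, in encounter order.
    static String applyAll(Map<String, String> substitutes) {
        return substitutes.entrySet().stream()
                .sequential() // explicit here; Collection.stream() is already sequential
                .reduce(
                        "df",
                        (acc, e) -> acc + ".withColumn(" + e.getKey() + " <- " + e.getValue() + ")",
                        (left, right) -> {
                            // Only needed for parallel streams, which this sketch does not support.
                            throw new IllegalStateException("Combining not allowed");
                        });
    }

    public static void main(String[] args) {
        Map<String, String> substitutes = new LinkedHashMap<>();
        substitutes.put("A", "X");
        substitutes.put("B", "Y");

        // prints df.withColumn(A <- X).withColumn(B <- Y)
        System.out.println(applyAll(substitutes));
    }
}
```

A LinkedHashMap is used so the order of the substitutions is predictable; with a plain HashMap the result is the same set of calls, but in an unspecified order.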
A more readable/maintainable version takes one extra step, extracting the above logic into dedicated methods, such as:
    // ...
    Dataset<Row> nullSafeDf = codeSubstitutes.entrySet().stream()
        .reduce(dataframe, this::replaceIfNull, this::throwingCombiner);
    // ...
}

private Dataset<Row> replaceIfNull(Dataset<Row> df, Map.Entry<String, String> substitution) {
    final String original = substitution.getKey();
    final String replacement = substitution.getValue();
    return df.withColumn(original, when(col(original).isNull(), col(replacement))
        .otherwise(col(original)));
}

private <X> X throwingCombiner(X left, X right) {
    throw new IllegalStateException("Combining not allowed");
}