Adding a new column dynamically from a lookup table using withColumn

予麋鹿 2021-01-29 01:50

I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to dynamically add a column from a lookup table.

I have a data frame with columns A, B, C, ..., X,

2 answers
  •  余生分开走
    2021-01-29 02:23

    With Java 8, you can use the three-argument Stream.reduce() overload (identity, accumulator, combiner):

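    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.when;

    final Dataset<Row> dataframe = ...;
    final Map<String, String> codeSubstitutes = ...;
    
    // Fold each (column -> fallback column) entry into the dataframe, one withColumn call per entry.
    final Dataset<Row> afterSubstitutions = codeSubstitutes.entrySet().stream()
        .reduce(dataframe, (df, entry) ->
                // Keep the original value unless it is null; in that case take the fallback column.
                df.withColumn(entry.getKey(), when(col(entry.getKey()).isNull(), col(entry.getValue()))
                        .otherwise(col(entry.getKey()))),
                (left, right) -> { throw new IllegalStateException("Can't merge two dataframes. This stream should not be a parallel one!"); }
        );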
    

    The combiner (last argument) is supposed to merge two dataframes processed in parallel (if the stream was a parallel() stream), but we'll simply not allow that, as we're only invoking this logic on a sequential() stream.
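
    The point about the combiner can be checked without Spark. The following is a minimal sketch using only the JDK: it folds a map into a string with the same reduce() overload and a combiner that always throws. The class name SequentialReduceSketch and the fold helper are hypothetical, introduced only for illustration. On a sequential stream, OpenJDK's implementation is not expected to call the combiner, so the fold completes normally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SequentialReduceSketch {

    // Combiner that refuses to merge partial results. It is only needed
    // when a parallel stream has to join results computed on different threads.
    static <X> X throwingCombiner(X left, X right) {
        throw new IllegalStateException("Combining not allowed");
    }

    // Folds every entry of `substitutes` into `seed`, in iteration order.
    // The String accumulator stands in for the dataframe of the answer.
    static String fold(String seed, Map<String, String> substitutes) {
        return substitutes.entrySet().stream()
                .sequential()
                .reduce(seed,
                        (acc, entry) -> acc + "[" + entry.getKey() + "->" + entry.getValue() + "]",
                        SequentialReduceSketch::throwingCombiner);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the fold order is predictable.
        Map<String, String> substitutes = new LinkedHashMap<>();
        substitutes.put("A", "X");
        substitutes.put("B", "Y");
        System.out.println(fold("df", substitutes)); // prints df[A->X][B->Y]
    }
}
```

    If .sequential() were replaced with .parallel(), the combiner could be invoked and the exception would surface, which is the intended fail-fast behaviour.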


    A more readable and maintainable version extracts the above logic into dedicated methods, such as:

        // ...
        Dataset<Row> nullSafeDf = codeSubstitutes.entrySet().stream()
            .reduce(dataframe, this::replaceIfNull, this::throwingCombiner);
        // ...
    }
    
    
    // Replaces nulls in the `original` column with the value of the `replacement` column.
    private Dataset<Row> replaceIfNull(Dataset<Row> df, Map.Entry<String, String> substitution) {
        final String original = substitution.getKey();
        final String replacement = substitution.getValue();
        return df.withColumn(original, when(col(original).isNull(), col(replacement))
                .otherwise(col(original)));
    }
    
    // Not expected to be called on a sequential stream; fails loudly if the stream is ever made parallel.
    private <X> X throwingCombiner(X left, X right) {
        throw new IllegalStateException("Combining not allowed");
    }
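
    To exercise the fold logic without a Spark session, the same shape can be modelled in plain Java. This is only a sketch under stated assumptions: a single Map<String, String> stands in for one row (column name to value), and the class NullFallbackSketch and the helper applyAll are hypothetical names that do not appear in the answer. It is not Spark code and says nothing about Spark's execution.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class NullFallbackSketch {

    // Plain-Java stand-in for replaceIfNull: returns a new row in which the
    // `original` column takes the `replacement` column's value if it was null.
    // The input row is left untouched, like withColumn returning a new Dataset.
    static Map<String, String> replaceIfNull(Map<String, String> row,
                                             Map.Entry<String, String> substitution) {
        final String original = substitution.getKey();
        final String replacement = substitution.getValue();
        final Map<String, String> result = new HashMap<>(row);
        if (result.get(original) == null) {
            result.put(original, row.get(replacement));
        }
        return result;
    }

    static <X> X throwingCombiner(X left, X right) {
        throw new IllegalStateException("Combining not allowed");
    }

    // Applies every substitution in turn, mirroring the reduce() call above.
    static Map<String, String> applyAll(Map<String, String> row,
                                        Map<String, String> codeSubstitutes) {
        return codeSubstitutes.entrySet().stream()
                .reduce(row, NullFallbackSketch::replaceIfNull, NullFallbackSketch::throwingCombiner);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("A", null);
        row.put("B", "b");
        row.put("X", "x");
        row.put("Y", "y");

        Map<String, String> codeSubstitutes = new LinkedHashMap<>();
        codeSubstitutes.put("A", "X"); // A is null, so it takes X's value
        codeSubstitutes.put("B", "Y"); // B is non-null, so it is kept

        Map<String, String> fixed = applyAll(row, codeSubstitutes);
        System.out.println("A=" + fixed.get("A") + ", B=" + fixed.get("B"));
    }
}
```

    Each step sees the result of the previous one, so when substitutions chain (one fallback column is itself substituted), the iteration order of the map matters; a LinkedHashMap makes that order explicit.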
    
