Adding a new column using withColumn from a lookup table dynamically

予麋鹿 2021-01-29 01:50

I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to dynamically add a column from a lookup table.

I have a data frame with columns A, B, C, ..., X,

2 answers
  • 2021-01-29 02:20

    In Scala, I would do it like this:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, when}

    // Your substitute map: column name -> column that supplies its value when null.
    // It is small, since it only holds column names.
    val substituteMapping: Map[String, String] = ???

    val df: DataFrame = ??? // your main dataframe

    val substitutedDf = substituteMapping.keys.foldLeft(df)((acc, k) => {
        // apply the appropriate casting here, as you did in your post
        acc.withColumn(k, when(col(k).isNull, col(substituteMapping(k))).otherwise(col(k)))
    })
    

    As far as I know, Java 8 has no foldLeft, but you can emulate it by iterating over substituteMapping and reassigning a variable on each step.
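
    The "reassign a variable while iterating" idea can be sketched in plain Java. This is a minimal sketch, not Spark code: the class name FoldLeftSketch and the helper foldLeft are hypothetical, and a String stands in for the Dataset<Row> so that the example runs without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

public class FoldLeftSketch {

    // Hypothetical helper: threads an accumulator through the items in
    // iteration order, which is what Scala's foldLeft does.
    public static <A, T> A foldLeft(Iterable<T> items, A initial, BiFunction<A, T, A> step) {
        A acc = initial;
        for (T item : items) {
            acc = step.apply(acc, item); // reassign the variable on every step
        }
        return acc;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the steps run in a known order.
        Map<String, String> substitutes = new LinkedHashMap<>();
        substitutes.put("A", "X");
        substitutes.put("B", "Y");

        // Assumption: a String stands in for the dataframe; each step appends a
        // description of the withColumn call it would make.
        String plan = foldLeft(substitutes.entrySet(), "df",
                (acc, e) -> acc + ".withColumn(" + e.getKey() + " <- " + e.getValue() + ")");

        System.out.println(plan); // prints df.withColumn(A <- X).withColumn(B <- Y)
    }
}
```

    With Spark, the accumulator would be the Dataset<Row> and the step would presumably be (acc, e) -> acc.withColumn(e.getKey(), when(col(e.getKey()).isNull(), col(e.getValue())).otherwise(col(e.getKey()))), with col and when statically imported from org.apache.spark.sql.functions.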

  • 2021-01-29 02:23

    With Java 8, you can use this Stream.reduce() overload:

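    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.when;

    final Dataset<Row> dataframe = ...;
    final Map<String, String> codeSubstitutes = ...;

    final Dataset<Row> afterSubstitutions = codeSubstitutes.entrySet().stream()
        .reduce(dataframe, (df, entry) ->
                // keep the original value unless it is null, then take the substitute column
                df.withColumn(entry.getKey(), when(col(entry.getKey()).isNull(), col(entry.getValue()))
                        .otherwise(col(entry.getKey()))),
                (left, right) -> { throw new IllegalStateException("Can't merge two dataframes. This stream should not be a parallel one!"); }
        );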
    

    The combiner (the last argument) is supposed to merge two dataframes processed in parallel (if the stream were a parallel() stream), but we simply don't allow that, since this logic only ever runs on a sequential() stream.
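
    To see why a throwing combiner is harmless here, the following plain-Java sketch runs the same three-argument reduce on a sequential stream. The class name SequentialReduceSketch and the method joinInOrder are hypothetical, and a String again stands in for the dataframe. It relies on the observed behaviour of OpenJDK, where a sequential stream applies only the accumulator; as far as I know the Javadoc does not state this as a hard guarantee.

```java
import java.util.Arrays;
import java.util.List;

public class SequentialReduceSketch {

    // Same idea as the combiner above: fail loudly if the stream ever tries
    // to merge two partial results.
    static <X> X throwingCombiner(X left, X right) {
        throw new IllegalStateException("Combining not allowed");
    }

    // Assumption: a String stands in for the dataframe, and each element
    // stands in for one substitution applied to it.
    public static String joinInOrder(List<String> columns) {
        return columns.stream() // streams from collections are sequential by default
                .reduce("df",
                        (acc, column) -> acc + "+" + column,
                        SequentialReduceSketch::throwingCombiner);
    }

    public static void main(String[] args) {
        System.out.println(joinInOrder(Arrays.asList("A", "B", "C"))); // prints df+A+B+C
    }
}
```

    Note that the starting value is not a true identity for the combiner, so this reduce is only meaningful sequentially, which is one more reason to fail fast rather than merge.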


    A more readable and maintainable version takes one extra step: extracting the logic above into dedicated methods, such as:

        // ...
        Dataset<Row> nullSafeDf = codeSubstitutes.entrySet().stream()
            .reduce(dataframe, this::replaceIfNull, this::throwingCombiner);
        // ...
    }
    
    
    private Dataset<Row> replaceIfNull(Dataset<Row> df, Map.Entry<String, String> substitution) {
        final String original = substitution.getKey();
        final String replacement = substitution.getValue();
        return df.withColumn(original, when(col(original).isNull(), col(replacement))
                .otherwise(col(original)));
    }
    
    private <X> X throwingCombiner(X left, X right) {
        throw new IllegalStateException("Combining not allowed");
    }
    