Spark: migrate a SQL window function to RDD for better performance

Backend · unresolved · 2 answers · 1828 views
盖世英雄少女心 2021-01-01 04:54

A function should be executed for multiple columns in a DataFrame:

def handleBias(df: DataFrame, colName: String, target: String = target): DataFrame = {
    val w1 = Window.partitionBy(colName)
    ...
}
2 answers
  •  执笔经年
    2021-01-01 05:39

    Using aggregateByKey. A simple explanation of aggregateByKey can be found here. Basically, you use two functions: one that works inside a partition and one that works between partitions.

    You would aggregate by the first column and, inside the aggregation, build a Map from each value of the second column to the data being aggregated and collected (you could of course chain two aggregateByKey calls if you want). This still does not avoid making a separate pass over the data for each column you want to process (you could use aggregate instead of aggregateByKey to work on all the data and collect it into one map, but that would probably give you even worse performance). The result would then be one row per key; if you want to get back to the original records (as the window function does), you would need to either join this result with the original RDD or keep all the values internally and flatMap them out.
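    The approach above could be sketched roughly like this, meant for spark-shell. The data, the roles of the columns, and names such as `perKey` and `rejoined` are all hypothetical stand-ins, not taken from the original question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("aggByKeySketch").getOrCreate()
val sc = spark.sparkContext

// Stand-in records: (first column, (second column, target value))
val rdd = sc.parallelize(Seq(
  ("a", ("x", 1.0)), ("a", ("y", 2.0)), ("b", ("x", 4.0))
))

// seqOp folds each (category, value) pair into a per-partition Map;
// combOp merges the partial Maps, summing values for shared categories.
val perKey = rdd.aggregateByKey(Map.empty[String, Double])(
  (acc, cv) => acc + (cv._1 -> (acc.getOrElse(cv._1, 0.0) + cv._2)),
  (m1, m2) => m2.foldLeft(m1) { case (acc, (k, v)) =>
    acc + (k -> (acc.getOrElse(k, 0.0) + v))
  }
)

// perKey now has one row per key; joining back to the original RDD
// restores the per-record shape a window function would have produced.
val rejoined = rdd.join(perKey)
```

    Note that the final join is itself another shuffle, which is part of why this rarely beats the SQL version.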

    I do not believe this would give you any real performance improvement. You would be doing a lot of work to reimplement things SQL already does for you, and while doing so you would lose most of SQL's advantages (Catalyst optimization, Tungsten memory management, whole-stage code generation, etc.).

    Improving the SQL

    What I would do instead is try to improve the SQL itself. For example, the value the window function produces appears to be the same for all rows in a partition, so do you really need a window function? You can use a groupBy instead, and if you really need the value on every record, join the results back. This might perform better, as it does not necessarily mean shuffling everything twice at every step.
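    Assuming the windowed value really is constant per group, the groupBy-plus-join rewrite might look like this. The column names `col1` and `target` and the sum aggregate are placeholders, not from the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[2]").appName("groupBySketch").getOrCreate()
import spark.implicits._

// Hypothetical data: col1 is the partitioning column, target the value
val df = Seq(("a", 1.0), ("a", 2.0), ("b", 4.0)).toDF("col1", "target")

// Window version (for comparison): the same per-group sum on every row
//   df.withColumn("s", sum("target").over(Window.partitionBy("col1")))

// groupBy produces one row per group instead of one per record ...
val grouped = df.groupBy("col1").agg(sum("target").as("s"))

// ... and a join attaches that per-group value back to every record
val perRecord = df.join(grouped, "col1")
```

    The groupBy shuffles only the grouped aggregates, and Catalyst can choose a broadcast join for the small `grouped` side, which is where the potential saving comes from.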
