How to send each group at a time to the spark executors?

温柔的废话 2021-01-17 04:07

I'm unable to send each group of a dataframe to an executor one at a time.

I have data as below in the company_model_vals_df dataframe:

    (sample data omitted)
2 Answers
  •  执笔经年
    2021-01-17 04:56

    There are a few options here:

    • You need to fork the dataset into several datasets and work on them individually, e.g.:
    var dist_company_model_vals_list = company_model_vals_df
      .select("model_id", "fiscal_quarter", "fiscal_year").distinct().collect()
    

    Then filter company_model_vals_df with the rows in dist_company_model_vals_list, which gives you several datasets that you can work on independently, like:

    def rowList = {
      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.col

      var dfList: Seq[DataFrame] = Seq()
      // one filtered dataframe per distinct (model_id, fiscal_quarter, fiscal_year)
      for (row <- dist_company_model_vals_list) {
        val filterCol = col("model_id") === row.getInt(0) &&
          col("fiscal_quarter") === row.getInt(1) &&
          col("fiscal_year") === row.getInt(2)
        val resultDf = company_model_vals_df.filter(filterCol)
        dfList = dfList :+ resultDf
      }
      dfList
    }
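    The split-then-filter idea above can be sketched with plain Scala collections, no Spark required. The case class, column values, and helper name below are hypothetical stand-ins for the question's dataframe:

    ```scala
    object GroupSplitSketch {
      // Stand-in for one row of company_model_vals_df (columns assumed from the question).
      case class Row(modelId: Int, fiscalQuarter: Int, fiscalYear: Int, amount: Double)

      def splitByKey(rows: Seq[Row]): Seq[Seq[Row]] = {
        // Distinct keys, analogous to select(...).distinct().collect()
        val keys = rows.map(r => (r.modelId, r.fiscalQuarter, r.fiscalYear)).distinct
        // One filtered "dataset" per key, analogous to df.filter(filterCol)
        keys.map { case (m, q, y) =>
          rows.filter(r => r.modelId == m && r.fiscalQuarter == q && r.fiscalYear == y)
        }
      }
    }
    ```

    Each element of the result corresponds to one group, which you could then process independently.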
    
    • If your objective is to write the data out, you can use the partitionBy("model_id","fiscal_quarter","fiscal_year") method on DataFrameWriter to write each group separately.
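    A minimal, self-contained sketch of that second option; the sample rows, column names, and output path are illustrative stand-ins, not the asker's real data:

    ```scala
    object PartitionBySketch {
      import org.apache.spark.sql.SparkSession

      // Writes a small sample dataframe with one output directory per
      // (model_id, fiscal_quarter, fiscal_year) group.
      def writePartitioned(spark: SparkSession, outPath: String): Unit = {
        import spark.implicits._
        val df = Seq(
          (1, 2, 2020, 10.5),
          (1, 2, 2020, 11.0),
          (2, 3, 2021, 99.9)
        ).toDF("model_id", "fiscal_quarter", "fiscal_year", "amount")

        df.write
          .partitionBy("model_id", "fiscal_quarter", "fiscal_year")
          .mode("overwrite")
          .parquet(outPath)
        // Layout: outPath/model_id=1/fiscal_quarter=2/fiscal_year=2020/part-*.parquet
      }
    }
    ```

    Each group lands in its own Hive-style partition directory, so downstream jobs can read a single group by path or by a partition filter.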
