How to send one group at a time to the Spark executors?

温柔的废话 2021-01-17 04:07

I'm unable to send one group of the DataFrame at a time to the executors.

I have the data below in the company_model_vals_df DataFrame.

 -----         


        
2 Answers
  • 2021-01-17 04:49

    If I understand your question correctly, you want to manipulate the data separately for each combination of "model_id", "fiscal_quarter" and "fiscal_year".

    If that's correct, you would do it with a groupBy(), for example:

    company_model_vals_df.groupBy("model_id","fiscal_quarter","fiscal_year").agg(avg($"col1") as "average")
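
    A minimal, self-contained sketch of that aggregation, in case it helps to try it locally. The sample rows, the col1 column, the local SparkSession and the GroupAverages / averagePerGroup names are assumptions made up for illustration, since the question's real data is not shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object GroupAverages {
  // Average of col1 per (model_id, fiscal_quarter, fiscal_year) group.
  // The grouping and aggregation run on the executors; only the result is returned.
  def averagePerGroup(df: DataFrame): DataFrame =
    df.groupBy("model_id", "fiscal_quarter", "fiscal_year")
      .agg(avg(col("col1")).as("average"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupby-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: two rows share a key, one row has its own key.
    val company_model_vals_df = Seq(
      (1, 1, 2020, 10.0),
      (1, 1, 2020, 20.0),
      (2, 3, 2021, 5.0)
    ).toDF("model_id", "fiscal_quarter", "fiscal_year", "col1")

    averagePerGroup(company_model_vals_df).show()
    spark.stop()
  }
}
```

    With this sample data the result has one row per key combination, so two rows in total.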
    

    If what you're looking for is to write each logical group into a separate folder, you can do that by writing:

    company_model_vals_df.write.partitionBy("model_id","fiscal_quarter","fiscal_year").parquet("path/to/save")
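
    A small sketch of what that write produces, under some assumptions: the sample rows, the temporary output directory (used here in place of "path/to/save") and the PartitionedWrite / writeByGroup names are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionedWrite {
  // Writes df as Parquet, one nested folder per key combination.
  def writeByGroup(df: DataFrame, outputDir: String): Unit =
    df.write
      .mode("overwrite") // the temporary directory already exists
      .partitionBy("model_id", "fiscal_quarter", "fiscal_year")
      .parquet(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partitionBy-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the real schema is not shown in the question.
    val company_model_vals_df = Seq(
      (1, 1, 2020, 10.0),
      (2, 3, 2021, 5.0)
    ).toDF("model_id", "fiscal_quarter", "fiscal_year", "col1")

    val outputDir = Files.createTempDirectory("company_model_vals").toString
    writeByGroup(company_model_vals_df, outputDir)

    // Folders follow the partition column order, for example:
    //   <outputDir>/model_id=1/fiscal_quarter=1/fiscal_year=2020/
    // Filtering on the partition columns reads a single group back.
    val oneGroup = spark.read.parquet(outputDir)
      .filter($"model_id" === 1 && $"fiscal_quarter" === 1 && $"fiscal_year" === 2020)
    oneGroup.show()

    spark.stop()
  }
}
```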
    
  • 2021-01-17 04:56

    There are a few options here:

    • Fork the dataset into several datasets and work on each one individually. First collect the distinct key combinations:
    val dist_company_model_vals_list = company_model_vals_df
      .select("model_id", "fiscal_quarter", "fiscal_year").distinct().collectAsList()
    

    Then filter company_model_vals_df by each entry of dist_company_model_vals_list. This yields one DataFrame per key combination, and you can work on each one independently:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Build one filtered DataFrame per distinct (model_id, fiscal_quarter, fiscal_year) key.
    def rowList: Seq[DataFrame] =
      (0 until dist_company_model_vals_list.size()).map { i =>
        val row = dist_company_model_vals_list.get(i)
        // row.get(n) returns the key value whatever the column type is
        val filterCol = col("model_id") === row.get(0) &&
          col("fiscal_quarter") === row.get(1) &&
          col("fiscal_year") === row.get(2)
        company_model_vals_df.filter(filterCol)
      }
    
    • If your objective is only to write the data, use the partitionBy("model_id", "fiscal_quarter", "fiscal_year") method on DataFrameWriter to write each group separately.