I'm unable to send each group of the dataframe to the executors one at a time.
I have data as below in the company_model_vals_df dataframe.
-----
If I understand your question correctly, you want to manipulate the data separately for each combination of "model_id", "fiscal_quarter", and "fiscal_year".
If that's correct, you can do it with a groupBy(), for example:
company_model_vals_df.groupBy("model_id","fiscal_quarter","fiscal_year").agg(avg($"col1") as "average")
If what you're looking for is to write each logical group into a separate folder, you can do that by writing:
company_model_vals_df.write.partitionBy("model_id","fiscal_quarter","fiscal_year").parquet("path/to/save")
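As a rough sketch of why partitionBy helps downstream (assuming a SparkSession named spark and the "path/to/save" path from above; the literal key values here are just placeholders), reading a single group back benefits from partition pruning, so Spark only scans the matching folder:

```scala
// Hypothetical read-back: filtering on the partition columns lets Spark
// prune to the matching model_id=.../fiscal_quarter=.../fiscal_year=... folders.
import spark.implicits._

val oneGroupDf = spark.read
  .parquet("path/to/save")
  .filter($"model_id" === 1 && $"fiscal_quarter" === 2 && $"fiscal_year" === 2019)
```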
-----
There are a few options here. First, collect the distinct group keys:
var dist_company_model_vals_list = company_model_vals_df
  .select("model_id","fiscal_quarter","fiscal_year").distinct().collect()
Then filter company_model_vals_df once per row of dist_company_model_vals_list. This yields several datasets that you can work on independently, like:
def rowList: Seq[DataFrame] = {
  import org.apache.spark.sql._
  // assumes import spark.implicits._ is in scope for the $"..." column syntax
  var dfList: Seq[DataFrame] = Seq()
  // Build one filtered DataFrame per distinct (model_id, fiscal_quarter, fiscal_year) key.
  for (row <- dist_company_model_vals_list) {
    val filterCol = $"model_id" === row.getInt(0) &&
      $"fiscal_quarter" === row.getInt(1) &&
      $"fiscal_year" === row.getInt(2)
    dfList :+= company_model_vals_df.filter(filterCol)
  }
  dfList
}
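A minimal usage sketch (assuming a working rowList as above; the output path and per-group naming are illustrative, not from the original post) could process each per-group DataFrame independently:

```scala
// Hypothetical usage: iterate over the per-group DataFrames and handle each one
// on its own, e.g. writing every group to its own folder.
val groupedDfs = rowList
groupedDfs.zipWithIndex.foreach { case (df, idx) =>
  // Any per-group logic goes here; writing to a group-specific path is one option.
  df.write.mode("overwrite").parquet(s"path/to/save/group_$idx")
}
```

Note that each filter triggers a separate scan of company_model_vals_df, so this approach is best suited to a small number of groups; for many groups, the partitionBy write below is cheaper.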
Alternatively, use the partitionBy("model_id","fiscal_quarter","fiscal_year") method on DataFrameWriter to write them out separately in a single pass.