How to handle this use-case (running-window data) in Spark

Submitted by 一曲冷凌霜 on 2020-04-18 06:10:43

Question


I am using Spark SQL 2.4.1 with Java 1.8.

I have source data as below:

  import spark.implicits._   // for .toDF; pre-imported in spark-shell

  val df_data = Seq(
  ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
  ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
  ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),

  ("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
  ("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),

  ("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
  ("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
  ("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),

  ("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
  ("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),

  ("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
  ("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
  ("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),

  ("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
  ("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
  ("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),

  ("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
  ("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
  ("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),

  ("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
  ("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
  ("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),

  ("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
  ("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
  ("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
  ).toDF("industry_id","industry_name","country","state","revenue","generated_date");

Query:

val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
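
On the sample data above this should yield eleven distinct dates, from "2020-12-01" down to "2016-03-01"; descending string order coincides with chronological order here because the dates are yyyy-MM-dd strings:

  distinct_gen_date.show(false)
  // 2020-12-01, 2020-06-01, 2020-03-01, 2019-09-01, 2019-06-01, 2019-03-01,
  // 2018-09-01, 2018-03-01, 2017-06-01, 2017-03-01, 2016-03-01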

For each "generated_date" in distinct_gen_date, I need to get all unique industry_ids from the preceding 6 months of data:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val ws = Window.partitionBy(col("industry_id")).orderBy(desc("generated_date"));

val newDf = df_data
                .withColumn("rank", rank().over(ws))    // rank each industry's rows, newest first
                .where(col("rank").equalTo(lit(1)))     // keeps each industry's single latest row overall,
                //.drop(col("rank"))                    // not one row per 6-month window
                .select("*");

How do I get a moving aggregate (over the unique industry_ids in each 6-month window) for each distinct "generated_date"? How can this moving aggregation be achieved?

More details:

Yes, in the given sample data, assume the dates run from "2016-03-01" to "2020-03-01". If some industry_x is not present on "2020-03-01", I need to check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01" and "2019-09-01" sequentially; wherever it is first found, that rank-1 row is taken into consideration in the data set used to calculate the "2020-03-01" figures. Then we move on to the next distinct "generated_date", e.g. "2020-12-01": for each distinct date, go back 6 months, get the unique industries, and pick the rank-1 data; that becomes the data set for "2020-12-01". Then pick another distinct "generated_date" and do the same, and so on; the data set keeps changing. Using a for loop I can do this, but it gives no parallelism. How can I build the distinct data set for each distinct "generated_date" in parallel?
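
A minimal sketch of one way to avoid the driver-side loop (my own assumption-laden illustration, not a confirmed answer: the names dates, joined, ws6, latestPerWindow and movingAgg are hypothetical, and I read "6 months data" as each industry's most recent row within the six months ending at each distinct "generated_date"): join every distinct date against the rows that fall inside its 6-month window, so Spark computes all windows in a single parallel job.

  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.expressions.Window

  // all distinct window end dates, renamed to avoid a column clash after the join
  val dates = df_data.select(col("generated_date").as("window_date")).distinct

  // pair each window end date with every source row from the 6 months before it
  val joined = df_data.join(
    dates,
    to_date(col("generated_date")) <= to_date(col("window_date")) &&
    to_date(col("generated_date")) >= add_months(to_date(col("window_date")), -6)
  )

  // within each (window_date, industry_id) pair keep only the most recent row;
  // this implements the month-by-month fallback described above
  val ws6 = Window.partitionBy("window_date", "industry_id").orderBy(desc("generated_date"))

  val latestPerWindow = joined
    .withColumn("rank", rank().over(ws6))
    .where(col("rank") === 1)
    .drop("rank")

  // moving aggregate per distinct generated_date, e.g. industry count and total revenue
  val movingAgg = latestPerWindow
    .groupBy("window_date")
    .agg(countDistinct("industry_id").as("unique_industries"),
         sum("revenue").as("total_revenue"))
    .orderBy(desc("window_date"))

Because every window lives in one DataFrame plan, Spark parallelizes across all distinct dates at once instead of looping; the trade-off is that each source row is duplicated into every window it belongs to, which is usually acceptable for monthly data.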

Source: https://stackoverflow.com/questions/61237947/how-to-do-handle-this-use-case-running-window-data-in-spark
