pyspark-sql

How to calculate rolling sum with varying window sizes in PySpark

旧时模样 submitted on 2020-12-29 04:42:31
Question: I have a Spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Prediction over a window of the next N values?

Input Data

+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date       | Prediction | N |
+-----------+---------+------------+------------+---+
| 1         | 100     | 2019-07-01 | 0.92       | 2 |
| 1         | 100     | 2019-07-02 | 0.62       | 2 |
| 1         | 100     | 2019-07-03 | 0.89       | 2 |
| 1         | 100     | 2019-07-04 | …
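The question is truncated above, so the expected output is unknown; still, a common approach on Spark 2.4+ is to collect the Prediction values from the current row onward per (ProductId, StoreId) and sum the first N of them with the slice and aggregate higher-order functions. A minimal sketch under that assumption (it counts the current row as part of the "next N"; use rowsBetween(1, Window.unboundedFollowing) instead if "next" should exclude it):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 100, "2019-07-01", 0.92, 2),
     (1, 100, "2019-07-02", 0.62, 2),
     (1, 100, "2019-07-03", 0.89, 2)],
    ["ProductId", "StoreId", "Date", "Prediction", "N"],
)

# Every Prediction from the current row to the end of the partition.
w = (Window.partitionBy("ProductId", "StoreId")
           .orderBy("Date")
           .rowsBetween(Window.currentRow, Window.unboundedFollowing))

result = (
    df.withColumn("future", F.collect_list("Prediction").over(w))
      # Sum only the first N collected values; N may differ per row.
      .withColumn("RollingSum",
                  F.expr("aggregate(slice(future, 1, N), cast(0 as double), (acc, x) -> acc + x)"))
      .drop("future")
)
result.show()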

Spark 'limit' does not run in parallel?

我的未来我决定 submitted on 2020-12-06 15:47:10
Question: I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, and indeed at this stage there is only one task running in the cluster. This of course affects performance dramatically (removing the limit removes the single-task bottleneck, but lengthens the join, since it then works on a much larger dataset). Is limit truly not parallelizable? And if so, is there a workaround for this? I am using Spark on
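The question is cut off, but on the general point: limit() does funnel all surviving rows through a single partition (hence the ExchangeSingle and the lone task), so that stage is inherently serial. A hedged sketch of two common workarounds, using toy dataframes in place of the asker's (the sizes, the 200-partition count, and the "key" column are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "key")
other = spark.range(1_000).withColumnRenamed("id", "key")

# 1. limit() itself still runs through one partition, but repartitioning
#    its (now small) output lets the downstream join run in parallel.
small = df.limit(10_000).repartition(200, "key")
joined = small.join(other, "key")

# 2. If an exact row count is not required, sample() avoids the
#    single-partition exchange entirely.
approx = df.sample(fraction=0.01).join(other, "key")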

Select columns in Pyspark Dataframe

寵の児 submitted on 2020-11-30 06:15:18
Question: I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but I am not sure about columns, given that they do not have column names. I have 5 columns and want to loop through each one of them.

+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
| 1|0.0|0.0|0.0|1.0|0.0|0.0|
| 2|1.0|0.0|0.0|0.0|0.0|0.0|
| 3|0.0|0.0|1.0|0.0|0.0|0.0|
+--+---+---+---+---+---+---+

Answer 1: Try something like this: df.select([c for c in df.columns if c in ['_2','
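The answer above is cut off mid-list. A minimal completed sketch, assuming the intent is to keep a subset of the auto-named columns (the wanted list here is illustrative, not the original answer's) and then loop over columns as the question asks:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns get auto-generated names _1 .. _7 when no schema is given.
df = spark.createDataFrame(
    [(1, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
     (2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
     (3, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)]
)

# Keep only the columns whose names appear in a wanted list.
wanted = ["_2", "_4"]
df.select([c for c in df.columns if c in wanted]).show()

# Or loop through the columns one at a time.
for c in df.columns:
    df.select(c).show()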