Partition By with Order By Clause in PostgreSQL

Submitted by 大兔子大兔子 on 2020-02-28 15:44:06

Question


I have a table with these values:

user_id ts                  val
uid1    19.05.2019 01:49:50  0
uid1    19.05.2019 01:50:15  0
uid1    19.05.2019 01:50:20  0
uid1    19.05.2019 01:59:50  1
uid1    19.05.2019 02:20:10  1
uid1    19.05.2019 02:20:15  0
uid1    19.05.2019 02:20:19  0
uid1    19.05.2019 02:30:53  1
uid1    19.05.2019 11:10:25  1
uid1    19.05.2019 11:13:40  0
uid1    19.05.2019 11:13:50  0
uid1    19.05.2019 11:20:19  1
uid2    19.05.2019 15:01:44  0
uid2    19.05.2019 15:05:55  0
uid2    19.05.2019 17:19:35  1
uid2    19.05.2019 17:20:01  0
uid2    19.05.2019 17:20:35  0
uid2    19.05.2019 19:15:50  1

When I query this table with only a partition by clause, the result looks like this:

Query: select *, sum(val) over (partition by user_id) as res from example_table;

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  5
uid1    19.05.2019 01:50:15  0  5
uid1    19.05.2019 01:50:20  0  5
uid1    19.05.2019 01:59:50  1  5
uid1    19.05.2019 02:20:10  1  5
uid1    19.05.2019 02:20:15  0  5
uid1    19.05.2019 02:20:19  0  5
uid1    19.05.2019 02:30:53  1  5
uid1    19.05.2019 11:10:25  1  5
uid1    19.05.2019 11:13:40  0  5
uid1    19.05.2019 11:13:50  0  5
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  2
uid2    19.05.2019 15:05:55  0  2
uid2    19.05.2019 17:19:35  1  2
uid2    19.05.2019 17:20:01  0  2
uid2    19.05.2019 17:20:35  0  2
uid2    19.05.2019 19:15:50  1  2

In the above results, the res column holds the total of the val column for each partition. But if I query the table with both partition by and order by, I get these results:

Query: select *, sum(val) over (partition by user_id order by ts) as res from example_table;

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  0
uid1    19.05.2019 01:50:15  0  0
uid1    19.05.2019 01:50:20  0  0
uid1    19.05.2019 01:59:50  1  1
uid1    19.05.2019 02:20:10  1  2
uid1    19.05.2019 02:20:15  0  2
uid1    19.05.2019 02:20:19  0  2
uid1    19.05.2019 02:30:53  1  3
uid1    19.05.2019 11:10:25  1  4
uid1    19.05.2019 11:13:40  0  4
uid1    19.05.2019 11:13:50  0  4
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  0
uid2    19.05.2019 15:05:55  0  0
uid2    19.05.2019 17:19:35  1  1
uid2    19.05.2019 17:20:01  0  1
uid2    19.05.2019 17:20:35  0  1
uid2    19.05.2019 19:15:50  1  2

With the order by clause, the res column instead holds the cumulative sum of the val column, row by row within each partition.

Why? I can't understand this.


Answer 1:


Update

This behavior is documented here:

4.2.8. Window Function Calls

[..] The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer. Without ORDER BY, this means all rows of the partition are included in the window frame, since all rows become peers of the current row.

That means:

In the absence of a frame_clause, RANGE UNBOUNDED PRECEDING is used by default. That includes:

  • All rows "preceding" the current row according to the ORDER BY clause
  • The current row
  • All rows which have the same values in the ORDER BY columns as the current row

In the absence of an ORDER BY clause, ORDER BY NULL is assumed (though I'm guessing again). Thus the frame includes all rows of the partition, because the value of the ORDER BY expression (always NULL) is the same in every row, which makes every row a peer of the current row.
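The role of ORDER BY peers can be seen with a small sketch. It assumes Python's bundled SQLite (3.25 or later) as a stand-in for PostgreSQL, since both use the same default frame; the table t and its columns grp, k and val are hypothetical and not part of the question.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for PostgreSQL; table "t" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, k INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    # Two rows share k = 2, so they are ORDER BY peers of each other.
    [("a", 1, 10), ("a", 2, 20), ("a", 2, 30), ("a", 3, 40)],
)

peer_rows = conn.execute(
    """
    SELECT k, val,
           -- no frame clause: the default frame applies
           sum(val) OVER (PARTITION BY grp ORDER BY k) AS default_frame,
           -- the default frame written out explicitly
           sum(val) OVER (PARTITION BY grp ORDER BY k
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_frame,
           -- ROWS stops at the current row and ignores peers
           sum(val) OVER (PARTITION BY grp ORDER BY k, val
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
    FROM t
    ORDER BY k, val
    """
).fetchall()

for row in peer_rows:
    print(row)
```

Both rows with k = 2 get 60 under the default frame, because each one's frame extends through its last peer, while the ROWS frame yields 30 and then 60.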

Original answer:

Disclaimer: The following is more a guess than a qualified answer. I didn't find any documentation that confirms what I write. At the same time, I don't think the answers given so far correctly explain the behavior.

The reason for the difference in the results is not directly the ORDER BY clause, since a + b + c is the same as c + b + a. The reason is (and that is my guess; the update above gives the documented default, which uses RANGE rather than ROWS) that the ORDER BY clause implicitly defines the frame_clause as

rows between unbounded preceding and current row

Try the following query:

select *
, sum(val) over (partition by user_id) as res
, sum(val) over (partition by user_id order by ts) as res_order_by
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and current row
  ) as res_order_by_unbounded_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between unbounded preceding and current row
  ) as res_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between current row and unbounded following
  ) as res_following
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and unbounded following
  ) as res_orderby_preceding_following

from example_table;

db<>fiddle

You will see that you can get a cumulative sum without an ORDER BY clause, as well as a "full" sum with an ORDER BY clause.
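For readers without a PostgreSQL instance at hand, here is a minimal runnable sketch of the same idea on the uid2 rows from the question. It assumes Python's bundled SQLite (3.25 or later) as a stand-in for PostgreSQL, and it stores ts as ISO-formatted text (an assumption, so that string order matches time order).

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for PostgreSQL; ts is stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (user_id TEXT, ts TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?, ?)",
    [
        ("uid2", "2019-05-19 15:01:44", 0),
        ("uid2", "2019-05-19 15:05:55", 0),
        ("uid2", "2019-05-19 17:19:35", 1),
        ("uid2", "2019-05-19 17:20:01", 0),
        ("uid2", "2019-05-19 17:20:35", 0),
        ("uid2", "2019-05-19 19:15:50", 1),
    ],
)

frame_rows = conn.execute(
    """
    SELECT val,
           -- ORDER BY with the default frame: running sum
           sum(val) OVER (PARTITION BY user_id ORDER BY ts) AS res_order_by,
           -- ORDER BY with an explicit full frame: partition total
           sum(val) OVER (PARTITION BY user_id ORDER BY ts
                          ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND UNBOUNDED FOLLOWING) AS res_full
    FROM example_table
    ORDER BY ts
    """
).fetchall()

for row in frame_rows:
    print(row)
```

The res_order_by column reproduces the running sum from the question, while res_full stays at the partition total even though ORDER BY is present, which suggests the frame, not the ordering itself, decides the result.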




Answer 2:


That is how order by works with window functions.

When it is not present, then the function acts like an aggregation function over the window frame definition. That is, it returns the same value for everything in the window frame.

When it is present, then the function acts in a cumulative fashion, with the result "up to" the current row.

Of course, this is also influenced by the window frame specification. However, your example queries use only order by, with no rows or range clause, so the default frame applies.




Answer 3:


From 3.5. Window Functions:

...You can also control the order in which rows are processed by window functions using ORDER BY within OVER..

This is the difference between over (partition by user_id), where the rows inside each group are processed in no particular order, and over (partition by user_id order by ts), which processes the rows after sorting them by ts.
This means that for each row a new sum(val) is calculated, based on the rows up to and including that row's position in the sorted order.
It may be easier to understand this with the rank() window function; the documentation section cited at the beginning of this answer has a very good example and more on this topic.
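As a rough illustration of the rank() case, the sketch below compares rank() with and without ORDER BY. It assumes Python's bundled SQLite (3.25 or later) as a stand-in for PostgreSQL; the scores table and its rows are hypothetical and are not the example from the documentation.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for PostgreSQL; "scores" is a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("red", "ann", 50), ("red", "bob", 50), ("red", "cy", 30), ("blue", "dee", 70)],
)

rank_rows = conn.execute(
    """
    SELECT team, player, points,
           -- ordered: rows are ranked by points within each team
           rank() OVER (PARTITION BY team ORDER BY points DESC) AS rnk_ordered,
           -- unordered: every row in the partition is a peer of every other
           rank() OVER (PARTITION BY team) AS rnk_unordered
    FROM scores
    ORDER BY team, points DESC, player
    """
).fetchall()

for row in rank_rows:
    print(row)
```

With ORDER BY, the two tied players share rank 1 and the next player gets rank 3; without it, all rows are peers and every row gets rank 1.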




Answer 4:


Let's create a simple example to understand it properly.

Consider a bank table with daily credits and debits. The following query (shown in Oracle syntax, hence FROM DUAL) calculates both the running balance and the total balance for a customer, as the column names suggest, using the SUM analytic function with and without an ORDER BY clause. PARTITION BY is used to separate the results for individual customers:

WITH BANK_TABLE (CUST_ID, DT, AMOUNT_CR_DR)
AS
(
SELECT 1, DATE '2019-01-01', 1000 FROM DUAL UNION ALL
SELECT 1, DATE '2019-01-02', 2000 FROM DUAL UNION ALL
SELECT 1, DATE '2019-01-03', -1000 FROM DUAL UNION ALL
SELECT 1, DATE '2019-01-04', -500 FROM DUAL UNION ALL
SELECT 1, DATE '2019-01-05', 2000 FROM DUAL
)
SELECT DT, AMOUNT_CR_DR,
SUM(AMOUNT_CR_DR) OVER (PARTITION BY CUST_ID) AS TOTAL_BALANCE_LIFE_TIME,
SUM(AMOUNT_CR_DR) OVER (PARTITION BY CUST_ID ORDER BY DT) AS TOTAL_BALANCE_TILL_DATE
FROM BANK_TABLE
ORDER BY CUST_ID, DT;

DT        AMOUNT_CR_DR TOTAL_BALANCE_LIFE_TIME TOTAL_BALANCE_TILL_DATE
--------- ------------ ----------------------- -----------------------
01-JAN-19         1000                    3500                    1000
02-JAN-19         2000                    3500                    3000
03-JAN-19        -1000                    3500                    2000
04-JAN-19         -500                    3500                    1500
05-JAN-19         2000                    3500                    3500

The PARTITION BY clause divides the rows into groups, and the ORDER BY clause makes the value be calculated in that order.

So, taking the rows in order:

For the 1st row, the sum covers the 1st row only.

For the 2nd row, the sum is the first row plus the second row.

And so on, up to the last row of the partition.
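The row-by-row description above can be modeled in a few lines of plain Python. This is only a sketch of what the two columns contain for the five amounts in the example, not of how a database evaluates window functions; the variable names are chosen here for illustration.

```python
from itertools import accumulate

# Amounts from the example, already sorted by date for one customer.
amounts = [1000, 2000, -1000, -500, 2000]

# Without ORDER BY: every row sees the sum over the whole partition.
total_balance_life_time = [sum(amounts)] * len(amounts)

# With ORDER BY: each row sees the sum of all rows up to and including itself.
total_balance_till_date = list(accumulate(amounts))

for amount, life_time, till_date in zip(
    amounts, total_balance_life_time, total_balance_till_date
):
    print(amount, life_time, till_date)
```

The printed columns match the query output above: 3500 on every row for the lifetime total, and 1000, 3000, 2000, 1500, 3500 for the balance to date.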

Cheers!!



Source: https://stackoverflow.com/questions/57639840/partition-by-with-order-by-clause-in-postgresql
