Rolling (moving) median in Greenplum

早过忘川 提交于 2019-12-11 04:59:51

问题


I would like to calculate the rolling median for a column in Greenplum, i.e. as below:

|  x | rolling_median_x |
| -- + ---------------- |
|  4 |                4 |
|  1 |              2.5 |
|  3 |                3 |
|  2 |              2.5 |
|  1 |                2 |
|  6 |              2.5 |
|  9 |                3 |

x is an integer and for each row rolling_median_x shows the median of x for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3.

Things I've found out so far:

  • the median function can't be used as a framed window function, i.e. median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • the same is true for many other function such as percent_rank or nth_value
  • recursive self join is not supported in this version of Greenplum

As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...

I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.


回答1:


One remark - a citate from Wikipedia: ORDER BY

ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application. Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.


Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
Let say some id column like this:

| id | x | rolling_median_x |
|----|---|------------------|
|  1 | 4 |                4 |
|  2 | 1 |              2.5 |
|  3 | 3 |                3 |
|  4 | 2 |              2.5 |
|  5 | 1 |                2 |
|  6 | 6 |              2.5 |
|  7 | 9 |                3 |

If you cannot use analytic functions, then try pure SQL.
This article shows various methods of computing the Median with SQL.
I think the Henderson’s Median would be best for our needs:

SELECT CASE COUNT(*) % 2
       WHEN 0        -- even sized table
       THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
                                  THEN P2.part_wgt
                                  ELSE NULL END))/2.0
       ELSE P1.part_wgt --odd sized table
       END AS median 
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.part_wgt
HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
                  THEN 1
                  ELSE NULL END)
       = (COUNT(*) + 1) / 2;

Just run this query for each row as a dependent subquery, a general idea is like this:

SELECT t.*, (
        SELECT .... Henderson's query FROM table x
        WHERE x.id <= t.id
        ......
       ) As our_median
FROM table t

You can find an example implementation in this demo

SELECT t.*, (
    SELECT CASE COUNT(*) % 2
           WHEN 0        -- even sized table
           THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
                                      THEN P2.x
                                      ELSE NULL END))/2.0
           ELSE P1.x --odd sized table
           END AS median 
      FROM Table333 AS P1, Table333 AS P2
      WHERE p1.id <= t.id AND p2.id <= t.id
     GROUP BY P1.x
    HAVING COUNT(CASE WHEN P1.x >= P2.x
                      THEN 1
                      ELSE NULL END)
           = (COUNT(*) + 1) / 2
    ) as Our_median
FROM Table333 t;

| id | x | rolling_median_x | our_median |
|----|---|------------------|------------|
|  1 | 4 |                4 |          4 |
|  2 | 1 |              2.5 |        2.5 |
|  3 | 3 |                3 |          3 |
|  4 | 2 |              2.5 |        2.5 |
|  5 | 1 |                2 |          2 |
|  6 | 6 |              2.5 |        2.5 |
|  7 | 9 |                3 |          3 |

This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL




回答2:


I'm using psql 8.2.15 and updating is not an option unfortunately.

Ouch.

If it was the rolling mean, things would be simple, but the rolling median will be very slow due to the need for sorting. The way to avoid this is to insert the values into a heap or a btree as they come which allows to get the rolling median without sorting on each new value. But this needs custom code.

I would use plpython to implement this:

Rolling median algorithm in C



来源:https://stackoverflow.com/questions/45943409/rolling-moving-median-in-greenplum

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!