How to create a custom windowing function for PostgreSQL? (Running Average Example)

前端 未结 4 601
清歌不尽
清歌不尽 2021-01-05 00:34

I would really like to better understand what is involved in creating a UDF that operates over windows in PostgreSQL. I did some searching about how to create UDFs in genera

相关标签:
4条回答
  • 2021-01-05 01:02

    PL/R provides such functionality. See here for some examples. That said, I'm not sure that it (currently) meets your requirement of "keep[ing] a running sum and observation count and [not] sum[ming] up almost identical sets of rows at each iteration" (see here).

    0 讨论(0)
  • 2021-01-05 01:03

    According to the documentation "Other window functions can be added by the user. Also, any built-in or user-defined normal aggregate function can be used as a window function." (section 4.2.8). That worked for me for computing stock split adjustments:

    CREATE OR REPLACE FUNCTION prod(float8, float8) RETURNS float8
      AS 'SELECT $1 * $2;'
      LANGUAGE SQL IMMUTABLE STRICT;
    
    CREATE AGGREGATE prods ( float8 ) (
      SFUNC = prod,
      STYPE = float8,
      INITCOND = 1.0
    );
    
    create or replace view demo.price_adjusted as
      select id, vd,
        prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as adjf,
        rawprice * prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as price
      from demo.prices_raw left outer join demo.adjustments using (id,vd);
    

    Here are the schemas of the two tables:

    CREATE TABLE demo.prices_raw (
      id VARCHAR(30),
      vd DATE,
      rawprice float8 );
    
    CREATE TABLE demo.adjustments (
      id VARCHAR(30),
      vd DATE,
      sdiv float);
    
    0 讨论(0)
  • 2021-01-05 01:03

    Starting with table

     payments
    +------------------------------+
    | customer_id | amount | item  |
    | 5           | 10     | book  |
    | 5           | 71     | mouse |
    | 7           | 13     | cover |
    | 7           | 22     | cable |
    | 7           | 19     | book  |
    +------------------------------+
    
    SELECT customer_id, 
        AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,   
        item, 
    FROM payments`
    

    we get

    +----------------------------------+
    | customer_id | avg_amount | item  |
    | 5           | 40.5       | book  |
    | 5           | 40.5       | mouse |
    | 7           | 18         | cover |
    | 7           | 18         | cable |
    | 7           | 18         | book  |
    +----------------------------------+
    

    AVG being an aggregate function, it can act as a window function. However not all window functions are aggregate functions. The aggregate functions are the non-sophisticated window functions.

    In the query above, let's not use the built-in AVG function and use our own implementation. Does the same, just implemented by the user. The query above becomes:

    SELECT customer_id, 
        my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,   
        item, 
    FROM payments`
    

    The only difference from the former query is that AVG has been replaced with my_avg. We now need to implement our custom function.

    On how to compute the average

    Sum up all the elements, then divide by the number of elements. For customer_id of 7, that would be (13 + 22 + 19) / 3 = 18. We can devide it in:

    • a step-by-step accumulation -- the sum.
    • a final operation -- division.

    On how the aggregate function gets to the result

    The average is computed in steps. Only the last value is necessary. Start with an initial value of 0.

    1. Feed 13. Compute the intermediate/accumulated sum, which is 13.
    2. Feed 22. Compute the accumulated sum, which needs the previous sum plus this element: 13 + 22 = 35
    3. Feed 19. Compute the accumulated sum, which needs the previous sum plus this element: 35 + 19 = 54. This is the total that needs to be divided by the number of element (3).
    4. The result of step 3. is fed to another function, that knows how to divide the accumulated sum by the number of elements

    What happened here is that the state started with the initial value of 0 and was changed with every step, then passed to the next step.

    State travels between steps for as long as there is data. When all data is consumed state goes to a final function (terminal operation). We want the state to contain all the information needed for the accumulator as well as by the terminal operation.

    In the specific case of computing the average, the terminal operation needs to know how many elements the accumulator worked with because it needs to divide by that. For that reason, the state needs to include both the accumulated sum and the number of elements.

    We need a tuple that will contain both. Pre-defined POINT PostgreSQL type to the rescue. POINT(5, 89) means an accumulated sum of 5 elements that has the value of 89. The initial state will be a POINT(0,0).

    The accumulator is implemented in what's called a state function. The terminal operation is implemented in what's called a final function.

    When defining a custom aggregate function we need to specify:

    • the aggregate function name and return type
    • the initial state
    • the type of the state that the infrastructure will pass between steps and to the final function
    • a state function -- knows how to perform the accumulation steps
    • a final function -- knows how to perform the terminal operation. Not always needed (e.g. in a custom implementation of SUM the final value of the accumulated sum is the result.)

    Here's the definition for the custom aggregate function.

    CREATE AGGREGATE my_avg (NUMERIC) ( -- NUMERIC is what the function returns
        initcond = '(0,0)', -- this is the initial state of type POINT
        stype = POINT, -- this is the type of the state that will be passed between steps
        sfunc = my_acc, -- this is the function that knows how to compute a new average from existing average and new element. Takes in the state (type POINT) and an element for the step (type NUMERIC)
        finalfunc my_final_func -- returns the result for the aggregate function. Takes in the state of type POINT (like all other steps) and returns the result as what the aggregate function returns - NUMERIC 
    );
    

    The only thing left is to define two functions my_acc and my_final_func.

    CREATE FUNCTION my_acc (state POINT, elem_for_step NUMERIC) -- performs accumulated sum
    RETURNS POINT
    LANGUAGE SQL
    AS $$
        -- state[0] is the number of elements, state[1] is the accumulated sum
        SELECT POINT(state[0]+1, state[1] + elem_for_step);
    $$;
    
    CREATE FUNCTION my_final_func (POINT) -- performs devision and returns final value
    RETURNS NUMERIC
    LANGUAGE SQL
    AS $$
        -- $1[1] is the sum, $1[0] is the number of elements
        SELECT ($1[1]/$1[0])::NUMERIC;
    $$;
    

    Now that the functions are available CREATE AGGREGATE defined above will run successfully. Now that we have the aggregate defined, the query based on my_avg instead of the built-in AVG can be run:

    SELECT customer_id, 
        my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,    
        item, 
    FROM payments`
    

    The results are identical with what you get when using the built-in AVG.

    The PostgreSQL documentation suggests that the users are limited to implementing user-defined aggregate functions:

    In addition to these [pre-defined window] functions, any built-in or user-defined general-purpose or statistical aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function;

    What I suspect ordered-set or hypothetical-set aggregates means:

    • the value returned is identical to all other rows (e.g. AVG and SUM. In contrast RANK returns different values for all rows in group depending on more sophisticated criteria)
    • it makes no sense to ORDER BY when PARTITIONing because the values are the same for all rows anyway. In contrast we want to ORDER BY when using RANK()

    Query:

    SELECT customer_id, item, rank() OVER (PARTITION BY customer_id ORDER BY amount desc) FROM payments;
    

    Geometric mean

    The following is a user-defined aggregate function that I found no built-in aggregate for and may be useful to some.

    The state function computes the average of the natural logarithms of the terms.

    The final function raises constant e to whatever the accumulator provides.

    CREATE OR REPLACE FUNCTION sum_of_log(state POINT, curr_val NUMERIC)
    RETURNS POINT
    LANGUAGE SQL
    AS $$
        SELECT POINT(state[0] + 1,
            (state[1] * state[0]+ LN(curr_val))/(state[0] + 1));
    $$;
    
    CREATE OR REPLACE FUNCTION e_to_avg_of_log(POINT)
    RETURNS NUMERIC
    LANGUAGE SQL
    AS $$
        select exp($1[1])::NUMERIC;
    $$;
    
    CREATE AGGREGATE geo_mean (NUMBER)
    (
        stype = NUMBER,
        initcond = '(0,0)', -- represent POINT value
        sfunc = sum_of_log,
        finalfunc = e_to_avg_of_log
    );
    
    0 讨论(0)
  • 2021-01-05 01:11

    You have to look to postgresql source code postgresql/src/backend/utils/adt/windowfuncs.c and postgresql/src/backend/executor/nodeWindowAgg.c

    There are no good documentation :( -- fully functional window function should be implemented only in C or PL/v8 - there are no API for other languages.

    http://www.pgcon.org/2009/schedule/track/Version%208.4/128.en.html presentation from author of implementation in PostgreSQL.

    I found only one non core implementation - http://api.pgxn.org/src/kmeans/kmeans-1.1.0/

    http://pgxn.org/dist/plv8/1.3.0/doc/plv8.html

    0 讨论(0)
提交回复
热议问题