How to generate multiple time series in one sql query?

前端 未结 3 1660
余生分开走
余生分开走 2021-01-18 20:06

Here is the database layout. I have a table with sparse sales over time, aggregated per day. If for an item I have 10 sales on the 01-01-2015, I will have an entry, but If I

相关标签:
3条回答
  • 2021-01-18 20:25

    Try this self-contained code where we have used 5 instead of 149 to keep the output short.

    In (1) we use a single SQL statement, as required, to generate all the series producing a long form result. Normally in relational databases long form rather than wide form is used and this form may be preferable but in case not we follow this with a conversion to wide form using the reshape2 package.

    In (2) we show how to replace the SQL statement with R code that uses the dplyr package.

    1) PostgreSQL Regarding the SQL statement below, the innermost select generates a table 1, 2, ..., 5 whose column is day_of_year which is cross joined with entry_daily giving every combination of day_of_year with year and item and keeping only the distinct rows. This is then left joined with entry_daily to pick up the sales numbers which we sum over.

    Assuming you have set up postgreSQL to work with SQL as in FAQ#12 on the sqldf home page ( https://github.com/ggrothendieck/sqldf ) the following should illustrate it and is self contained code that you can just copy and paste into your session.

    library(sqldf)
    library(RPostgreSQL)
    
    # input data
    entry_daily <- 
    structure(list(day_of_year = c(1L, 1L, 7L), year = c(2015L, 2015L, 
    2015L), sales = c(20L, 11L, 9L), item_id = structure(c(1L, 2L, 
    1L), .Label = c("A1", "A2"), class = "factor")), .Names = c("day_of_year", 
    "year", "sales", "item_id"), class = "data.frame", row.names = c(NA, 
    -3L))
    
    s <- sqldf("select A.item_id, A.year, A.day_of_year, sum(coalesce(B.sales, 0)) sales
           from (select distinct x.day_of_year, y.year, y.item_id
                 from (select * from generate_series(1, 5) as day_of_year) as x
                       cross join entry_daily as y) as A
           left join entry_daily as B
           on A.year = B.year and A.day_of_year = B.day_of_year and
              A.item_id = B.item_id
           where A.year = 2015
           group by A.item_id, A.year, A.day_of_year
           order by A.item_id, A.year, A.day_of_year")
    

    The output of the above query is this data.frame:

    > s
       item_id year day_of_year sales
    1       A1 2015           1    20
    2       A1 2015           2     0
    3       A1 2015           3     0
    4       A1 2015           4     0
    5       A1 2015           5     0
    6       A2 2015           1    11
    7       A2 2015           2     0
    8       A2 2015           3     0
    9       A2 2015           4     0
    10      A2 2015           5     0
    

    If you really need it in wide form then we can do that in R using dcast in the reshape2 package:

    library(reshape2)
    dcast(s, item_id + year ~ day_of_year, value.var = "sales")
    

    giving:

      item_id year  1 2 3 4 5
    1      A1 2015 20 0 0 0 0
    2      A2 2015 11 0 0 0 0
    

    2) dplyr Note that as an alternative to the SQL statement this R code would compute s:

    library(dplyr)
    s2 <- expand.grid(item_id = unique(entry_daily$item_id), 
                      year = 2015, 
                      day_of_year = 1:5) %>%
        left_join(entry_daily) %>%
        group_by(item_id, year, day_of_year) %>%
        summarize(sales = sum(sales, na.rm = TRUE)) %>%
        ungroup() %>%
        arrange(item_id, year, day_of_year)
    

    giving:

    > s2
    Joining by: c("item_id", "year", "day_of_year")
    Source: local data frame [10 x 4]
    Groups: item_id, year [?]
    
       item_id  year day_of_year sales
        (fctr) (dbl)       (int) (int)
    1       A1  2015           1    20
    2       A1  2015           2     0
    3       A1  2015           3     0
    4       A1  2015           4     0
    5       A1  2015           5     0
    6       A2  2015           1    11
    7       A2  2015           2     0
    8       A2  2015           3     0
    9       A2  2015           4     0
    10      A2  2015           5     0
    

    Now optionally use the same dcast as in (1).

    0 讨论(0)
  • 2021-01-18 20:28

    Solution 1. Simple query with an aggregate.

    The simplest and fastest way to get the expected result. It is easy to parse the sales column within a client program.

    select item, string_agg(coalesce(sales, 0)::text, ',') sales
    from (
        select distinct item_id item, doy
        from generate_series (1, 10) doy  -- change 10 to given n
        cross join entry_daily
        ) sub
    left join entry_daily on item_id = item and day_of_year = doy
    group by 1
    order by 1;
    
     item |        sales         
    ------+----------------------
     A1   | 20,0,0,0,0,0,9,0,0,0
     A2   | 11,0,0,0,0,0,0,0,0,0
    (2 rows)
    

    Solution 2. Dynamically created view.

    Based on the solution 1 with array_agg() instead of string_agg(). The function creates a view with a given number of columns.

    create or replace function create_items_view(view_name text, days int)
    returns void language plpgsql as $$
    declare
        list text;
    begin
        select string_agg(format('s[%s] "%s"', i::text, i::text), ',')
        into list
        from generate_series(1, days) i;
    
        execute(format($f$
            drop view if exists %s;
            create view %s as select item, %s
            from (
                select item, array_agg(coalesce(sales, 0)) s
                from (
                    select distinct item_id item, doy
                    from generate_series (1, %s) doy
                    cross join entry_daily
                    ) sub
                left join entry_daily on item_id = item and day_of_year = doy
                group by 1
                order by 1
            ) q
            $f$, view_name, view_name, list, days)
        );
    end $$;
    

    Usage:

    select create_items_view('items_view_10', 10);
    
    select * from items_view_10;
    
     item | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 
    ------+----+---+---+---+---+---+---+---+---+----
     A1   | 20 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 |  0
     A2   | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  0
    (2 rows)
    

    Solution 3. Crosstab.

    Easy to use, but very uncomfortable with the greater number of columns due to the need to define the row format.

    create extension if not exists tablefunc;
    
    select * from crosstab (
        'select item_id, day_of_year, sales
        from entry_daily
        order by 1',
        'select i from generate_series (1, 10) i'
    ) as ct 
    (item_id text, "1" int, "2" int, "3" int, "4" int, "5" int, "6" int, "7" int, "8" int, "9" int, "10" int);
    
     item_id | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 
    ---------+----+---+---+---+---+---+---+---+---+----
     A1      | 20 |   |   |   |   |   | 9 |   |   |   
     A2      | 11 |   |   |   |   |   |   |   |   |   
    (2 rows)
    
    0 讨论(0)
  • 2021-01-18 20:28

    First you need a table with all dates to fill the blank dates. 100 years of date mean 36,000 rows so no very big. Instead of calculate every time.

    allDates:

    date_id
    s_date
    

    or created calculating the fields

    date_id
    s_date
    doy = EXTRACT(DOY FROM s_date)
    year = EXTRACT(YEAR FROM s_date)
    

    Your base query will be SQL FIDDLE DEMO:

    SELECT           
          AD.year,
          AD.doy,           
          allitems.item_id,
          COALESCE(SUM(ED.sales), 0) as max_sales
    FROM 
        (SELECT DISTINCT item_id
         FROM entry_daily 
        ) as allitems
    CROSS JOIN alldates AD
    LEFT JOIN entry_daily ED
           ON ED.day_of_year = AD.doy
          AND ED.year = AD.year  
          AND ED.item_id = allitems.item_id
    WHERE AD.year = 2015
    GROUP BY
         AD.year, AD.doy, allitems.item_id
    ORDER BY 
         AD.year, AD.doy, allitems.item_id
    

    You will have this OUTPUT

    | year | doy | item_id | max_sales |
    |------|-----|---------|-----------|
    | 2015 |   1 |      A1 |        20 |
    | 2015 |   1 |      A2 |        11 |
    | 2015 |   2 |      A1 |         0 |
    | 2015 |   2 |      A2 |         0 |
    | 2015 |   3 |      A1 |         0 |
    | 2015 |   3 |      A2 |         0 |
    | 2015 |   4 |      A1 |         0 |
    | 2015 |   4 |      A2 |         0 |
    | 2015 |   5 |      A1 |         0 |
    | 2015 |   5 |      A2 |         0 |
    | 2015 |   6 |      A1 |         0 |
    | 2015 |   6 |      A2 |         0 |
    | 2015 |   7 |      A1 |        39 |
    | 2015 |   7 |      A2 |         0 |
    | 2015 |   8 |      A1 |         0 |
    | 2015 |   8 |      A2 |         0 |
    | 2015 |   9 |      A1 |         0 |
    | 2015 |   9 |      A2 |         0 |
    | 2015 |  10 |      A1 |         0 |
    | 2015 |  10 |      A2 |         0 |
    

    Then you need install tablefunc

    and use crosstab to pivot this table SAMPLE

    0 讨论(0)
提交回复
热议问题