How to generate multiple time series in one SQL query?

余生分开走 2021-01-18 20:06

Here is the database layout. I have a table with sparse sales over time, aggregated per day. If an item has 10 sales on 01-01-2015, I will have an entry, but if I

3 answers

被撕碎了的回忆 (original poster)

2021-01-18 20:25

Try this self-contained code, which uses 5 days instead of 149 to keep the output short.

In (1) we use a single SQL statement, as required, to generate all the series, producing a long-form result. Relational databases normally use long form rather than wide form, so this form may be preferable. In case it is not, we follow it with a conversion to wide form using the reshape2 package.

In (2) we show how to replace the SQL statement with R code that uses the dplyr package.

1) PostgreSQL In the SQL statement below, the innermost select generates a one-column table with the values 1, 2, ..., 5 in a column named day_of_year. This table is cross joined with entry_daily, and only the distinct rows are kept, which gives every combination of day_of_year with each year and item_id present in the data. The result is then left joined with entry_daily to pick up the sales numbers, which are summed per group; days with no matching row contribute 0 through coalesce.

Assuming you have set up PostgreSQL to work with sqldf as described in FAQ#12 on the sqldf home page ( https://github.com/ggrothendieck/sqldf ), the following illustrates the approach. It is self-contained code that you can copy and paste into your session.

library(sqldf)
library(RPostgreSQL)

# input data
entry_daily <- 
structure(list(day_of_year = c(1L, 1L, 7L), year = c(2015L, 2015L, 
2015L), sales = c(20L, 11L, 9L), item_id = structure(c(1L, 2L, 
1L), .Label = c("A1", "A2"), class = "factor")), .Names = c("day_of_year", 
"year", "sales", "item_id"), class = "data.frame", row.names = c(NA, 
-3L))

s <- sqldf("select A.item_id, A.year, A.day_of_year, sum(coalesce(B.sales, 0)) sales
       from (select distinct x.day_of_year, y.year, y.item_id
             from (select * from generate_series(1, 5) as day_of_year) as x
                   cross join entry_daily as y) as A
       left join entry_daily as B
       on A.year = B.year and A.day_of_year = B.day_of_year and
          A.item_id = B.item_id
       where A.year = 2015
       group by A.item_id, A.year, A.day_of_year
       order by A.item_id, A.year, A.day_of_year")

The output of the above query is this data.frame:

> s
   item_id year day_of_year sales
1       A1 2015           1    20
2       A1 2015           2     0
3       A1 2015           3     0
4       A1 2015           4     0
5       A1 2015           5     0
6       A2 2015           1    11
7       A2 2015           2     0
8       A2 2015           3     0
9       A2 2015           4     0
10      A2 2015           5     0
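
For readers without a PostgreSQL backend, the same steps (build the full grid, left join, treat missing sales as 0, sum) can be sketched in base R. This is only an illustrative sketch and not part of the original query: the names pairs, scaffold, joined and s_base are made up here, and entry_daily is rebuilt with data.frame() for readability.

```r
# Input data from the answer, rebuilt with data.frame() (same values)
entry_daily <- data.frame(
  day_of_year = c(1L, 1L, 7L),
  year        = c(2015L, 2015L, 2015L),
  sales       = c(20L, 11L, 9L),
  item_id     = factor(c("A1", "A2", "A1"))
)

# Step 1: distinct (item_id, year) pairs for 2015, crossed with days 1..5.
# merge(..., by = NULL) yields the Cartesian product, like the cross join.
pairs    <- unique(entry_daily[entry_daily$year == 2015, c("item_id", "year")])
scaffold <- merge(pairs, data.frame(day_of_year = 1:5), by = NULL)

# Step 2: left join onto the sparse table; days without sales get NA
joined <- merge(scaffold, entry_daily,
                by = c("item_id", "year", "day_of_year"), all.x = TRUE)

# Step 3: equivalent of coalesce(B.sales, 0)
joined$sales[is.na(joined$sales)] <- 0L

# Step 4: sum per group, then sort like the ORDER BY clause
s_base <- aggregate(sales ~ item_id + year + day_of_year, data = joined, FUN = sum)
s_base <- s_base[order(s_base$item_id, s_base$year, s_base$day_of_year), ]
rownames(s_base) <- NULL
```

Printing scaffold before the join shows the intermediate table that the subquery aliased as A produces: one row per item, year and day, whether or not any sales exist.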

If you really need the result s in wide form, convert it in R using dcast from the reshape2 package:

library(reshape2)
dcast(s, item_id + year ~ day_of_year, value.var = "sales")

giving:

  item_id year  1 2 3 4 5
1      A1 2015 20 0 0 0 0
2      A2 2015 11 0 0 0 0
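
If installing reshape2 is not an option, a similar wide layout can probably be obtained with base R's reshape(). This is a swapped-in technique, not the dcast call used above, and the column names differ (sales.1, ..., sales.5 instead of 1, ..., 5). The data frame s is rebuilt by hand here from the printed output so that the sketch runs on its own.

```r
# Long-form result, copied from the printed output of s above
s <- data.frame(
  item_id     = rep(c("A1", "A2"), each = 5),
  year        = 2015L,
  day_of_year = rep(1:5, times = 2),
  sales       = c(20L, 0L, 0L, 0L, 0L, 11L, 0L, 0L, 0L, 0L)
)

# One row per (item_id, year); one sales.<day> column per day_of_year
wide <- reshape(s,
                idvar     = c("item_id", "year"),
                timevar   = "day_of_year",
                direction = "wide")
rownames(wide) <- NULL
print(wide)
```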

2) dplyr As an alternative to the SQL statement, this R code computes the same result as s:

library(dplyr)
s2 <- expand.grid(item_id = unique(entry_daily$item_id), 
                  year = 2015, 
                  day_of_year = 1:5) %>%
    left_join(entry_daily) %>%
    group_by(item_id, year, day_of_year) %>%
    summarize(sales = sum(sales, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(item_id, year, day_of_year)

giving:

> s2
Joining by: c("item_id", "year", "day_of_year")
Source: local data frame [10 x 4]
Groups: item_id, year [?]

   item_id  year day_of_year sales
    (fctr) (dbl)       (int) (int)
1       A1  2015           1    20
2       A1  2015           2     0
3       A1  2015           3     0
4       A1  2015           4     0
5       A1  2015           5     0
6       A2  2015           1    11
7       A2  2015           2     0
8       A2  2015           3     0
9       A2  2015           4     0
10      A2  2015           5     0

Now optionally use the same dcast as in (1).
