SQL - Vertica: How to generate daily rows with most previous date data

空扰寡人 提交于 2020-01-06 06:58:36

问题


I have a base table like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)

   Upd_dt       URL    Score
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma

And I want to create a table in daily-url level, using most previous date's value for the new rows, result should look like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-07-27','A','x'
UNION ALL SELECT DATE '2019-07-27','B','alpha'
UNION ALL SELECT DATE '2019-07-28','A','x'
UNION ALL SELECT DATE '2019-07-28','B','alpha'
UNION ALL SELECT DATE '2019-07-29','A','x'
UNION ALL SELECT DATE '2019-07-29','B','alpha'
UNION ALL SELECT DATE '2019-07-30','A','x'
UNION ALL SELECT DATE '2019-07-30','B','alpha'
UNION ALL SELECT DATE '2019-07-31','A','x'
UNION ALL SELECT DATE '2019-07-31','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-02','A','y'
UNION ALL SELECT DATE '2019-08-02','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
UNION ALL SELECT DATE '2019-08-04','A','z'
UNION ALL SELECT DATE '2019-08-04','B','gamma'
UNION ALL SELECT DATE '2019-08-05','A','z'
UNION ALL SELECT DATE '2019-08-05','B','gamma'
) 

Which looks like:

   Upd_dt       URL    Score 
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-07-27      A       x
 2019-07-27      B      alpha 
 2019-07-28      A       x
 2019-07-28      B      alpha 
 2019-07-29      A       x
 2019-07-29      B      alpha 
 2019-07-30      A       x
 2019-07-30      B      alpha 
 2019-07-31      A       x
 2019-07-31      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-02      A       y
 2019-08-02      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma
 2019-08-04      A       z
 2019-08-04      B      gamma
 2019-08-05      A       z
 2019-08-05      B      gamma
.
.
.

Current process is: I built a daily dimension table since 7/26/2019 till today by:

/* SELECT CAST(slice_time AS DATE) dates FROM testcalendar mtc TIMESERIES slice_time as '1 day' OVER (ORDER BY CAST(mtc.dates as TIMESTAMP)); */

so I get:

Dates

2019-07-26

2019-07-27

2019-07-28

2019-07-29

.

.

.

2019-10-12 (today)

I'm thinking if I can use function such as "interpolate previous value" to join my first table by dates, to generate missing days by using values from most previous date data, while it failed.

The result didn't generate rows for missing days.

Please let me know if anyone has any better idea on this.

Thanks!


回答1:


As a starting warning : only store a "daily photograph" when it really, really is necessary. In my past, I once ended up having 364 rows too many per year, as the values only changed once a year. In Vertica, that costs license, and CPU and clock time for joining and grouping ...

But, for the rest - Good start.

But you could apply the TIMESERIES without having to build a calendar.

The trick is to "extrapolate" manually what you can INTERPOLATE automatically.

Add an in-line 'padding' table, which contains the newest value per URL, but give it CURRENT_DATE instead of the newest actual date - using Vertica's peculiar analytic limit clause LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC) .

UNION SELECT that padding table with your input, and apply the TIMESERIES clause to that UNION SELECT.

Like so:

WITH
-- your input ...
score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
-- real WITH clause would start here ...                                                                                                                                                 
,
-- newest row per Url, just with current date
pad_newest AS (
SELECT
  CURRENT_DATE
, url 
, score
FROM score_upd
LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
)   
,   
with_newest AS (
SELECT
  *   
FROM score_upd
UNION ALL 
SELECT *
FROM pad_newest
)   
SELECT
  ts_dt::DATE           AS upd_dt
, url                   AS url 
, TS_FIRST_VALUE(score) AS score
FROM with_newest
TIMESERIES ts_dt AS '1 day' OVER (
  PARTITION BY url ORDER BY upd_dt::TIMESTAMP
)   
ORDER BY 1,2 
;   


来源:https://stackoverflow.com/questions/58379960/sql-vertica-how-to-generate-daily-rows-with-most-previous-date-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!