I have data with irregularly missing values, and I'd like to resample it at a fixed interval using linear interpolation in BigQuery Standard SQL.
Specifically, I have data like this:
# data is missing irregularly
| time | value |
| 1 | 3.0 |
| 5 | 5.0 |
| 7 | 1.0 |
| 9 | 8.0 |
| 10 | 4.0 |
and I'd like to convert this table as follows:
# interpolated with interval of 1
| time | value_interpolated |
| 1 | 3.0 |
| 2 | 3.5 |
| 3 | 4.0 |
| 4 | 4.5 |
| 5 | 5.0 |
| 6 | 3.0 |
| 7 | 1.0 |
| 8 | 4.5 |
| 9 | 8.0 |
| 10 | 4.0 |
Any smart solution for this?
Supplement: this question is similar to this question on Stack Overflow, but differs in that the data is missing irregularly.
Thank you.
Below is for BigQuery Standard SQL
select time,
  ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
    -- nearest known point at or before the current time
    first_value(tick ignore nulls) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    -- nearest known point at or after the current time
    first_value(tick ignore nulls) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value,
  from (
    -- one row per time step; tick and value are null where data is missing
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using (time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
  win2 as (order by time rows between current row and unbounded following)
)
If applied to the sample data from your question, the output is the interpolated table shown in the question.
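For reference, the expression inside ifnull is ordinary two-point linear interpolation. The symbols below are chosen here for illustration and are not part of the query: $t_0, v_0$ stand for start_tick and start_value, $t_1, v_1$ for end_tick and end_value, and $t$ for the time being filled in.

```latex
v(t) = v_0 + \frac{v_1 - v_0}{t_1 - t_0}\,(t - t_0)

% time = 2, between the known points (1, 3.0) and (5, 5.0):
v(2) = 3.0 + \frac{5.0 - 3.0}{5 - 1}\,(2 - 1) = 3.5

% time = 8, between the known points (7, 1.0) and (9, 8.0):
v(8) = 1.0 + \frac{8.0 - 1.0}{9 - 7}\,(8 - 7) = 4.5
```

Both results agree with the value_interpolated column in the question's desired output.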
Here is an example of how to solve this in PostgreSQL.
with data
  as (select time
            ,value
            ,lead(time) over(order by time) as next_time
            ,lead(value) over(order by time) as next_value
            ,(lead(value) over(order by time) - value) as val_diff
            ,(lead(time) over(order by time) - time) as time_diff
        from t
     )
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then
                 value
            else value + (val_diff*1.0/time_diff)*(generate_series - time)*1.0
        end as val_grp
  from data
 cross join generate_series(time, coalesce(next_time - 1, time))
| time | generate_series | grp | val_grp |
| 1 | 1 | 0 | 3.0 |
| 1 | 2 | 1 | 3.500000000000000000000 |
| 1 | 3 | 2 | 4.000000000000000000000 |
| 1 | 4 | 3 | 4.500000000000000000000 |
| 5 | 5 | 0 | 5.0 |
| 5 | 6 | 1 | 3.00000000000000000 |
| 7 | 7 | 0 | 1.0 |
| 7 | 8 | 1 | 4.50000000000000000 |
| 9 | 9 | 0 | 8.0 |
| 10 | 10 | 0 | 4.0 |
I believe the syntax would be different in BigQuery using UNNEST and GENERATE_ARRAY as follows. You could give it a try.
with data
  as (select time
            ,value
            ,lead(time) over(order by time) as next_time
            ,lead(value) over(order by time) as next_value
            ,(lead(value) over(order by time) - value) as val_diff
            ,(lead(time) over(order by time) - time) as time_diff
        from t
     )
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then
                 value
            else value + (val_diff*1.0/time_diff)*(generate_series - time)*1.0
        end as val_grp
  from data
 cross join UNNEST(GENERATE_ARRAY(time, coalesce(next_time - 1, time))) as generate_series
In BigQuery you can generate the extra rows for each row using generate_array(). Then you can use lead() to get information from the next row and some arithmetic for interpolation:
with t as (
      select 1 as time, 3.0 as value union all
      select 5 , 5.0 union all
      select 7 , 1.0 union all
      select 9 , 8.0 union all
      select 10 , 4.0
     ),
     tt as (
      select t.*,
             lead(time) over (order by time) as next_time,
             lead(value) over (order by time) as next_value
      from t
     )
select coalesce(n, tt.time) as time,
       (case when n = tt.time or n is null then value
             else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
        end) as value
from tt left join
     unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
     on true
order by 1;
Note: You have a column called time that contains an integer. If this is really a date/time data type of some kind, I would suggest asking a new question with more appropriate sample data and desired results, if you don't see how to adapt this answer.