Question
I have data with values missing at irregular intervals, and I'd like to resample it at a fixed interval using linear interpolation in BigQuery Standard SQL.
Specifically, I have data like this:
# data is missing irregularly
+------+-------+
| time | value |
+------+-------+
| 1 | 3.0 |
| 5 | 5.0 |
| 7 | 1.0 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+-------+
and I'd like to convert this table as follows:
# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
| 1 | 3.0 |
| 2 | 3.5 |
| 3 | 4.0 |
| 4 | 4.5 |
| 5 | 5.0 |
| 6 | 3.0 |
| 7 | 1.0 |
| 8 | 4.5 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+--------------------+
Is there a smart solution for this?
Supplement: this is similar to this question on Stack Overflow, but differs in that the data is missing at irregular intervals.
Thank you.
Answer 1:
Below is for BigQuery Standard SQL
#standardSQL
select time,
  ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
    first_value(tick ignore nulls) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    first_value(tick ignore nulls) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value,
  from (
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
    win2 as (order by time rows between current row and unbounded following)
)
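The fallback expression inside ifnull() is ordinary two-point linear interpolation. As a reading aid, with notation that is mine rather than the answer's: (t_0, v_0) stands for start_tick and start_value, the nearest known sample at or before the row, and (t_1, v_1) stands for end_tick and end_value, the nearest known sample at or after it. The second line works through time = 6 from the sample data.

```latex
v(t) = v_0 + \frac{v_1 - v_0}{t_1 - t_0}\,(t - t_0)

v(6) = 5.0 + \frac{1.0 - 5.0}{7 - 5}\,(6 - 5) = 3.0
```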
Applied to the sample data from your question, this query returns the interpolated values shown in your expected table.
Answer 2:
Here is an example of how to solve this in PostgreSQL.
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c560dd9a8db095920d0a15834b6768f1
with data
as (select time
          ,lead(time) over(order by time) as next_time
          ,value
          ,lead(value) over(order by time) as next_value
          ,(lead(value) over(order by time) - value) as val_diff
          ,(lead(time) over(order by time) - time) as time_diff
    from t
   )
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then
            value
       else value + (val_diff*1.0/time_diff)*(generate_series-time)*1.0
       end as val_grp
from data
cross join generate_series(time, coalesce(next_time-1,time))
+------+-----------------+-----+-------------------------+
| time | generate_series | grp | val_grp |
+------+-----------------+-----+-------------------------+
| 1 | 1 | 0 | 3.0 |
| 1 | 2 | 1 | 3.500000000000000000000 |
| 1 | 3 | 2 | 4.000000000000000000000 |
| 1 | 4 | 3 | 4.500000000000000000000 |
| 5 | 5 | 0 | 5.0 |
| 5 | 6 | 1 | 3.00000000000000000 |
| 7 | 7 | 0 | 1.0 |
| 7 | 8 | 1 | 4.50000000000000000 |
| 9 | 9 | 0 | 8.0 |
| 10 | 10 | 0 | 4.0 |
+------+-----------------+-----+-------------------------+
I believe the equivalent in BigQuery uses UNNEST and GENERATE_ARRAY in place of generate_series, as follows. You could give it a try.
with data
as (select time
          ,lead(time) over(order by time) as next_time
          ,value
          ,lead(value) over(order by time) as next_value
          ,(lead(value) over(order by time) - value) as val_diff
          ,(lead(time) over(order by time) - time) as time_diff
    from t
   )
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then
            value
       else value + (val_diff*1.0/time_diff)*(generate_series-time)*1.0
       end as val_grp
from data
cross join UNNEST(GENERATE_ARRAY(time, coalesce(next_time-1,time))) as generate_series
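To try the BigQuery variant without creating a table first, the sample rows from the question can be inlined as a CTE. The following is a minimal sketch along those lines, not a tested drop-in. The CTE name t mirrors the query above, the generated column name tick is my own choice, and the helper columns are dropped so that only the two columns requested in the question remain.

```sql
-- Sketch only: sample rows copied from the question, inlined as CTE `t`.
with t as (
  select 1 as time, 3.0 as value union all
  select 5, 5.0 union all
  select 7, 1.0 union all
  select 9, 8.0 union all
  select 10, 4.0
),
data as (
  -- Pair each known sample with the next known sample.
  select time,
         value,
         lead(time) over (order by time) as next_time,
         lead(value) over (order by time) as next_value
  from t
)
select tick as time,
       -- At a known sample (tick = time) keep the stored value;
       -- otherwise move along the straight line toward the next sample.
       case when tick = time then value
            else value + (next_value - value) * (tick - time) / (next_time - time)
       end as value_interpolated
from data
-- `tick` (hypothetical name) runs from this sample up to just before the next one;
-- the last sample has no successor, so it expands to a single tick.
cross join unnest(generate_array(time, coalesce(next_time - 1, time))) as tick
order by 1;
```

Because value is a FLOAT64, the multiplication and division stay in floating point, so the *1.0 casts from the PostgreSQL version should not be needed here.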
Answer 3:
In BigQuery you can generate the extra rows for each row using generate_array(). Then you can use lead() to get information from the next row and some arithmetic for interpolation:
with t as (
      select 1 as time, 3.0 as value union all
      select 5 , 5.0 union all
      select 7 , 1.0 union all
      select 9 , 8.0 union all
      select 10 , 4.0
     ),
     tt as (
      select t.*,
             lead(time) over (order by time) as next_time,
             lead(value) over (order by time) as next_value
      from t
     )
select coalesce(n, tt.time) as time,
       (case when n = tt.time or n is null then value
             else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
        end) as value
from tt left join
     unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
     on true
order by 1;
Note: You have a column called time that contains an integer. If this is really a date/time data type of some kind, I would suggest that you ask a new question with more appropriate sample data and desired results, if you don't see how to adapt this answer.
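For readers who want a starting point for that adaptation, here is a rough sketch of the same query over a TIMESTAMP column. Everything specific in it is an assumption for illustration: the column name ts, the sample timestamps, and the one-minute target interval. It also assumes every sample falls exactly on a minute boundary; data that does not would need extra handling.

```sql
-- Sketch only: hypothetical TIMESTAMP column `ts`, 1-minute target interval.
with t as (
  select timestamp '2020-11-13 00:01:00' as ts, 3.0 as value union all
  select timestamp '2020-11-13 00:05:00', 5.0 union all
  select timestamp '2020-11-13 00:07:00', 1.0
),
tt as (
  select t.*,
         lead(ts) over (order by ts) as next_ts,
         lead(value) over (order by ts) as next_value
  from t
)
select coalesce(n, tt.ts) as ts,
       (case when n = tt.ts or n is null then tt.value
             -- timestamp_diff() replaces the integer subtraction of the original
             else tt.value + (tt.next_value - tt.value)
                  * timestamp_diff(n, tt.ts, minute)
                  / timestamp_diff(tt.next_ts, tt.ts, minute)
        end) as value_interpolated
from tt left join
     -- generate_timestamp_array() replaces generate_array(); the upper bound
     -- stops one step short of the next sample, as in the integer version
     unnest(generate_timestamp_array(tt.ts,
                                     timestamp_sub(tt.next_ts, interval 1 minute),
                                     interval 1 minute)) n
     on true
order by 1;
```

The structure is unchanged from the integer version: only the array generator and the two differences in the interpolation ratio are swapped for their timestamp counterparts.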
Source: https://stackoverflow.com/questions/64816885/how-to-fill-irregularly-missing-values-with-linear-interepolation-in-bigquery