Question
I have an installs table in which several installs share the same user_id but have different install_date values.
I want to get all revenue records, each joined with the nearest install record whose install_date is less than revenue_date, because I need its source field value for further processing.
That means the output row count should equal the number of records in the revenue table.
How can this be achieved in BigQuery?
Here is the data:
installs
install_date  user_id  source
--------------------------------
2020-01-10    user_a   source_I
2020-01-15    user_a   source_II
2020-01-20    user_a   source_III
***info about other users***
revenue
revenue_date  user_id  revenue
--------------------------------
2020-01-11    user_a   10
2020-01-21    user_a   20
***info about other users***
Answer 1:
Consider the solution below:
select any_value(r).*,
  array_agg(
    (select as struct i.* except(user_id))
    order by install_date desc
    limit 1
  )[offset(0)].*
from `project.dataset.revenue` r
join `project.dataset.installs` i
  on i.user_id = r.user_id
  and install_date < revenue_date
group by format('%t', r)
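If applied to the sample data in your question, each revenue row is paired with the latest install that precedes it: the 2020-01-11 row gets install_date 2020-01-10 and source_I, and the 2020-01-21 row gets install_date 2020-01-20 and source_III.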
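To try this without creating any tables, the sample rows from the question can be inlined as CTEs. This is only a sketch: the CTE names installs and revenue stand in for `project.dataset.installs` and `project.dataset.revenue`, and the column types (DATE, STRING, INT64) are assumptions, since the question does not state them.

```sql
-- Sample rows from the question, inlined as CTEs (column types are assumed).
with installs as (
  select date '2020-01-10' as install_date, 'user_a' as user_id, 'source_I' as source union all
  select date '2020-01-15', 'user_a', 'source_II' union all
  select date '2020-01-20', 'user_a', 'source_III'
), revenue as (
  select date '2020-01-11' as revenue_date, 'user_a' as user_id, 10 as revenue union all
  select date '2020-01-21', 'user_a', 20
)
select any_value(r).*,
  -- Keep only the most recent install among those dated before the revenue row.
  array_agg(
    (select as struct i.* except(user_id))
    order by install_date desc
    limit 1
  )[offset(0)].*
from revenue r
join installs i
  on i.user_id = r.user_id
  and install_date < revenue_date
-- One output row per distinct revenue row (rows are compared by their text form).
group by format('%t', r)
```

Two caveats worth keeping in mind: because this is an inner join, a revenue row with no earlier install for that user does not appear in the output, and revenue rows that are identical in every column collapse into one group.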
Answer 2:
You may be able to use left join for this:
select r.*, i.* except (user_id)
from revenue r left join
     (select i.*,
             lead(install_date) over (partition by user_id order by install_date) as next_install_date
      from installs i
     ) i
     on r.user_id = i.user_id and
        r.revenue_date >= i.install_date and
        (r.revenue_date < i.next_install_date or i.next_install_date is null);
I have had problems in the past with left joins and inequalities. However, I think this will work now in BigQuery.
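Note that the query above matches an install made on the same day as the revenue (>=), while the question asks for an install_date strictly less than revenue_date. A possible variant with the boundaries shifted is sketched below; it is not part of the original answer. It assumes install dates are distinct per user, inlines the question's sample rows as CTEs with assumed column types, and also drops the helper column next_install_date from the output.

```sql
-- Sample rows from the question, inlined as CTEs (column types are assumed).
with installs as (
  select date '2020-01-10' as install_date, 'user_a' as user_id, 'source_I' as source union all
  select date '2020-01-15', 'user_a', 'source_II' union all
  select date '2020-01-20', 'user_a', 'source_III'
), revenue as (
  select date '2020-01-11' as revenue_date, 'user_a' as user_id, 10 as revenue union all
  select date '2020-01-21', 'user_a', 20
)
select r.*, i.* except (user_id, next_install_date)
from revenue r
left join (
  -- For each install, look up the date of the same user's next install.
  select i.*,
    lead(install_date) over (partition by user_id order by install_date) as next_install_date
  from installs i
) i
  on r.user_id = i.user_id
  -- Strictly earlier install, and no later install that is also strictly earlier.
  and i.install_date < r.revenue_date
  and (r.revenue_date <= i.next_install_date or i.next_install_date is null)
order by r.revenue_date
```

On the sample data this pairs the 2020-01-11 revenue with source_I and the 2020-01-21 revenue with source_III. Because it is a left join, a revenue row with no earlier install is kept, with nulls in the install columns, so the output has one row per revenue record.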
Source: https://stackoverflow.com/questions/65886871/join-by-nearest-date-for-the-table-with-duplicate-records-in-bigquery