背景
对于数据开发人员来说,手写sql是比较熟悉的了,就有这样一道题,面试时需要手写sql,这就是非常经典的连续登录问题,大厂小厂都爱问,这种题说简单也不简单,说难也不难,关键是要有思路。
真题
hql统计连续登陆的三天及以上的用户
这个问题可以扩展到很多相似的问题:连续几个月充值会员、连续天数有商品卖出、连续打车、连续逾期。
数据提供
用户ID、登入日期
user01,2018-02-28
user01,2018-03-01
user01,2018-03-02
user01,2018-03-04
user01,2018-03-05
user01,2018-03-06
user01,2018-03-07
user02,2018-03-01
user02,2018-03-02
user02,2018-03-03
user02,2018-03-06
输出字段
+---------+--------+-------------+-------------+--+
| uid | times | start_date | end_date |
+---------+--------+-------------+-------------+--+
群内讨论
这道题在群里发出后,大家就展开了激烈的讨论:
解决方案
可以看出来,有很多种不同的解决方案。
这里就为大家提供一种比较常见的方案:
建表
create table wedw_dw.t_login_info(
user_id string COMMENT '用户ID'
,login_date date COMMENT '登录日期'
)
row format delimited
fields terminated by ',';
导数据
hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/
验证数据
select * from wedw_dw.t_login_info;
+----------+-------------+--+
| user_id | login_date |
+----------+-------------+--+
| user01 | 2018-02-28 |
| user01 | 2018-03-01 |
| user01 | 2018-03-02 |
| user01 | 2018-03-04 |
| user01 | 2018-03-05 |
| user01 | 2018-03-06 |
| user01 | 2018-03-07 |
| user02 | 2018-03-01 |
| user02 | 2018-03-02 |
| user02 | 2018-03-03 |
| user02 | 2018-03-06 |
+----------+-------------+--+
解决方案
select
t2.user_id as user_id
,count(1) as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
select
t1.user_id
,t1.login_date
,date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
) t2
group by
t2.user_id
,t2.date_diff
having times >= 3
;
结果
+----------+--------+-------------+-------------+--+
| user_id | times | start_date | end_date |
+----------+--------+-------------+-------------+--+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+----------+--------+-------------+-------------+--+
思路
先把数据按照用户id分组,根据登录日期排序
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
+----------+-------------+-----+--+
| user_id | login_date | rn |
+----------+-------------+-----+--+
| user01 | 2018-02-28 | 1 |
| user01 | 2018-03-01 | 2 |
| user01 | 2018-03-02 | 3 |
| user01 | 2018-03-04 | 4 |
| user01 | 2018-03-05 | 5 |
| user01 | 2018-03-06 | 6 |
| user01 | 2018-03-07 | 7 |
| user02 | 2018-03-01 | 1 |
| user02 | 2018-03-02 | 2 |
| user02 | 2018-03-03 | 3 |
| user02 | 2018-03-06 | 4 |
+----------+-------------+-----+--+
2.用登录日期减去排序数字rn,得到的差值日期如果是相等的,则说明这两天肯定是连续的
select
t1.user_id
,t1.login_date
,date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
;
+----------+-------------+-------------+--+
| user_id | login_date | date_diff |
+----------+-------------+-------------+--+
| user01 | 2018-02-28 | 2018-02-27 |
| user01 | 2018-03-01 | 2018-02-27 |
| user01 | 2018-03-02 | 2018-02-27 |
| user01 | 2018-03-04 | 2018-02-28 |
| user01 | 2018-03-05 | 2018-02-28 |
| user01 | 2018-03-06 | 2018-02-28 |
| user01 | 2018-03-07 | 2018-02-28 |
| user02 | 2018-03-01 | 2018-02-28 |
| user02 | 2018-03-02 | 2018-02-28 |
| user02 | 2018-03-03 | 2018-02-28 |
| user02 | 2018-03-06 | 2018-03-02 |
+----------+-------------+-------------+--+
3.根据user_id和日期差date_diff 分组,最小登录日期即为此次连续登录的开始日期start_date,最大登录日期即为结束日期end_date,登录次数即为分组后的count(1)
select
t2.user_id as user_id
,count(1) as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
select
t1.user_id
,t1.login_date
,date_sub(t1.login_date,rn) as date_diff
from
(
select
user_id
,login_date
,row_number() over(partition by user_id order by login_date asc) as rn
from
wedw_dw.t_login_info
) t1
) t2
group by
t2.user_id
,t2.date_diff
having times >= 3
;
+----------+--------+-------------+-------------+--+
| user_id | times | start_date | end_date |
+----------+--------+-------------+-------------+--+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+----------+--------+-------------+-------------+--+
结束语
以上仅提供了一种解决方案,小伙伴有其他方案的话,可以进群交流哦
本文分享自微信公众号 - 大数据私房菜(datagogogo)。
如有侵权,请联系 support@oschina.cn 删除。
本文参与“OSC源创计划”,欢迎正在阅读的你也加入,一起分享。
来源:oschina
链接:https://my.oschina.net/u/4631230/blog/4544698