目录
一、连续登陆
1.1 连续登陆3天以上的用户
0 问题描述
1 数据准备
2 数据分析
3 小结
1.2 每个用户历史至今连续登录的最大天数
0 问题描述
1 数据准备
2 数据分析
3 小结
1.3 每个用户连续登录的最大天数(间断也算)
0 问题描述
1 数据准备
2 数据分析
3 小结
查询连续登陆3天以上的用户(字节面试题)
create table if not exists table1 (id int comment '用户id', `date` string comment'用户登录时间'); insert overwrite table table1 values (1,'2019-01-01 19:28:00'), (1,'2019-01-02 19:53:00'), (1,'2019-01-03 22:00:00'), (1,'2019-01-05 20:55:00'), (1,'2019-01-06 21:58:00'), (2,'2019-02-01 19:25:00'), (2,'2019-02-02 21:00:00'), (2,'2019-02-04 22:05:00'), (2,'2019-02-05 20:59:00'), (2,'2019-02-06 19:05:00'), (3,'2019-03-04 21:05:00'), (3,'2019-03-05 19:10:00'), (3,'2019-03-06 19:55:00'), (3,'2019-03-07 21:05:00');
select distinct id from (select id, diff from ( select id, date_sub(dt, row_number()over (partition by id order by dt)) diff from ( --- 同一个用户一天可能登陆多次,所以,先去重 select id, date_format(`date`,'yyyy-MM-dd') as dt from table1 -- current_date() 获取当前的年月日 where date_format(`date`,'yyyy-MM-dd') between date_sub(current_date(),7) and current_date() group by id, date_format(`date`,'yyyy-MM-dd') ) tmp1 ) tmp2 group by id, diff having count(1) >= 3) tmp3;
“连续登陆”的解题核心:分组排序,用时间减去排序,如果连续的话他们的差会是相同值
(1)对日期排序: row_number() over (partition by user_id oder by login_date)
(2)求日期和排序的差值diff:date_sub(login_date,row_number() over (partition by user_id oder by login_date)) as diff;
(3)对用户及差值diff分组:group by user_id,diff ;
(4)where count(1) >= 3的用户 user_id 就是连续登陆3天及以上的用户
查询每个用户历史至今连续登录的最大天数
create table if not exists table1 (id int comment '用户id', `date` string comment'用户登录时间'); insert overwrite table table1 values (1,'2019-01-01 19:28:00'), (1,'2019-01-02 19:53:00'), (1,'2019-01-03 22:00:00'), (1,'2019-01-05 20:55:00'), (1,'2019-01-06 21:58:00'), (2,'2019-02-01 19:25:00'), (2,'2019-02-02 21:00:00'), (2,'2019-02-04 22:05:00'), (2,'2019-02-05 20:59:00'), (2,'2019-02-06 19:05:00'), (3,'2019-03-04 21:05:00'), (3,'2019-03-05 19:10:00'), (3,'2019-03-06 19:55:00'), (3,'2019-03-07 21:05:00');
select id, max(cnt) as days from ( select id, count(1) as cnt from ( select id, `date`, date_sub(`date`, row_number() over (partition by id order by `date`)) diff from (--用户在同一天可能登录多次,需要去重 select id, date_format(`date`, 'yyyy-MM-dd') as `date` from table1 group by id, date_format(`date`, 'yyyy-MM-dd') ) tmp1 ) tmp2 group by id, diff ) tmp3 group by id;
“连续登陆”的解题核心:分组排序,用时间减去排序,如果连续的话他们的差会是相同值
(1)对日期排序: row_number() over (partition by user_id oder by login_date)
(2)求日期和排序的差值diff:date_sub(login_date,row_number() over (partition by user_id oder by login_date)) as diff;
(3)对用户及差值diff分组:select count(1) as cnt .......group by user_id,diff ;
(4)max(cnt)得到就是每个用户历史至今连续登陆的 最大天数
统计各用户最长的连续登录天数,间断一天也算作连续。例如:一个用户在1,3,5,6号登录,则视为连续6天登录。
create table if not exists table1 (id int comment '用户id', `date` string comment'用户登录时间'); insert overwrite table table1 values (1,'2019-01-01 19:28:00'), (1,'2019-01-02 19:53:00'), (1,'2019-01-03 22:00:00'), (1,'2019-01-05 20:55:00'), (1,'2019-01-06 21:58:00'), (2,'2019-02-01 19:25:00'), (2,'2019-02-02 21:00:00'), (2,'2019-02-04 22:05:00'), (2,'2019-02-05 20:59:00'), (2,'2019-02-06 19:05:00'), (3,'2019-03-04 21:05:00'), (3,'2019-03-05 19:10:00'), (3,'2019-03-06 19:55:00'), (3,'2019-03-07 21:05:00');
方式一:间断的那一天,构造array数组,利用炸裂函数进行补全,然后按照“用户连续登陆”的思路来做。
select id, max(cnt) as days from ( select id, diff, count(1) as cnt from ( select id, login_date, next_login_date, arr, new_login_date, date_sub(new_login_date, row_number() over (partition by id order by new_login_date)) diff from ( select id, login_date, next_login_date, arr, new_login_date from ( select id, login_date, next_login_date, --间断的那一天,构造array数组,利用炸裂函数进行补全 if( datediff(next_login_date, login_date) = 2, array(login_date, date_add(login_date, 1)), array(login_date) ) as arr from ( select id, login_date, --窗口函数 lead(向后取n行) --lead(column1,n,val)over(partition by column2 order by column3) 查询当前行的后边第n行数据,如果没有就为null lead(login_date, 1, '9999-12-31') over (partition by id order by login_date) next_login_date from (--用户在同一天可能登录多次,需要去重 select id, date_format(`date`, 'yyyy-MM-dd') as login_date from table1 group by id, date_format(`date`, 'yyyy-MM-dd') ) tmp1 ) tmp2 ) tmp3 lateral view explode(arr) tmp as new_login_date ) tmp4 ) tmp5 group by id, diff ) tmp6 group by id;
方式二:对用户多段stage的连续登陆进行划分,思路类似:会话划分
select id, max(diff) as days from ( select id, stage, datediff(max(login_date), min(login_date)) + 1 as diff from ( select id, login_date, -- 思路类似:会话划分,字符串拼接得到stage concat(id, '-', sum(start_point) over (partition by id order by login_date rows between unbounded preceding and current row )) stage from ( select id, login_date, --间隔一天也算连续,所以差值大于2的数据打上标签 if(datediff(login_date, last_login_date) > 2, 1, 0) start_point from ( select id, login_date, --窗口函数 lag(向前取n行) --lag(column1,n,val)over(partition by column2 order by column3) 查询当前行的前边第n行数据,如果没有就为null lag(login_date, 1, '1970-01-01') over (partition by id order by login_date) as last_login_date from ( select id, date_format(`date`, 'yyyy-MM-dd') as login_date from table1 group by id, date_format(`date`, 'yyyy-MM-dd') ) tmp1 ) tmp2 ) tmp3 ) tmp4 group by id, stage ) tmp5 group by id;
“间断连续”类型的解题思路:
(1)构造array数组;
(2)炸裂函数+ 侧写视图 : lateral view +explode将一行变多行,补全间断的那几天
(3)补全后之后就按照“连续登陆”问题的解决思路进行处理