Hive query generating identifiers for a sequence of rows matching a condition

执笔经年 2021-01-23 11:54

Let's say I have the following Hive table as input; let's call it connections:

userid  | timestamp   
--------|-------------
1       | 1433258019          


        
3 Answers
  • 2021-01-23 11:57

    Interesting question. Per your comment to @Madhu, I added the row 2 1433258172 to your example. What you need is a counter that increments every time timediff > 60 holds. The easiest way to do this is to flag those rows and then take a cumulative sum of the flag over a window.

    Query:

    select userid
      , timestamp
      , concat('user', userid, '-session-', s_sum) sessionid
    from (
      select *
        , sum( counter ) over (partition by userid
                               order by timestamp asc
                               rows between unbounded preceding and current row) s_sum
      from (
        select *
          , case when timediff > 60 then 1 else 0 end as counter
        from (
          select userid
            , timestamp
            -- default of 0 makes each user's first row exceed 60, so numbering starts at 1
            , timestamp - lag(timestamp, 1, 0) over (partition by userid
                                                     order by timestamp asc) timediff
          from connections ) x ) y ) z
    

    Output:

    1   1433258019  user1-session-1
    1   1433258020  user1-session-1
    2   1433258080  user2-session-1
    2   1433258083  user2-session-1
    2   1433258088  user2-session-1
    2   1433258170  user2-session-2
    2   1433258172  user2-session-2
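
The flag-and-cumulative-sum idea can be checked outside Hive. Below is a minimal sketch in Python (not HiveQL) that mirrors the query's logic on the rows shown in the output above; the function name assign_sessions and the constant GAP_SECONDS are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

GAP_SECONDS = 60  # threshold from the query: a gap above this starts a new session


def assign_sessions(rows, gap=GAP_SECONDS):
    """rows: iterable of (userid, timestamp); returns (userid, timestamp, sessionid)."""
    out = []
    ordered = sorted(rows)  # plays the role of partition by userid order by timestamp
    for userid, group in groupby(ordered, key=itemgetter(0)):
        prev = 0     # mirrors lag(timestamp, 1, 0): the first row always exceeds the gap
        session = 0  # running sum of the 0/1 flag
        for _, ts in group:
            if ts - prev > gap:  # the "counter" flag from the query
                session += 1
            out.append((userid, ts, f"user{userid}-session-{session}"))
            prev = ts
    return out


connections = [
    (1, 1433258019), (1, 1433258020),
    (2, 1433258080), (2, 1433258083), (2, 1433258088),
    (2, 1433258170), (2, 1433258172),
]
result = assign_sessions(connections)
for row in result:
    print(*row)
```

On this input the sketch yields the same session ids as the Hive output: the 82-second gap before 1433258170 is the only place where user 2's counter increases.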
    
  • 2021-01-23 12:06

    Use concat_ws, for example: select concat_ws('-', name, city) from employee; The first parameter of concat_ws is the separator; name and city are columns of the employee table. Make sure they are of type string. You can look here for more.

  • 2021-01-23 12:18

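    This works on the sample data. Note that the session number is computed from each row's own gap (timediff / 60), not from a running count: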

    SELECT 
      userid,
      timestamp,
      timediff,
      CONCAT(
        'user',
         userid,
         '-',
         'session-',
         CAST(timediff / 60 AS INT) + 1
      ) AS session_id
      FROM (
        SELECT   
          userid,
          timestamp,
          timestamp - LAG(timestamp, 1, timestamp) OVER w AS timediff
        FROM connections
        WINDOW w AS (
          PARTITION BY userid
          ORDER BY timestamp ASC
        )
    ) a;
    

    OUTPUT:

    userid  timestamp   timediff    session_id
    1       1433258019  0.0         user1-session-1
    1       1433258020  1.0         user1-session-1
    2       1433258080  0.0         user2-session-1
    2       1433258083  3.0         user2-session-1
    2       1433258088  5.0         user2-session-1
    2       1433258170  82.0        user2-session-2
    3       1433258270  0.0         user3-session-1
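
One way to sanity-check the CAST(timediff / 60 AS INT) + 1 formula is to compare it with a running count on a slightly longer input. The sketch below is Python rather than HiveQL, and the timestamps in sample are hypothetical, chosen so that one user has three sessions; the function names are made up for illustration.

```python
GAP_SECONDS = 60


def per_row_formula(timestamps):
    """Session number as int(timediff / 60) + 1, using only each row's own gap."""
    out, prev = [], None
    for ts in timestamps:
        # LAG(timestamp, 1, timestamp) gives a timediff of 0 on the first row
        timediff = 0 if prev is None else ts - prev
        out.append(int(timediff / GAP_SECONDS) + 1)
        prev = ts
    return out


def running_count(timestamps):
    """Session number that increases by one whenever the gap exceeds 60 seconds."""
    out, prev, session = [], None, 1
    for ts in timestamps:
        if prev is not None and ts - prev > GAP_SECONDS:
            session += 1
        out.append(session)
        prev = ts
    return out


# Hypothetical timestamps for a single user, already sorted
sample = [1000, 1005, 1100, 1103, 1300]
formula_ids = per_row_formula(sample)
running_ids = running_count(sample)
print(formula_ids)  # [1, 1, 2, 1, 4]
print(running_ids)  # [1, 1, 2, 2, 3]
```

The two agree only while each session contains a single row after the gap: the row at 1103 falls back to 1 under the per-row formula, and the 197-second gap before 1300 jumps to 4 instead of 3.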
    

    You can try something like this if timediff is not required:

    select userid,timestamp ,session_count+ concat('user',userid,'-','session-',cast(LAG(session_count-1,1,0) over w1 as string)) AS session_state
    --LAG(session_count-1,1,0) over w1 AS session_count_new FROM (select userid, timestamp, timediff, cast (timediff/60 as int)+1 as session_count
