How to group by time bucket in ClickHouse and fill missing data with nulls/0s

前端 未结 4 1269
迷失自我
迷失自我 2021-01-12 04:15

Suppose I have a given time range. For explanation, let\'s consider something simple, like whole year 2018. I want to query data from ClickHouse as a sum aggregation for eac

相关标签:
4条回答
  • 2021-01-12 04:26

    As an alternative for numbers() function in some cases range and array functions can be useful.

    Example: for each pair of (id1,id2) dates from the previous 7 days should be generated.

    SELECT
      id1,
      id2,
      arrayJoin(
        arrayMap( x -> today() - 7 + x, range(7) )
      ) as date2
    FROM table
    WHERE date >= now() - 7
    GROUP BY id1, id2
    

    The result of that select can be used in UNION ALL to fill the 'holes' in data.

    SELECT id1, id2, date, sum(column1)
    FROM (
      SELECT
        id1,
        id2,
        date,
        column1 
      FROM table
      WHERE date >= now() - 7
    
      UNION ALL 
    
      SELECT
        id1,
        id2,
        arrayJoin(
          arrayMap( x -> today() - 7 + x, range(7) )
        ) as date2,
        0 as column1
      FROM table
      WHERE date >= now() - 7
      GROUP BY id1, id2
    )
    GROUP BY id1, id2, date
    ORDER BY date, id1, id2
    
    0 讨论(0)
  • 2021-01-12 04:39

    Here is how I did it for hour buckets (needed to visualize this in Grafana) thanks to @filimonov and @mikhail

    SELECT t, SUM(metric) as metric FROM (
        SELECT 
            arrayJoin(
              arrayMap( x -> toStartOfHour(addHours(toDateTime($from),x)),
                  range(toUInt64(
                      dateDiff('hour', 
                          toDateTime($from), 
                          toDateTime($to)) + 1)))
            ) as t,
            0 as metric
    
        UNION ALL
    
        SELECT
            toStartOfHour(my_date) as t,
            COUNT(metric)
            FROM my_table
            WHERE t BETWEEN toDateTime($from) AND toDateTime($to)
            GROUP BY t
    )
    GROUP BY t ORDER BY t
    

    So, for example for range from 2019-01-01 to 2019-01-02 it will give you:

    SELECT t, SUM(metric) as metric FROM (
        SELECT 
            arrayJoin(
              arrayMap( x -> toStartOfHour(addHours(toDateTime('2019-01-01 00:00:00'),x)),
                  range(toUInt64(
                      dateDiff('hour', 
                          toDateTime('2019-01-01 00:00:00'), 
                          toDateTime('2019-01-02 00:00:00')) + 1)))
            ) as t,
            0 as metric
    
        UNION ALL
    
        SELECT
            toStartOfHour(my_date) as t,
            COUNT(1) as metric
            FROM my_table
            WHERE t BETWEEN toDateTime('2019-01-01 00:00:00') AND toDateTime('2019-01-02 00:00:00')
            GROUP BY t
    )
    GROUP BY t ORDER BY t;
    
    t                  |metric|
    -------------------|------|
    2019-01-01 00:00:00|     0|
    2019-01-01 01:00:00|     0|
    2019-01-01 02:00:00|     0|
    2019-01-01 03:00:00|     0|
    2019-01-01 04:00:00|     0|
    2019-01-01 05:00:00|     0|
    2019-01-01 06:00:00|     0|
    2019-01-01 07:00:00|105702|
    2019-01-01 08:00:00|113315|
    2019-01-01 09:00:00|149837|
    2019-01-01 10:00:00|185314|
    2019-01-01 11:00:00|246106|
    2019-01-01 12:00:00|323036|
    2019-01-01 13:00:00|     0|
    2019-01-01 14:00:00|409160|
    2019-01-01 15:00:00|379113|
    2019-01-01 16:00:00|256634|
    2019-01-01 17:00:00|286601|
    2019-01-01 18:00:00|280039|
    2019-01-01 19:00:00|248504|
    2019-01-01 20:00:00|218642|
    2019-01-01 21:00:00|186152|
    2019-01-01 22:00:00|148478|
    2019-01-01 23:00:00|109721|
    2019-01-02 00:00:00|     0|
    
    0 讨论(0)
  • 2021-01-12 04:43

    You can generate zero values using the "number" function. Then join your query and zero values using UNION ALL and already according to the obtained data we make a GROUP BY.

    So, your query will look like:

    SELECT SUM(metric),
           time
      FROM (
            SELECT toStartOfQuarter(toDate(1514761200+number*30*24*3600))  time,
                   toUInt16(0) AS metric
              FROM numbers(30)
    
         UNION ALL 
    
              SELECT toStartOfQuarter(created_at) AS time,
                   metric
              FROM mytable
             WHERE created_at >= toDate(1514761200)
               AND created_at >= toDateTime(1514761200)
               AND created_at <= toDate(1546210800)
               AND created_at <= toDateTime(1546210800)
           )
     GROUP BY time
     ORDER BY time
    

    note toUInt16(0) - zero values must be of the same type as metrics

    0 讨论(0)
  • 2021-01-12 04:51

    From ClickHouse 19.14 you can use the WITH FILL clause. It can fill quarters in this way:

    WITH
        (
            SELECT toRelativeQuarterNum(toDate('1970-01-01'))
        ) AS init
    SELECT
        -- build the date from the relative quarter number
        toDate('1970-01-01') + toIntervalQuarter(q - init) AS time,
        metric
    FROM
    (
        SELECT
            toRelativeQuarterNum(created_at) AS q,
            sum(rand()) AS metric
        FROM
        (
            -- generate some dates and metrics values with gaps
            SELECT toDate(arrayJoin(range(1514761200, 1546210800, ((60 * 60) * 24) * 180))) AS created_at
        )
        GROUP BY q
        ORDER BY q ASC WITH FILL FROM toRelativeQuarterNum(toDate(1514761200)) TO toRelativeQuarterNum(toDate(1546210800)) STEP 1
    )
    
    ┌───────time─┬─────metric─┐
    │ 2018-01-01 │ 2950782089 │
    │ 2018-04-01 │ 2972073797 │
    │ 2018-07-01 │          0 │
    │ 2018-10-01 │  179581958 │
    └────────────┴────────────┘
    
    0 讨论(0)
提交回复
热议问题