How do I get consistent values with influxdb non_negative_derivative?

难免孤独 2021-02-01 07:23

Using grafana with influxdb, I am trying to show the per-second rate of some value that is a counter. If I use the non_negative_derivative(1s) function, the value

3 answers
  •  生来不讨喜
    2021-02-01 08:03

    @Michael-Desa gives an excellent explanation.

    I'd like to augment that answer with a solution for a metric our company cares about a lot: what is the maximum "operations per second" value of a specific measurement field?

    I will use a real-life example from our company.

    Scenario Background

    We send a lot of data from an RDBMS to redis. When transferring that data, we keep track of 5 counters:

    1. TipTrgUp -> Updates by a business trigger (stored procedure)
    2. TipTrgRm -> Removes by a business trigger (stored procedure)
    3. TipRprUp -> Updates by an unattended auto-repair batch process
    4. TipRprRm -> Removes by an unattended auto-repair batch process
    5. TipDmpUp -> Updates by a bulk-dump process

    We made a metrics collector that sends the current state of these counters to InfluxDB, with an interval of 1 second (configurable).

    Grafana graph 1: low resolution, no true max ops

    The Grafana query below is useful, but it does not show the true max ops when zoomed out. We know what the true value should be: around 500 ops on a normal business day, when no special dumps or maintenance are taking place, and into the thousands otherwise.

    SELECT
        non_negative_derivative(max(TipTrgUp),1s) AS "update/TipTrgUp"
       ,non_negative_derivative(max(TipTrgRm),1s) AS "remove/TipTrgRm"
       ,non_negative_derivative(max(TipRprUp),1s) AS "autorepair-up/TipRprUp"
       ,non_negative_derivative(max(TipRprRm),1s) AS "autorepair-rm/TipRprRm"
       ,non_negative_derivative(max(TipDmpUp),1s) AS "dump/TipDmpUp"
    FROM "$rp"."redis_flux_-transid-d-s"
    WHERE
        host =~ /$server$/
        AND $timeFilter
    GROUP BY time($interval),* fill(null)
    

    Sidenotes: $rp is the name of the retention policy, templated in Grafana. We use CQs to downsample into retention policies with a longer duration. Also note the 1s unit passed to non_negative_derivative(): it is needed because, when the function is combined with GROUP BY time(), the unit defaults to the GROUP BY interval rather than to 1s. This is easily overlooked in the InfluxDB documentation.
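
    To make the point about the unit argument concrete, here is a minimal sketch of two separate Grafana queries against the same measurement. The fixed 1m bucket, the single field, and the aliases are chosen only for illustration, and the described default is my reading of the InfluxQL documentation:

    ```sql
    -- Query 1: no unit. With GROUP BY time(1m), the unit defaults to the
    -- 1m interval, so each point is the increase per minute.
    SELECT non_negative_derivative(max(TipTrgUp)) AS "per-minute"
    FROM "$rp"."redis_flux_-transid-d-s"
    WHERE host =~ /$server$/ AND $timeFilter
    GROUP BY time(1m),* fill(null)

    -- Query 2: explicit 1s unit. The same per-bucket difference,
    -- scaled to a per-second rate.
    SELECT non_negative_derivative(max(TipTrgUp),1s) AS "per-second"
    FROM "$rp"."redis_flux_-transid-d-s"
    WHERE host =~ /$server$/ AND $timeFilter
    GROUP BY time(1m),* fill(null)
    ```

    Either way, the result is an average over the bucket. A burst of 600 updates within a single second of an otherwise idle minute shows up as 600 / 60 = 10 ops per second, which is why a zoomed-out graph hides the true peak.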

    Viewed over 24 hours, the graph looks like this:

    If we simply use a resolution of 1s (as suggested by @Michael-Desa), an enormous amount of data is transferred from InfluxDB to the client. It works, but it takes about 10 seconds, which is too slow for us.

    Grafana graph 2: low and high resolution, true max ops, slow performance

    We can, however, use subqueries to add the true max ops to this graph, which is a slight improvement. Far less data is transferred to the client, but the InfluxDB server has to do a lot of number crunching. Series B (with maxops prepended to the aliases):

    SELECT
        max(subTipTrgUp) AS maxopsTipTrgUp
       ,max(subTipTrgRm) AS maxopsTipTrgRm
       ,max(subTipRprUp) AS maxopsTipRprUp
       ,max(subTipRprRm) AS maxopsTipRprRm
       ,max(subTipDmpUp) AS maxopsTipDmpUp
    FROM (
        SELECT
            non_negative_derivative(max(TipTrgUp),1s) AS subTipTrgUp
           ,non_negative_derivative(max(TipTrgRm),1s) AS subTipTrgRm
           ,non_negative_derivative(max(TipRprUp),1s) AS subTipRprUp
           ,non_negative_derivative(max(TipRprRm),1s) AS subTipRprRm
           ,non_negative_derivative(max(TipDmpUp),1s) AS subTipDmpUp
        FROM "$rp"."redis_flux_-transid-d-s"
        WHERE
            host =~ /$server$/
            AND $timeFilter
        GROUP BY time(1s),* fill(null)
    )
    WHERE $timeFilter
    GROUP BY time($interval),* fill(null)
    

    Gives:

    Grafana graph 3: low and high resolution, true max ops, high performance, pre-calculate by CQ

    Our final solution for this kind of metric is to use a Continuous Query to pre-calculate the true max ops. We only do this when we need a live view; the subquery approach works fine for ad-hoc graphs. We generate CQs like this:

    CREATE CONTINUOUS QUERY "redis_flux_-transid-d-s.maxops.1s"
    ON telegraf
    BEGIN
        SELECT
            non_negative_derivative(max(TipTrgUp),1s) AS TipTrgUp
           ,non_negative_derivative(max(TipTrgRm),1s) AS TipTrgRm
           ,non_negative_derivative(max(TipRprUp),1s) AS TipRprUp
           ,non_negative_derivative(max(TipRprRm),1s) AS TipRprRm
           ,non_negative_derivative(max(TipDmpUp),1s) AS TipDmpUp
        INTO telegraf.A."redis_flux_-transid-d-s.maxops"
        FROM telegraf.A."redis_flux_-transid-d-s"
        GROUP BY time(1s),*
    END
    

    From here on, it's trivial to use these maxops measurements in grafana. When downsampling to an RP with longer retention, we again use max() as the selector function.
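
    A downsampling CQ along those lines might look like the sketch below. The target retention policy B, the 1m interval, and the CQ name are placeholders for illustration, not our actual configuration:

    ```sql
    -- Hypothetical downsampling CQ: keep the highest per-second rate
    -- seen in each 1m bucket when copying from RP "A" to RP "B".
    CREATE CONTINUOUS QUERY "redis_flux_-transid-d-s.maxops.1m"
    ON telegraf
    BEGIN
        SELECT
            max(TipTrgUp) AS TipTrgUp
           ,max(TipTrgRm) AS TipTrgRm
           ,max(TipRprUp) AS TipRprUp
           ,max(TipRprRm) AS TipRprRm
           ,max(TipDmpUp) AS TipDmpUp
        INTO telegraf.B."redis_flux_-transid-d-s.maxops"
        FROM telegraf.A."redis_flux_-transid-d-s.maxops"
        GROUP BY time(1m),*
    END
    ```

    Using max() rather than mean() here is what preserves the peak: each downsampled point still carries the highest 1s rate observed in its bucket.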

    Series B (with .maxops appended in the aliases)

    SELECT
        max(TipTrgUp) AS "update/TipTrgUp.maxops"
       ,max(TipTrgRm) AS "remove/TipTrgRm.maxops"
       ,max(TipRprUp) AS "autorepair-up/TipRprUp.maxops"
       ,max(TipRprRm) AS "autorepair-rm/TipRprRm.maxops"
       ,max(TipDmpUp) AS "dump/TipDmpUp.maxops"
    FROM "$rp"."redis_flux_-transid-d-s.maxops"
    WHERE
        host =~ /$server$/
        AND $timeFilter
    GROUP BY time($interval),* fill(null)
    

    Gives:

    When zoomed in to 1s precision, you can see that the graphs become identical:

    Hope this helps, TW
