increase() in Prometheus sometimes doubles values: how to avoid?

花落未央 2021-02-05 07:51

I've found that for some graphs I get doubled values from Prometheus where there should be just ones:

The query I use:

increase(signups_count[4m])
2 Answers

孤街浪徒 2021-02-05 07:59

    increase() will always (approximately) double the actual increase with your setup.

    The reason, as currently implemented, is twofold:

    1. increase() is (as you observed) syntactic sugar for rate(): it is the value that rate() would return, multiplied by the number of seconds in the range you specified. In your case, that is rate() * 240.
    2. rate() uses extrapolation in its computation. In the vast majority of cases a 4-minute range will contain exactly 2 data points, almost exactly 2 minutes apart. The rate is then computed as the difference between the last and first points (the only 2 points in your case) divided by their time difference (around 120 seconds in 99.99% of cases), and increase() is that rate multiplied by the range you requested (exactly 240 seconds). So if the increase between the 2 points is zero, the rate is zero. If the increase between the 2 points is 1.0, the computed rate() will be close to 2.0 / 240 and, as a result, increase() will be 2.0. (See the worked sketch after this list.)

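    Here is a minimal sketch in Python of the simplified model from point 2. It is an illustration, not the actual Prometheus source (the real implementation also limits how far it extrapolates past the first and last samples); the helper name and sample values are made up for the example:

    # Simplified model: scale the delta between the first and last samples
    # in the window up to the full requested range (not the real Prometheus code).
    def extrapolated_increase(samples, range_seconds):
        # samples: list of (timestamp_seconds, counter_value) inside the window
        t_first, v_first = samples[0]
        t_last, v_last = samples[-1]
        raw_delta = v_last - v_first   # counter increase actually observed
        covered = t_last - t_first     # time span the samples actually cover
        return raw_delta / covered * range_seconds

    # Two scrapes 120 s apart inside a 4 m (240 s) range, one real signup:
    print(extrapolated_increase([(0, 10), (120, 11)], 240))  # -> 2.0, not 1.0
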
    This approach works mostly fine with counters that increase smoothly (e.g. if you have a more or less fixed number of signups every 2 minutes). But with a counter that rarely increases (as does your signups counter) or a spiky counter (like CPU usage) you get weird overestimates (like the increase of 2 you are seeing).

    You can essentially reverse-engineer Prometheus' implementation and get (something very close to) the actual increase by multiplying by (requested_range - scrape_interval) and dividing by requested_range, walking back the extrapolation that Prometheus does.

    In your case, this would mean

    increase(signups_count[4m]) * (240 - 120) / 240
    

    or, more succinctly,

    increase(signups_count[4m]) / 2
    

    It requires you to know both the length of the range and the scrape interval, but it will give you what you want: "ones for ones, and twos for twos, most of the time". Sometimes you'll get 1.01 instead of 1.0 because the scrapes were 119 rather than 120 seconds apart, and sometimes, if your evaluation time is closely aligned with the scrapes, a point right on the boundary may or may not be included in the calculation, but it's still a better answer than 2.0.
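    To see the arithmetic of the correction factor, here is a tiny self-contained check in Python, with the values taken from the 4 m range / 2 m scrape interval case above:

    # Undo the extrapolation by hand for a 240 s range with 120 s scrapes.
    range_s, scrape_s = 240, 120
    extrapolated = 2.0  # what increase() reports for a single real signup
    corrected = extrapolated * (range_s - scrape_s) / range_s
    print(corrected)    # -> 1.0, the actual increase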
