increase() in Prometheus sometimes doubles values: how to avoid?

花落未央 2021-02-05 07:51

I've found that for some graphs I get doubled values from Prometheus where there should just be ones.

The query I use:

increase(signups_count[4m])

2 Answers
  • 2021-02-05 07:59

    increase() will always (approximately) double the actual increase with your setup.

    The reason is that (as currently implemented):

    1. increase() is (as you observed) syntactic sugar for rate(), i.e. it is the value that would be returned by rate() multiplied by the number of seconds in the range you specified. In your case, it is rate() * 240.
    2. rate() uses extrapolation in its computation. In the vast majority of cases a 4 minute range will contain exactly 2 data points, almost exactly 2 minutes apart. The rate is computed as the difference between the last and first samples (the only 2 points in your case) divided by their time difference (around 120 seconds in 99.99% of cases), and that per-second rate is then extrapolated over the full range you requested (exactly 240 seconds). So if the increase between the 2 points is zero, the rate is zero. If the increase between the 2 points is 1.0, the computed rate() will be close to 2.0 / 240 and, as a result, increase() will be 2.0 (see the sketch after this list).
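
    A minimal Python sketch of that arithmetic (illustration only, with made-up sample values; Prometheus' real implementation also handles counter resets and limits extrapolation near the range boundaries):

        # Two hypothetical samples ~120 s apart inside a 4m (240 s) range.
        t0, v0 = 0.0, 10.0      # first sample in the window
        t1, v1 = 120.0, 11.0    # last sample in the window
        range_seconds = 240.0

        per_second_rate = (v1 - v0) / (t1 - t0)    # rate(): 1 / 120 per second
        increase = per_second_rate * range_seconds # increase(): rate() * 240

        print(increase)  # 2.0, although the counter only went up by 1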

    This approach works mostly fine with counters that increase smoothly (e.g. if you have a more or less fixed number of signups every 2 minutes). But with a counter that rarely increases (as does your signups counter) or a spiky counter (like CPU usage) you get weird overestimates (like the increase of 2 you are seeing).

    You can essentially reverse engineer Prometheus' implementation and get (something very close to) the actual increase by multiplying by (requested_range - scrape_interval) and dividing by requested_range, walking back the extrapolation that Prometheus does.

    In your case, this would mean

    increase(signups_count[4m]) * (240 - 120) / 240
    

    or, more succinctly,

    increase(signups_count[4m]) / 2
    

    It requires you to be aware of both the length of the range and the scrape interval, but it will give you what you want: "ones for ones, and twos for twos, most of the time". Sometimes you'll get 1.01 instead of 1.0 because the scrapes were 119 seconds apart rather than 120; and sometimes, if your evaluation is closely aligned with the scrapes, a point right on the boundary may or may not be included in the calculation. But it's still a better answer than 2.0.
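
    A quick check of those numbers (again just illustrative arithmetic, assuming a 240 s range and a nominal 120 s scrape interval):

        range_s, interval_s = 240.0, 120.0
        correction = (range_s - interval_s) / range_s  # 0.5 for this setup

        for gap in (120.0, 119.0):          # actual spacing of the 2 samples
            raw = (1.0 / gap) * range_s     # what increase() reports for a +1
            print(gap, raw * correction)    # 120 -> 1.0, 119 -> ~1.008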

  • 2021-02-05 08:17

    This is known as aliasing and is a fundamental problem in signal processing. You can improve this a bit by sampling over a longer range: a 4m range is a bit short with a 2m scrape interval, so try a 10m range.
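
    A rough way to see why the longer range helps (a sketch assuming perfectly regular 120 s scrapes; real extrapolation is also clamped near the window boundaries, so actual factors are slightly lower):

        scrape_interval = 120.0
        for range_s in (240.0, 600.0):           # 4m vs 10m range
            n = int(range_s // scrape_interval)  # samples typically in the window
            covered = (n - 1) * scrape_interval  # time span the samples cover
            print(range_s, range_s / covered)    # overestimation factor: 2.0 vs 1.25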

    For example, a query executed at 1515722220 only sees the 580@1515722085.194 and 581@1515722205.194 samples. That's an increase of 1 over 2 minutes, which extrapolated over 4 minutes is an increase of 2 - which is as expected.
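
    Plugging those two samples into the extrapolation arithmetic (an illustrative sketch, not Prometheus' exact code path):

        # The two samples quoted above, inside a 4m (240 s) range.
        t0, v0 = 1515722085.194, 580.0
        t1, v1 = 1515722205.194, 581.0

        increase_4m = (v1 - v0) / (t1 - t0) * 240  # (1 / 120 s) * 240 s
        print(increase_4m)                         # 2.0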

    Any metrics-based monitoring system will have similar artifacts; if you want 100% accuracy, you need logs.
