Question
I am looking for a way to compare the total event count for the current one-hour interval with the total event count for the previous one-hour interval, and, if the current hour's count is lower than the previous hour's, have Riemann trigger an email.
I am not sure whether we can store the value and compare it with the current event value, because I learned that events expire due to the TTL option in Riemann.
Please correct me if I am wrong, and suggest some reference code to achieve this in Riemann.
Thanks in advance.
Answer 1:
It sounds like you want the rate of change of the count over an hour and then to decide if that rate is negative? One way to do this is just as you describe:
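;; email is assumed to be a mailer defined elsewhere in riemann.config, e.g.
;; (def email (mailer {:from "riemann@example.com"})) with a hypothetical sender
(fold-interval-metric 3600 folds/count    ;; fold each hour's events with folds/count
  (fixed-event-window 2                   ;; collect the hourly results two at a time
    (smap folds/difference                ;; difference between the two hourly counts
      (where (neg? (:metric event))
        (email "ops@example.com")))))     ;; hypothetical recipient address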
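The same hour-over-hour comparison can be sketched outside Riemann. This is a standalone sketch in plain clojure.core, not Riemann's stream API; the function names hourly-counts and hour-drops, and the use of Unix-second timestamps as input, are assumptions for illustration. It buckets timestamps into hours, counts each bucket, and flags every hour whose count is lower than the previous hour's (current minus previous is negative).

```clojure
;; Standalone sketch in plain Clojure (no Riemann APIs): compare each
;; hour's event count with the previous hour's and report the drops.

(defn hourly-counts
  "Buckets event timestamps (Unix seconds) into one-hour intervals.
  Returns a vector of [hour-index count] pairs, oldest first, with
  hours that saw no events filled in as 0."
  [timestamps]
  (let [freqs (frequencies (map #(quot % 3600) timestamps))]
    (if (empty? freqs)
      []
      (let [hours (keys freqs)]
        (vec (for [h (range (apply min hours) (inc (apply max hours)))]
               [h (get freqs h 0)]))))))

(defn hour-drops
  "Returns [hour previous-count current-count] for every hour whose
  count is lower than the count of the hour before it."
  [counts]
  (for [[[_ prev] [hour cur]] (partition 2 1 counts)
        :when (neg? (- cur prev))]  ; current minus previous is negative
    [hour prev cur]))

;; Example: 3 events in hour 0, 5 in hour 1, 2 in hour 2, none in hour 3,
;; 1 in hour 4. Hours 2 and 3 are reported as drops.
(hour-drops (hourly-counts [0 10 20 3600 3601 3602 3603 3604 7200 7300 14400]))
;; => ([2 5 2] [3 2 0])
```

Filling empty hours with 0 matters in this sketch: without it, an hour in which nothing arrived at all would simply be missing from the data and never flagged.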
This hour-over-hour approach makes sense. You may also find that if you use ddt, Riemann's built-in derivative-over-time function, and graph its output, you can spot these problems over much shorter timescales. If your success rate falls to zero in minute three of an hour, 57 minutes is a long time for the computer to wait before it calls a human for help. If the rate of change over a 15-minute period approaches negative infinity, it is very likely that your service just stopped.
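To show what a derivative over time measures, here is a standalone sketch in plain clojure.core. It is not Riemann's ddt implementation; the rates function and the {:time :metric} sample shape are assumptions for illustration. Each output is the change in the metric divided by the seconds elapsed between two consecutive samples.

```clojure
;; Standalone sketch in plain Clojure (not Riemann's ddt): the discrete
;; derivative of a metric with respect to time.

(defn rates
  "Given samples sorted by :time (seconds), returns one sample per
  consecutive pair whose :metric is the change in metric per second."
  [samples]
  (for [[a b] (partition 2 1 samples)
        :let [dt (- (:time b) (:time a))]
        :when (pos? dt)]  ; skip pairs that share a timestamp
    {:time   (:time b)
     :metric (/ (double (- (:metric b) (:metric a))) dt)}))

;; A request counter that climbs by 120 per minute and then stalls:
(def counter [{:time 0 :metric 0} {:time 60 :metric 120}
              {:time 120 :metric 240} {:time 180 :metric 240}])

(map :metric (rates counter))
;; => (2.0 2.0 0.0)
```

In this sketch the stalled counter shows up as a rate of 0.0 one minute after it stops, rather than at the end of an hourly window.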
I'm fond of wrapping ddt in ewma, the exponentially weighted moving average, so that spikes don't set off the alarms, and I have had an extremely low false-positive rate with this pattern:
(ewma 30 (ddt ...your stuff here...))
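The smoothing effect can also be sketched in plain Clojure. This is a simplified per-event EWMA with a hypothetical weight r (0.5 here) and a hypothetical function name ewma-seq; it is not Riemann's ewma, which also takes the time between events into account. Each new value gets weight r and the running average keeps weight 1 - r.

```clojure
;; Standalone sketch in plain Clojure: a simplified per-event ("timeless")
;; exponentially weighted moving average. Unlike this sketch, Riemann's
;; ewma also takes the time between events into account.

(defn ewma-seq
  "Returns the running average r * x + (1 - r) * previous-average for
  each value in xs, seeded with the first value. Returns nil for empty xs."
  [r xs]
  (when (seq xs)
    (reductions (fn [avg x] (+ (* r x) (* (- 1 r) avg)))
                (double (first xs))
                (rest xs))))

;; A single-sample spike of 100 in an otherwise flat rate series:
(ewma-seq 0.5 [0 0 100 0 0])
;; => (0.0 0.0 50.0 25.0 12.5)
```

With an alert threshold of, say, 60 (a hypothetical value), the raw spike would fire, while the smoothed series peaks at 50.0 and does not.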
I often want to compare the rate of requests to a service with the rate of its responses, using this pattern, which combines ewma, ddt, and project:
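(pipe ↲                              ;; ↲ marks where each stage feeds the next
  (splitp = service
    "service:input"  (ewma 30 ↲)     ;; smooth the input service's metric
    "service:output" (ewma 30 ↲)     ;; smooth the output service's metric
    bit-bucket)                      ;; throw out other services here
  (project [(service "service:input")
            (service "service:output")]
    (smap folds/quotient-sloppy      ;; combine the two metrics into a ratio
      (with :service "service-ratio-rate-of-change"
        (ddt ...your streams here...)))))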
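The request/response comparison can be sketched in plain Clojure as the ratio of two already-smoothed rates plus its change between samples. This is only an illustration: rate-ratios and ratio-changes are hypothetical names, the two rate series are assumed to be aligned sample by sample (in the Riemann pattern, project is what pairs the two services' latest events), and returning nil for a zero output rate is a choice made here, not necessarily what folds/quotient-sloppy does.

```clojure
;; Standalone sketch in plain Clojure: the ratio of input rate to output
;; rate, and how that ratio changes from one sample to the next.

(defn rate-ratios
  "Pairs input and output rates sample by sample and returns input/output.
  A zero output rate yields nil (a choice made for this sketch)."
  [input-rates output-rates]
  (map (fn [in out]
         (when-not (zero? out)
           (/ (double in) out)))
       input-rates output-rates))

(defn ratio-changes
  "Change in the ratio between consecutive samples, skipping pairs where
  either ratio is nil."
  [ratios]
  (for [[a b] (partition 2 1 ratios)
        :when (and (some? a) (some? b))]
    (- b a)))

;; Requests arrive at a steady 10/s while responses fall from 10/s to 0:
(def ratios (rate-ratios [10 10 10 10] [10 10 5 0]))
ratios                 ;; => (1.0 1.0 2.0 nil)
(ratio-changes ratios) ;; => (0.0 1.0)
```

A rising ratio here means responses are falling behind requests, which is the kind of change this pattern is meant to surface.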
If requests are infrequent, you will need to tune the interval in all of these examples so that the alarms don't go off between events. If your events are infrequent, you may also need to set the :ttl on the events high enough that they don't expire while you are aggregating them.
P.S.: The ↲ can be any symbol you want; I just chose that Unicode character.
P.P.S.: A false-positive rate of one alarm per quarter should be a reasonable target if you consider these things carefully.
Source: https://stackoverflow.com/questions/39019835/how-to-compare-event-count-value-with-previous-time-interval-event