I currently have a Prometheus alert that fires when my success rate drops below 85%.
I would like to add the absolute numbers of the ratio to the alert description.
Here's a simplified version of my TasksMissing alert, which outputs the number of missing tasks, the total number of tasks and the affected instances in the description:
- alert: TasksMissing
  expr: |
    job_env:up:ratio < .7
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
    description:
      '{{ with printf `job_env:up:count{job="%s",env="%s"} - job_env:up:sum{job="%s",env="%s"}` $labels.job $labels.env $labels.job $labels.env | query }}
        {{- . | first | value -}}
      {{ end }}
      of
      {{ with printf `job_env:up:count{job="%s",env="%s"}` $labels.job $labels.env | query }}
        {{- . | first | value -}}
      {{ end }}
      {{ $labels.job }} instances are missing in {{ $labels.env }}:
      {{ range printf `up{job="%s",env="%s"}==0` $labels.job $labels.env | query }}
        {{- .Labels.instance }}
      {{ end }}'
The resulting description is expected to read something like "2 of 3 foo-service instances are missing in prod: foo01.prod.foo.org:8080 foo02.prod.foo.org:8080".
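The alert depends on three recorded series (job_env:up:count, job_env:up:sum and job_env:up:ratio) whose definitions are not shown here. As a rough sketch, they could be recorded like this. The expressions and the group name are my assumption, inferred from the series names, and they presume that every target carries an env label:

```yaml
groups:
  - name: up_aggregates   # hypothetical group name
    rules:
      # Total number of scraped instances per job and environment.
      - record: job_env:up:count
        expr: count by (job, env) (up)
      # Number of instances that are currently up (up == 1).
      - record: job_env:up:sum
        expr: sum by (job, env) (up)
      # Fraction of instances that are up, used in the alert expression.
      - record: job_env:up:ratio
        expr: job_env:up:sum / job_env:up:count
```

With definitions along these lines, count minus sum is the number of missing instances, which is what the first query in the description computes.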
The idea is that you use Go templates to generate a query (by populating a template with values from $labels using printf) and then pipe that into the Prometheus-defined query function and get back either one result (that you can handle using with) or multiple values (that you can iterate over using range). Then you can print either the time series value directly or some label (e.g. the instance name).
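Applied to a success-rate alert like the one in the question, the same pattern might look roughly like the sketch below. All metric names here (job:requests:success_ratio, job:requests_success:sum, job:requests:sum) and the alert name are hypothetical placeholders, so substitute the series behind your own ratio. The humanize function is one of the Prometheus template functions and only makes large numbers easier to read.

```yaml
# Sketch only: every metric name below is a hypothetical placeholder.
- alert: LowSuccessRate
  expr: job:requests:success_ratio < 0.85
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Low success rate for {{ $labels.job }}
    # Each "with" block is skipped when its query returns no series,
    # so "first" is never called on an empty result.
    description: >-
      {{ with printf `job:requests_success:sum{job="%s"}` $labels.job | query -}}
      {{ . | first | value | humanize }}
      {{- end }} of
      {{ with printf `job:requests:sum{job="%s"}` $labels.job | query -}}
      {{ . | first | value | humanize }}
      {{- end }} requests succeeded for {{ $labels.job }}.
```

Wrapping each query in with rather than calling first directly is a deliberate choice: if a series is temporarily absent, the number is simply omitted instead of the template failing to render.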