Question
I currently have a Prometheus alert that fires when my success rate drops below 85%.
I would like to add the absolute numbers of the ratio to the alert description. How do I do that?
My YAML currently looks like this (I cleaned up some extraneous details):
groups:
- name: recording_rules
  rules:
  - record: number_of_successes_24h
    expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d",status=~"success"}))
  - record: number_of_total_24h
    expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d"}))
  - record: success_rate_24h
    expr: clamp_max(number_of_successes_24h / number_of_total_24h * 100, 100)
- name: alerting_rules
  rules:
  - alert: LowSuccessRate24H
    expr: success_rate_24h < 85
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL: Low success rate 24h"
      description: "Success rate in the last 24 hours went below 85% (value: {{ $value }}%)"
My question is, how do I add number_of_successes_24h and number_of_total_24h to the description?
I read the official documentation at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/, but I got lost; I searched SO, but I didn't find anything relevant.
I read that there were extra details available in $labels, so I tried printing that as an example to see what was in it, but I got map[__name__:success_rate_24h], and I couldn't figure out how to see inside that.
Partial answers and guides welcome. Thanks.
Answer 1:
Here's a simplified version of my TasksMissing alert, which outputs the number of tasks missing, the total number of tasks, and the affected instances in the description:
- alert: TasksMissing
  expr: |
    job_env:up:ratio < .7
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
    description:
      '{{ with printf `job_env:up:count{job="%s",env="%s"} - job_env:up:sum{job="%s",env="%s"}` $labels.job $labels.env $labels.job $labels.env | query }}
        {{- . | first | value -}}
      {{ end }}
      of
      {{ with printf `job_env:up:count{job="%s",env="%s"}` $labels.job $labels.env | query }}
        {{- . | first | value -}}
      {{ end }}
      {{ $labels.job }} instances are missing in {{ $labels.env }}:
      {{ range printf `up{job="%s",env="%s"}==0` $labels.job $labels.env | query }}
        {{- .Labels.instance }}
      {{ end }}'
The resulting description is expected to read something like "2 of 3 foo-service instances are missing in prod: foo01.prod.foo.org:8080 foo02.prod.foo.org:8080".
The idea is that you use Go templates to generate a query (by populating a template with values from $labels using printf), then pipe that into the Prometheus-defined query function and get back either one result (which you can handle using with) or multiple results (which you can iterate over using range). Then you can print either the time series value directly or some label (e.g. the instance name).
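Applied to the alert in the question, the same technique might look like the sketch below. It is untested and assumes the two recording rules from the question each yield a single series without distinguishing labels (the outer avg() aggregates the labels away, which is also why $labels only contains __name__), so no printf or label matchers are needed and the metric name can be passed to query directly.

```yaml
- alert: LowSuccessRate24H
  expr: success_rate_24h < 85
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Low success rate 24h"
    # Each "with" block runs a separate instant query when the annotation is
    # rendered; "first | value" takes the value of the first returned sample.
    description: >-
      Success rate in the last 24 hours went below 85%
      (value: {{ $value }}%,
      {{ with query "number_of_successes_24h" }}{{ . | first | value }}{{ end }} successes out of
      {{ with query "number_of_total_24h" }}{{ . | first | value }}{{ end }} total)
```

If a query returns no samples, the corresponding with block renders nothing, so the number is simply missing from the text rather than causing a template error.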
Source: https://stackoverflow.com/questions/56654598/how-to-make-prometheus-alert-description-give-both-ratio-and-absolute-numbers