How to make Prometheus alert description give both ratio and absolute numbers?

痴心易碎 submitted on 2019-12-01 10:28:23

Question


I currently have a Prometheus alert that fires when my success rate drops below 85%.

I would like to add the absolute numbers of the ratio to the alert description. How do I do that?

My YAML currently looks like this (I cleaned up some extraneous details):

groups:
  - name: recording_rules
    rules:
    - record: number_of_successes_24h
      expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d",status=~"success"}))
    - record: number_of_total_24h
      expr: avg(sum by(instance)(my_status{kubernetes_name="my-prom",timeRange="1d"}))
    - record: success_rate_24h
      expr: clamp_max(number_of_successes_24h / number_of_total_24h * 100, 100)

  - name: alerting_rules
    rules:
    - alert: LowSuccessRate24H
      expr: success_rate_24h < 85
      labels:
        severity: critical
      annotations:
        summary: "CRITICAL: Low success rate 24h"
        description: "Success rate in the last 24 hours went below 85% (value: {{ $value }}%)"

My question is, how do I add the number_of_successes_24h and number_of_total_24h into the description?
I read the official documentation at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/, but I got lost; I searched SO, but I didn't find anything relevant.

I read that there were extra details available in $labels, so I tried printing that as an example to see what was in it, but I got map[__name__:success_rate_24h], and I couldn't figure out how to see inside that.

Partial answers and guides welcome. Thanks.


Answer 1:


Here's a simplified version of my TasksMissing alert, which outputs the number of tasks missing, the total number of tasks and the affected instances in the summary:

  - alert: TasksMissing
    expr: |
      job_env:up:ratio < .7
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description:
       '{{ with printf `job_env:up:count{job="%s",env="%s"} - job_env:up:sum{job="%s",env="%s"}` $labels.job $labels.env $labels.job $labels.env | query }}
          {{- . | first | value -}}
        {{ end }}
        of
        {{ with printf `job_env:up:count{job="%s",env="%s"}` $labels.job $labels.env | query }}
          {{- . | first | value -}}
        {{ end }}
        {{ $labels.job }} instances are missing in {{ $labels.env }}:
        {{ range printf `up{job="%s",env="%s"}==0` $labels.job $labels.env | query }}
          {{- .Labels.instance }}
        {{ end }}'

The resulting description is expected to read something like "2 of 3 foo-service instances are missing in prod: foo01.prod.foo.org:8080 foo02.prod.foo.org:8080".
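The alert relies on three recorded series (job_env:up:count, job_env:up:sum and job_env:up:ratio) whose definitions are not shown here. As a rough sketch only, they could be produced by recording rules along these lines; the group name and the expressions are assumptions for illustration, not the actual rules behind the alert:

```yaml
groups:
  - name: up_recording_rules   # hypothetical group name
    rules:
      # Assumed: number of scrape targets per job and env
      - record: job_env:up:count
        expr: count by (job, env) (up)
      # Assumed: number of targets currently up (up == 1) per job and env
      - record: job_env:up:sum
        expr: sum by (job, env) (up)
      # Assumed: fraction of targets that are up
      - record: job_env:up:ratio
        expr: job_env:up:sum / job_env:up:count
```

With definitions like these, count minus sum gives the number of missing instances, which is what the first query in the description computes.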

The idea is that you use Go templates to generate a query (by populating a template with values from $labels using printf) and then pipe that into the Prometheus-defined query function and get back either one result (that you can handle using with) or multiple values (that you can iterate over using range). Then you can print either the timeseries value directly or some label (e.g. the instance name).
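Applied to the rules in the question, the same technique might look like the sketch below. It assumes that number_of_successes_24h and number_of_total_24h each return a single series without identifying labels (they are aggregated with avg), so the query strings can be written literally and printf is not needed:

```yaml
- alert: LowSuccessRate24H
  expr: success_rate_24h < 85
  labels:
    severity: critical
  annotations:
    summary: "CRITICAL: Low success rate 24h"
    # Folded scalar: the lines below are joined with single spaces.
    # Each `with` block runs one extra query and prints the value of
    # its first (and, by assumption, only) result.
    description: >-
      Success rate in the last 24 hours went below 85%
      (value: {{ $value }}%,
      {{ with query "number_of_successes_24h" }}{{ . | first | value }}{{ end }}
      successes out of
      {{ with query "number_of_total_24h" }}{{ . | first | value }}{{ end }}
      total)
```

If a query returns no series, its `with` block renders nothing, so the sentence may end up with a gap. The extra queries run when the annotation is rendered, so the absolute numbers may differ slightly from the values that produced the recorded ratio.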



Source: https://stackoverflow.com/questions/56654598/how-to-make-prometheus-alert-description-give-both-ratio-and-absolute-numbers
