Monitoring and Alerting on Kafka Consumers with Zabbix


1. Environment

    RHEL 6.5

    Burrow version 1.1

LinkedIn's data-infrastructure Streaming SRE team actively develops Burrow. The software is written in Go, released under the Apache License, and hosted on GitHub.

It collects information about a cluster's consumer groups and computes a single status for each group, telling us whether the group is healthy, lagging, slowing down, or has stopped working altogether; this is how it monitors consumer state. It does not need thresholds on group progress to do this, although users can still obtain message lag counts from it.

    Kafka version 2.10

2. Installing Burrow

Burrow, a Kafka consumer lag monitoring tool

Burrow automatically monitors all consumers and every partition they consume. It does this by reading the special internal Kafka topic that stores consumer offsets. Burrow then provides the consumer information as a centralized service, separate from any individual consumer. Consumer status is determined by evaluating the consumer's behavior over a sliding window.

This information is broken down into a per-partition status and then rolled up into a single status for the consumer. The status can be OK, WARNING (the consumer is working but falling behind), or ERROR (the consumer has stopped consuming or is offline). The status can be fetched from Burrow with a simple HTTP request, or Burrow can check it periodically and push notifications out by email or to a separate HTTP endpoint (for example, a monitoring or alerting system).
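
For example, once Burrow is running (assuming the HTTP server listens on port 8000 and the cluster is named "local", as in the burrow.toml shown later in this article), a group's status is a single request away:

curl -s http://localhost:8000/v3/kafka/local/consumer/<group>/status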

Burrow can monitor how far consumers lag behind the messages they consume, and therefore the health of the application, and it can monitor several Kafka clusters at once. Its HTTP service for reporting information about Kafka clusters and consumers is separate from the lag-status evaluation, which is very useful for applications that manage a Kafka cluster in environments where a Java Kafka client cannot run.

Burrow is developed in Go, and version v1.1 has been released.
Burrow also provides a Docker image.

It can be installed either by compiling from source or by using the official binary package.

The binary package approach is recommended here.

Burrow keeps no local state; it is a CPU-intensive and network-I/O-intensive application.

[root@IIMAPP02 ~]# cd /opt/
[root@IIMAPP02 opt]# mkdir burrow
[root@IIMAPP02 opt]# cd burrow/
[root@IIMAPP02 burrow]# wget https://github.com/linkedin/Burrow/releases/download/v1.1.0/Burrow_1.1.0_linux_amd64.tar.gz
[root@IIMAPP02 burrow]# tar xvf Burrow_1.1.0_linux_amd64.tar.gz
[root@IIMAPP02 burrow]# ls
burrow                           burrow.out                      burrow.pid    config   NOTICE
Burrow_1.1.0_linux_amd64.tar.gz  burrow.out.2020-04-29_10:09:38  CHANGELOG.md  LICENSE  README.md

[root@IIMAPP02 burrow]# cp burrow /usr/bin/

[root@IIMAPP02 burrow]# mkdir /etc/burrow/

[root@IIMAPP02 burrow]# cp config/* /etc/burrow/

Edit the configuration. The default configuration file is /etc/burrow/burrow.toml.

The main items to change are the ZooKeeper and Kafka addresses, as well as the log and PID file paths.
[root@IIMAPP02 burrow]# cat /etc/burrow/burrow.toml 
[general]
pidfile="/var/run/burrow.pid"
stdout-logfile="burrow.out"
access-control-allow-origin="mysite.example.com"

[logging]
filename="/var/log/burrow.log"
level="info"
maxsize=100
maxbackups=30
maxage=10
use-localtime=false
use-compression=true

[zookeeper]
servers=[ "9.1.8.234:2181", "9.1.8.234:2182", "9.1.8.234:2183" ]
timeout=6
root-path="/burrow"

[client-profile.test]
client-id="burrow-test"
kafka-version="0.10.0"

[cluster.local]
class-name="kafka"
servers=[ "9.1.9.49:9092", "9.1.10.85:9092", "9.1.8.246:9092" ]
client-profile="test"
topic-refresh=120
offset-refresh=30

[consumer.local]
class-name="kafka"
cluster="local"
servers=[ "9.1.9.49:9092", "9.1.10.85:9092", "9.1.8.246:9092" ]
client-profile="test"
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
group-whitelist=""

[consumer.local_zk]
class-name="kafka_zk"
cluster="local"
servers=[ "9.1.8.234:2181", "9.1.8.234:2182", "9.1.8.234:2183" ]
zookeeper-path="/kafka-cluster"
zookeeper-timeout=30
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-).*$"
group-whitelist=""

[httpserver.default]
address=":8000"

[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1

#[notifier.default]
#class-name="http"
#url-open="http://someservice.example.com:1467/v1/event"
#interval=60
#timeout=5
#keepalive=30
#extras={ api_key="REDACTED", app="burrow", tier="STG", fabric="mydc" }
#template-open="conf/default-http-post.tmpl"
#template-close="conf/default-http-delete.tmpl"
#method-close="DELETE"
#send-close=true
#threshold=1
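
Before setting up the init script, it is worth running Burrow once in the foreground to confirm that the configuration parses and the brokers are reachable (this uses the same -config-dir flag the init script below passes; stop it with Ctrl-C):

[root@IIMAPP02 burrow]# burrow -config-dir /etc/burrow/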

Startup script:

[root@IIMAPP02 burrow]# cat /etc/init.d/burrow_script 
#!/bin/bash
#
# Comments to support chkconfig
# chkconfig: - 98 02
# description: Burrow is a Kafka consumer lag checking program by LinkedIn, Inc.
#
# Source function library.
. /etc/init.d/functions

### Default variables
prog_name="burrow"
prog_path="/usr/bin/${prog_name}"
pidfile="/var/run/${prog_name}.pid"
options="-config-dir /etc/burrow/"

# Check if requirements are met
[ -x "${prog_path}" ] || exit 1

RETVAL=0

start(){
  echo -n $"Starting $prog_name: "
  #pidfileofproc $prog_name
  #killproc $prog_path
  PID=$(pidofproc -p $pidfile $prog_name)
  #daemon $prog_path $options

  if [ -z "$PID" ]; then
    $prog_path $options > /dev/null 2>&1 &
    [ ! -e $pidfile ] && sleep 1
  fi

  [ -z "$PID" ] && PID=$(pidof ${prog_path})
  if [ -f $pidfile -a -d "/proc/$PID" ]; then
    #RETVAL=$?
    RETVAL=0
    #[ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
    echo_success
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
  else
    RETVAL=1
    echo_failure
  fi

  echo
  return $RETVAL
}

stop(){
  echo -n $"Shutting down $prog_name: "
  killproc -p ${pidfile} $prog_name
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
  return $RETVAL
}

restart() {
  stop
  start
}

case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  status)
    status $prog_path
    RETVAL=$?
    ;;
  *)
    echo $"Usage: $0 {start|stop|restart|status}"
    RETVAL=1
esac

exit $RETVAL

chmod +x /etc/init.d/burrow_script

Start the service and enable it at boot:

[root@IIMAPP02 burrow]# /etc/init.d/burrow_script start
Starting burrow:                                           [  OK  ]
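
The init script carries a chkconfig header, so on RHEL 6 it can be registered for boot-time startup with the standard SysV tools:

[root@IIMAPP02 burrow]# chkconfig --add burrow_script
[root@IIMAPP02 burrow]# chkconfig burrow_script on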

At this point, the Burrow installation is complete.

Test whether we can retrieve data:

[root@IIMAPP02 zabbix_agentd.d]# curl -s http://9.1.9.49:8000/v3/kafka/local/consumer|jq .
{
  "request": {
    "host": "IIMAPP02",
    "url": "/v3/kafka/local/consumer"
  },
  "consumers": [
    "elasticsearch2"
  ],
  "message": "consumer list returned",
  "error": false
}
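
With a consumer group listed, its total lag can also be queried directly; this is the same endpoint and field the Zabbix script below relies on (elasticsearch2 is the group discovered above):

[root@IIMAPP02 zabbix_agentd.d]# curl -s http://9.1.9.49:8000/v3/kafka/local/consumer/elasticsearch2/status | jq .status.totallag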

Monitoring Kafka consumption with Zabbix

1. Add a custom Zabbix monitoring item

[root@IIMAPP02 zabbix_agentd.d]# cat userparameter_kafkaconsumer.conf 
UserParameter=kafkaconsumer[*],/etc/zabbix/zabbix_agentd.d/kafka_consumers.sh $1 $2 $3 $4 $5
[root@IIMAPP02 zabbix_agentd.d]# pwd
/etc/zabbix/zabbix_agentd.d
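
After adding the UserParameter, restart the agent so it picks up the new key; the agent can then evaluate the key locally (assuming the agent service on this host is named zabbix-agent):

[root@IIMAPP02 zabbix_agentd.d]# service zabbix-agent restart
[root@IIMAPP02 zabbix_agentd.d]# zabbix_agentd -t "kafkaconsumer[discovery]"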

Test the script's auto-discovery of consumers and topics:

[root@IIMAPP02 zabbix_agentd.d]# pwd
/etc/zabbix/zabbix_agentd.d
[root@IIMAPP02 zabbix_agentd.d]# ./kafka_consumers.sh discovery
{"data":[{
  "{#CONSUMER}": "elasticsearch2",
  "{#PARTITION}": 0,
  "{#TOPIC}": "tivoli"
}]}

Contents of the kafka_consumers.sh script:

[root@IIMAPP02 zabbix_agentd.d]# cat kafka_consumers.sh 
#!/bin/bash
# Zabbix helper script: queries Burrow's v3 HTTP API with curl/jq and prints
# item values or LLD discovery JSON. Bash is required (arrays, [[ ]]).

host1=`hostname -i`
#curl1="curl -s http://$host1:8000/v2/kafka/local/consumer"
curl1="curl -s http://$host1:8000/v3/kafka/local/consumer"
case $1 in
discovery)
           res=(`$curl1 | jq ".consumers[]" | tr -d '"'`)
             for element in ${res[@]}
              do
                 res1=`$curl1/$element/lag | jq 'if .status.group == "'${element}'" then .status.partitions[] | {"{#CONSUMER}":"'${element}'","{#PARTITION}":.partition,"{#TOPIC}":.topic} else 1 end'| jq -s add`
                    if [[ "$res1" != "null" ]]; then
                      ress_all=$ress_all,$res1
                    fi
             done
         ress_all=${ress_all:1}
         echo "{\"data\":[$ress_all]}"
     ;;
     consumer_status)
              res=`$curl1/$2/status | jq ".status.status" | tr -d '"'`
              case $res in
              NOTFOUND)
                echo 0
              ;;
              OK)
                echo 1
              ;;
              WARN)
                echo 2
              ;;
              ERR)
                echo 3
              ;;
              STOP)
                echo 4
              ;;
              STALL)
                echo 5
              ;;
              *) 
                echo $res
              esac
    ;;
    consumer_tottallag)
       res=`$curl1/$2/status | jq .status.totallag`
       echo $res
    ;;
    consumer_offsets)
           case $3 in 
           start)
              res=`$curl1/$2/lag | jq '.status.partitions[]' | jq 'if .topic == "'$5'" then . else null end' | grep -v null | jq 'if .partition == '$4' then .start.offset else null end' | grep -v null`
           ;;
           end)
              res=`$curl1/$2/lag | jq '.status.partitions[]' | jq 'if .topic == "'$5'" then . else null end' | grep -v null | jq 'if .partition == '$4' then .end.offset else null end' | grep -v null`
           ;;
           esac
           echo $res
     ;;
     consumer_lag)           
           case $3 in
           start)
              res=`$curl1/$2/lag | jq '.status.partitions[]' | jq 'if .topic == "'$5'" then . else null end' | grep -v null | jq 'if .partition == '$4' then .start.lag else null end' | grep -v null`
           ;;
           end)
              res=`$curl1/$2/lag | jq '.status.partitions[]' | jq 'if .topic == "'$5'" then . else null end' | grep -v null | jq 'if .partition == '$4' then .end.lag else null end' | grep -v null`
           ;;
           esac
           echo $res
     ;;
     consumer_offsets_status)
             res=`$curl1/$2/lag | jq '.status.partitions[]' | jq 'if .topic == "'$4'" then . else null end' | grep -v null | jq 'if .partition == '$3' then .status else null end' | grep -v null | tr -d '"'`
             case $res in
             NOTFOUND)
               echo 0
             ;;
             OK)
               echo 1
             ;;
             WARN)
               echo 2
             ;;
             ERR)
               echo 3
             ;;
             STOP)
               echo 4
             ;;
             STALL)
               echo 5
             ;;
             *)
             echo $res
             ;;
             esac
     ;;
     *)
           echo "Try
./kafka_consumers.sh discovery - use for discovery consumers and partitions
./kafka_consumers.sh consumer_status {#CONSUMER}
./kafka_consumers.sh consumer_tottallag {#CONSUMER}
./kafka_consumers.sh consumer_offsets {#CONSUMER} start|end {#PARTITION} {#TOPIC}
./kafka_consumers.sh consumer_lag {#CONSUMER} start|end {#PARTITION} {#TOPIC}
"
           ;;

esac
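
A few example invocations against the consumer discovered earlier (elasticsearch2 on topic tivoli, partition 0; the values returned depend on your cluster):

[root@IIMAPP02 zabbix_agentd.d]# ./kafka_consumers.sh consumer_status elasticsearch2
[root@IIMAPP02 zabbix_agentd.d]# ./kafka_consumers.sh consumer_tottallag elasticsearch2
[root@IIMAPP02 zabbix_agentd.d]# ./kafka_consumers.sh consumer_lag elasticsearch2 end 0 tivoli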

Import the template zbx_templates_kafkaconsumers.xml and the value-map template zbx_valuemaps_kafkaconsumers.xml.

Test it from the Zabbix server:

root@ *zabbix-master* @yxsjfxapp02:/root# zabbix_get -I 9.1.8.198 -s 9.1.9.49 -p 10050 -k "kafkaconsumer[discovery]"
{"data":[{
  "{#CONSUMER}": "elasticsearch2",
  "{#PARTITION}": 0,
  "{#TOPIC}": "tivoli"
}]}

Then link the Kafka template to the hosts to be monitored.
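
For alerting, a trigger prototype can be attached to the discovered status item. A minimal sketch (hypothetical template name; the key follows the UserParameter defined above, and under the script's mapping any value above 2 means ERR, STOP, or STALL):

{Template Kafka Consumers:kafkaconsumer[consumer_status,{#CONSUMER}].last()}>2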

References:

github.com/helli0n/kafka-monitoring
https://blog.51cto.com/professor/2119071
