Are there any alerting options for scenarios where a Kafka Connect Connector or a Connector task fails or experiences errors?
We have Kafka Connect running and would like to be notified when a connector or one of its tasks fails or hits errors.
(I still can't comment, so I'm responding to clay's answer here...)
NOTE: There is a bug in the JMX metrics for task/connector status (at the time of posting: 5/11/2020)
1) When a task fails, its status metrics disappear. This is a known issue and a fix is in progress. A Jira can be found here and a PR can be found here.
2) Don't use the Connector metric to monitor the status of the tasks. The Connector can show up as running fine while its tasks are in a failed state, so you need to monitor the tasks directly (see the sketch after the quote below). This is mentioned in Confluent's Connector monitoring tips, where it says:
In most cases, connector and task states will match, though they may be different for short periods of time when changes are occurring or if tasks have failed. For example, when a connector is first started, there may be a noticeable delay before the connector and its tasks have all transitioned to the RUNNING state. States will also diverge when tasks fail since Connect does not automatically restart failed tasks.
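Until that fix lands, a simple workaround is to read each task's state from the worker's REST status endpoint (covered in detail in the next answer) rather than relying on the connector-level metric. A minimal sketch, assuming the worker's REST interface is on localhost:8083 and reusing the hdfs-sink-connector name from the example further down:
# Print each task's id and state for one connector; FAILED tasks are what
# you actually need to alert on, even when the connector reports RUNNING.
curl -s http://localhost:8083/connectors/hdfs-sink-connector/status | \
jq -r '.tasks[] | "\(.id)\t\(.state)"'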
One option is to use Kafka Connect's REST API to check the health of the worker and the status of the connectors. This approach is easy to automate with basic scripts or with most monitoring systems. It works with both the standalone worker and distributed workers, though in the latter case you can send the requests to any Kafka Connect worker in the cluster.
If you want to check the health of all the connectors, the first step is to get the list of deployed connectors:
GET /connectors
That returns a JSON array of connector names. For each of those, issue a request to check the status of the named connector:
GET /connectors/(string: name)/status
The response will include status information about the connector and its tasks. For example, the following shows a connector that is running two tasks, with one of those tasks still running and the other having failed with an error:
HTTP/1.1 200 OK
{
  "name": "hdfs-sink-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "fakehost:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "fakehost:8083"
    },
    {
      "id": 1,
      "state": "FAILED",
      "worker_id": "fakehost:8083",
      "trace": "org.apache.kafka.common.errors.RecordTooLargeException\n"
    }
  ]
}
This is just a sampling of what the REST API allows you to do.
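If you want to turn those two endpoints into a basic alert, here is a minimal sketch. It assumes a worker reachable at localhost:8083 and that curl and jq are installed; the actual alerting hook is left as a placeholder comment.
#!/usr/bin/env bash
# Check every deployed connector and flag any connector or task that is
# not RUNNING. WORKER can be any Kafka Connect worker in the cluster.
WORKER=${WORKER:-http://localhost:8083}

for connector in $(curl -s "$WORKER/connectors" | jq -r '.[]'); do
  status=$(curl -s "$WORKER/connectors/$connector/status")
  connector_state=$(echo "$status" | jq -r '.connector.state')
  bad_tasks=$(echo "$status" | jq -r '[.tasks[] | select(.state != "RUNNING") | .id | tostring] | join(",")')

  if [ "$connector_state" != "RUNNING" ] || [ -n "$bad_tasks" ]; then
    echo "ALERT: $connector state=$connector_state tasks_not_running=$bad_tasks"
    # Wire this into mail, Slack, PagerDuty, or whatever you use for alerting.
  fi
done
Run it from cron or your monitoring agent and treat any output (or a non-zero exit code you add) as the alert condition.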
Since this post was written and answered, Kafka Connect has begun providing its own official metrics. Apache Kafka Connect exposes these metrics in legacy JMX format.
If you use the Confluent Kafka Connect Helm Charts (https://github.com/confluentinc/cp-helm-charts/tree/master/charts/cp-kafka-connect), they include a Prometheus metrics exporter.
I monitor and alert on cp_kafka_connect_connect_connector_metrics{status="running"} from the Confluent Helm chart's Prometheus exporter, but there are many variations on that.
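For reference, here is a hypothetical Prometheus alerting rule along those lines. It assumes the exporter emits one series per connector with a connector label, a lowercase status label, and a sample value of 1 for the connector's current state, so adapt the expression to whatever your exporter actually produces:
groups:
- name: kafka-connect
  rules:
  - alert: KafkaConnectConnectorFailed
    # Hypothetical rule: fires when a connector has reported a failed status for 5 minutes.
    expr: cp_kafka_connect_connect_connector_metrics{status="failed"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kafka Connect connector {{ $labels.connector }} is in a failed state"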
Using the official Kafka Connect metrics is generally preferable for any automated monitoring and alerting setup. This option wasn't available back when this post was originally written and answered.
FYI, Kafka still doesn't expose consumer lag metrics, so you still need third-party options to monitor and alert on lag.
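For an ad-hoc lag check on a sink connector you can use the stock consumer-groups tool; this assumes the default group naming of connect-<connector-name>, so adjust it if you override the group.id:
# Describe the consumer group backing a sink connector; the LAG column is
# what a third-party tool (e.g. Burrow or a lag exporter) would alert on.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group connect-hdfs-sink-connector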
I know that this is a really old question, but we ran into a similar issue: we use Kafka Connect really heavily, and it is very difficult to monitor each connector individually, especially when you are managing more than 150 connectors.
Hence we have written a small Kotlin-based application that accepts a config.json in which you specify the cluster config; if an SMTP config is also specified, it keeps polling the cluster at the configured interval and sends mail-based alerts.
If it fits your use case, please do use it, and raise issues in case you face any.
The link to the repo is as follows: https://github.com/gunjdesai/kafka-connect-monit
The image is also pushed to Docker Hub, and you can run it directly using the following command:
docker run -d -v <location-of-your-config-file.json>:/home/code/config.json gunjdesai/kafka-connect-monit
Hope this may be helpful to you.
Building on what Randall says, this shell script uses the Confluent CLI to show the state of all connectors and tasks. You could use that as the basis of alerting:
Robin@asgard02 ~/c/confluent-3.3.0> ./bin/confluent status connectors| \
jq '.[]'| \
xargs -I{connector} ./bin/confluent status {connector}| \
jq -c -M '[.name,.connector.state,.tasks[].state]|join(":|:")'| \
column -s : -t| \
sed 's/\"//g'| \
sort
file-sink-mysql-foobar | RUNNING | RUNNING
jdbc_source_mysql_foobar_01 | RUNNING | RUNNING
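To turn that listing into an actual alert, one minimal option is to save the pipeline above as a script (connector_status.sh is just a hypothetical name) and let cron or your monitoring agent react to any FAILED entries:
# Prints any FAILED lines and exits non-zero, which cron will email and
# most monitoring agents will treat as a check failure.
if ./connector_status.sh | grep FAILED; then
  exit 1
fi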