Are there any alerting options for scenarios where a Kafka Connect Connector or a Connector task fails or experiences errors?
We have Kafka Connect running and would like to be notified when a connector or one of its tasks fails or hits errors.
(I still can't comment, so I'm responding to clay's answer here...)
NOTE: There is a bug in the JMX metrics for task/connector status (at the time of posting: 5/11/2020)
1) When a task fails, its status metrics disappear. This is a known issue and a fix is in progress. A Jira can be found here and a PR can be found here.
2) Don't use the Connector metric to monitor the status of the tasks. The Connector can show up as running fine while its tasks are in a failed state, so you need to monitor the tasks directly (see the sketch after the quote below). This is mentioned in Confluent's Connector monitoring tips, where it says:
In most cases, connector and task states will match, though they may be different for short periods of time when changes are occurring or if tasks have failed. For example, when a connector is first started, there may be a noticeable delay before the connector and its tasks have all transitioned to the RUNNING state. States will also diverge when tasks fail since Connect does not automatically restart failed tasks.
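Until that fix lands, a simple workaround is to read each task's state from the worker's REST status endpoint (covered in detail in the next answer) rather than relying on the connector-level metric. A minimal sketch, assuming the worker's REST interface is on localhost:8083 and reusing the hdfs-sink-connector name from the example further down:
# Print each task's id and state for one connector; FAILED tasks are what
# you actually need to alert on, even when the connector reports RUNNING.
curl -s http://localhost:8083/connectors/hdfs-sink-connector/status | \
jq -r '.tasks[] | "\(.id)\t\(.state)"'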
One option is to use Kafka Connect's REST API to check the health of the worker and the status of the connectors. This approach is easy to automate with basic scripts or with most monitoring systems. It works with both the standalone worker and distributed workers, though in the latter case you can send the requests to any Kafka Connect worker in the cluster.
If you want to check the health of all the connectors, the first step is to get the list of deployed connectors:
GET /connectors
That returns a JSON array of connector names. For each of those, issue a request to check the status of the named connector:
GET /connectors/(string: name)/status
The response will include status information about the connector and its tasks. For example, the following shows a connector that is running two tasks, with one of those tasks still running and the other having failed with an error:
HTTP/1.1 200 OK
{
  "name": "hdfs-sink-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "fakehost:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "fakehost:8083"
    },
    {
      "id": 1,
      "state": "FAILED",
      "worker_id": "fakehost:8083",
      "trace": "org.apache.kafka.common.errors.RecordTooLargeException\n"
    }
  ]
}
This is just a sampling of what the REST API allows you to do.
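If you want to turn those two endpoints into a basic alert, here is a minimal sketch. It assumes a worker reachable at localhost:8083 and that curl and jq are installed; the actual alerting hook is left as a placeholder comment.
#!/usr/bin/env bash
# Check every deployed connector and flag any connector or task that is
# not RUNNING. WORKER can be any Kafka Connect worker in the cluster.
WORKER=${WORKER:-http://localhost:8083}

for connector in $(curl -s "$WORKER/connectors" | jq -r '.[]'); do
  status=$(curl -s "$WORKER/connectors/$connector/status")
  connector_state=$(echo "$status" | jq -r '.connector.state')
  bad_tasks=$(echo "$status" | jq -r '[.tasks[] | select(.state != "RUNNING") | .id | tostring] | join(",")')

  if [ "$connector_state" != "RUNNING" ] || [ -n "$bad_tasks" ]; then
    echo "ALERT: $connector state=$connector_state tasks_not_running=$bad_tasks"
    # Wire this into mail, Slack, PagerDuty, or whatever you use for alerting.
  fi
done
Run it from cron or your monitoring agent and treat any output (or a non-zero exit code you add) as the alert condition.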
Since this post was written and answered, Kafka Connect has begun providing its own official metrics. Apache Kafka Connect exposes these metrics in legacy JMX format.
If you use the Confluent Kafka Connect Helm Charts (https://github.com/confluentinc/cp-helm-charts/tree/master/charts/cp-kafka-connect), they include a Prometheus metrics exporter.
I monitor and alert on cp_kafka_connect_connect_connector_metrics{status="running"} from the Confluent Helm chart's Prometheus exporter, but there are many variations on that.
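For reference, here is a hypothetical Prometheus alerting rule along those lines. It assumes the exporter emits one series per connector with a connector label, a lowercase status label, and a sample value of 1 for the connector's current state, so adapt the expression to whatever your exporter actually produces:
groups:
- name: kafka-connect
  rules:
  - alert: KafkaConnectConnectorFailed
    # Hypothetical rule: fires when a connector has reported a failed status for 5 minutes.
    expr: cp_kafka_connect_connect_connector_metrics{status="failed"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kafka Connect connector {{ $labels.connector }} is in a failed state"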
Using the official Kafka Connect metrics is generally preferable for any automated monitoring and alerting setup. This option wasn't available back when this post was originally written and answered.
FYI, Kafka still doesn't expose consumer lag metrics, so you still need third-party options to monitor and alert on lag.
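For an ad-hoc lag check on a sink connector you can use the stock consumer-groups tool; this assumes the default group naming of connect-<connector-name>, so adjust it if you override the group.id:
# Describe the consumer group backing a sink connector; the LAG column is
# what a third-party tool (e.g. Burrow or a lag exporter) would alert on.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group connect-hdfs-sink-connector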
I know that this is a really old question, but we ran into a similar issue: we use Kafka Connect really heavily, and it is very difficult to monitor each connector individually, especially when you are managing more than 150 connectors.
Hence we have written a small Kotlin-based application that accepts a config.json in which you specify the cluster config; if an SMTP config is also specified, it keeps polling the cluster at the configured interval and sends mail-based alerts.
If it fits your use case, please do use it, and raise issues in case you face any.
The link to the repo is as follows: https://github.com/gunjdesai/kafka-connect-monit
The image is also pushed to Docker Hub, and you can run it directly using the following command:
docker run -d -v <location-of-your-config-file.json>:/home/code/config.json gunjdesai/kafka-connect-monit
Hope this may be helpful to you.
Building on what Randall says, this shell script uses the Confluent CLI to show the state of all connectors and tasks. You could use that as the basis of alerting:
Robin@asgard02 ~/c/confluent-3.3.0> ./bin/confluent status connectors| \
jq '.[]'| \
xargs -I{connector} ./bin/confluent status {connector}| \
jq -c -M '[.name,.connector.state,.tasks[].state]|join(":|:")'| \
column -s : -t| \
sed 's/\"//g'| \
sort
file-sink-mysql-foobar | RUNNING | RUNNING
jdbc_source_mysql_foobar_01 | RUNNING | RUNNING
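To turn that listing into an actual alert, one minimal option is to save the pipeline above as a script (connector_status.sh is just a hypothetical name) and let cron or your monitoring agent react to any FAILED entries:
# Prints any FAILED lines and exits non-zero, which cron will email and
# most monitoring agents will treat as a check failure.
if ./connector_status.sh | grep FAILED; then
  exit 1
fi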