A Kafka cluster exposes a large number of metrics, and it can be hard to know which ones to look at. Many guides are available that address different ways of doing this in great detail.
Here is our lightweight overview of what we usually look for at Irori. We usually scrape metrics from the components using Prometheus and build dashboards in Grafana to present the results.
Primary Cluster Health
Make sure that the cluster is up and running and that no brokers or ZooKeeper nodes are missing. We usually check the ZooKeeper leader and quorum size, that there is exactly one active Kafka controller, and that there are no under-replicated partitions.
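As a minimal sketch, the checks above can be expressed as a small function that takes one reading of the relevant values and lists what looks wrong. The `ClusterSnapshot` type, its field names and the problem strings are hypothetical and only serve the illustration; how the values are collected (for example from Prometheus) is left out.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """One point-in-time reading of the health values (hypothetical field names)."""
    brokers_up: int
    brokers_expected: int
    zookeeper_nodes_up: int
    zookeeper_nodes_expected: int
    zookeeper_leaders: int
    active_controllers: int
    under_replicated_partitions: int


def health_problems(s: ClusterSnapshot) -> list[str]:
    """Return a list of detected problems; an empty list means healthy."""
    problems = []
    if s.brokers_up < s.brokers_expected:
        problems.append("missing brokers")
    if s.zookeeper_nodes_up < s.zookeeper_nodes_expected:
        problems.append("missing zookeeper nodes")
    # A ZooKeeper ensemble needs a strict majority of its nodes to keep quorum.
    if s.zookeeper_nodes_up <= s.zookeeper_nodes_expected // 2:
        problems.append("zookeeper quorum lost")
    if s.zookeeper_leaders != 1:
        problems.append("zookeeper leader count is not 1")
    if s.active_controllers != 1:
        problems.append("active controller count is not 1")
    if s.under_replicated_partitions > 0:
        problems.append("under-replicated partitions")
    return problems


if __name__ == "__main__":
    # Example: one of three brokers is down and partitions are under-replicated.
    degraded = ClusterSnapshot(2, 3, 3, 3, 1, 1, 4)
    for problem in health_problems(degraded):
        print(problem)
```

In practice each of these conditions would more likely live in an alerting rule than in application code, but the logic is the same.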
Long-term Cluster Sizing
Monitor Kafka's resource consumption so you can proactively scale the cluster by adding more nodes or resources. We look at disk space, disk I/O, network I/O and memory/CPU utilization on the nodes to determine the proper action.
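To illustrate what "proactively" can mean for disk space, here is a rough sketch that fits a straight line through recent usage samples and estimates how long it takes until the disk is full. The function name `time_until_full` and the sample format are assumptions made for this example; it is similar in spirit to what Prometheus offers with `predict_linear`, but it is not that implementation.

```python
from typing import Optional, Sequence, Tuple


def time_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate the time until `capacity` is reached.

    `samples` is a sequence of (timestamp, used) pairs, oldest first.
    The result is in the same unit as the timestamps. Returns None when
    usage is flat or shrinking, since no fill time can be projected.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must span more than one timestamp")
    # Least-squares slope: growth in usage per unit of time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread
    if slope <= 0:
        return None
    # Project from the most recent reading rather than the fitted line.
    latest_used = samples[-1][1]
    return max(capacity - latest_used, 0.0) / slope


if __name__ == "__main__":
    # Daily readings in GB on a 200 GB disk (made-up numbers).
    readings = [(0, 100), (1, 110), (2, 120)]
    print(time_until_full(readings, 200))
```

A linear fit is a deliberate simplification. Retention settings and bursty topics make real disk usage less regular, so treat the estimate as a prompt to investigate, not as a deadline.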
Producer/consumer rates can be used for alerting when data is flowing at a lower or higher rate than normal, or not at all. In the same way, measuring consumer lag can be used to identify problems with specific downstream consumers.
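Both ideas can be sketched in a few lines. The function names, the 50% tolerance and the labels below are assumptions chosen for the example, and the baseline is taken as given; how "normal" is established depends on your traffic patterns.

```python
from typing import Dict


def classify_rate(current: float, baseline: float, tolerance: float = 0.5) -> str:
    """Compare a message rate against its normal level.

    `tolerance` is the allowed relative deviation from `baseline`
    (0.5 means anything within +/- 50% counts as normal).
    """
    if current <= 0:
        return "stalled"
    if current < baseline * (1 - tolerance):
        return "low"
    if current > baseline * (1 + tolerance):
        return "high"
    return "normal"


def consumer_lag(
    end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> Dict[int, int]:
    """Lag per partition: log end offset minus the group's committed offset.

    A partition without a committed offset is treated as if the group
    had committed offset 0, which is an assumption of this sketch.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


if __name__ == "__main__":
    print(classify_rate(current=40, baseline=100))
    print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
```

Lag that keeps growing for one consumer group while the producer rate stays normal points at that consumer and not at the cluster.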
Consider this a starting point that you need to iterate on to fit your specific needs. Different use cases and organizational responsibilities require different metrics.