
Kafka broker monitoring
Brokers are the operational core of a Kafka cluster. They receive produce requests, serve fetch requests, manage partition replicas, and coordinate leader election. When a broker degrades, every producer and consumer connected to it is affected, often before your monitoring stack surfaces anything useful.
This article covers broker-level monitoring specifically: the JMX metrics to watch, thresholds to alert on, process monitoring scripts, and how to diagnose the most common failure modes. For cluster-wide concerns, consumer lag, and producer delivery guarantees, refer to the separate guides on Kafka cluster monitoring, Kafka consumer monitoring, and Kafka producer monitoring. For a full overview of Kafka observability, see the Kafka monitoring guide.
Key takeaways
- Kafka brokers expose metrics via JMX by default; Prometheus users need the JMX Exporter agent to scrape them.
- UnderReplicatedPartitions and ActiveControllerCount are the two most critical metrics for cluster health.
- Broker monitoring covers three layers: the Kafka process, the JVM, and the host OS. All three require attention.
- A lightweight process monitoring script can detect broker failures faster than most monitoring stacks.
- Most broker incidents trace to a small set of root causes: disk saturation, leader imbalance, network thread exhaustion, or ISR instability.
What is Kafka broker monitoring?
A Kafka broker handles several concurrent responsibilities: writing incoming messages to disk, serving read requests from consumers, replicating partition data to follower brokers, and participating in leader election. Each of these activities has distinct failure modes.
Broker health is not the same as cluster health. A single degraded broker can produce replica lag, leader imbalance, and consumer fetch failures without the cluster appearing unavailable. Topics remain accessible on unaffected brokers, but the partitions whose leaders or followers are on the degraded broker will silently degrade. Identifying which broker is under pressure, and why, is the purpose of broker-level monitoring.
Monitoring spans three layers:
- Kafka process metrics (exposed via JMX): replication state, request throughput, controller status, failure rates.
- JVM metrics: heap usage, garbage collection pause duration, open file descriptors.
- Host OS metrics: disk capacity, disk I/O throughput, network utilization.
Consumer lag, producer delivery guarantees, and cluster-wide partition distribution are covered in separate articles. This article focuses on the broker process and what runs immediately around it.
How Kafka brokers expose metrics
JMX (default)
Kafka exposes metrics as JMX MBeans by default. Server-side broker metrics use Yammer Metrics internally; native Java clients use Kafka's own metrics registry. Both project their measurements onto MBeans hosted in the JVM's MBean server.
Remote JMX access is disabled by default. To enable it, set the JMX_PORT environment variable when starting the broker:
export JMX_PORT=9999
In production, raw JMX access is a security risk. Because JMX supports remote method invocation (RMI), an unauthenticated client can invoke operations on MBeans, including modifying runtime configurations and triggering JVM shutdowns. Use KAFKA_JMX_OPTS to enforce authentication and encryption:
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=true \
-Dcom.sun.management.jmxremote.ssl=true \
-Dcom.sun.management.jmxremote.port=9999 \
-Dcom.sun.management.jmxremote.rmi.port=9999 \
-Dcom.sun.management.jmxremote.password.file=/etc/kafka/jmxremote.password \
-Dcom.sun.management.jmxremote.access.file=/etc/kafka/jmxremote.access"
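The password and access files referenced above have to exist before the broker starts. A minimal sketch of creating them follows; the role name, password, and directory are placeholders chosen for illustration, not values from this article. It relies on two standard JVM behaviors: the JMX agent rejects a password file that other users can read, and a role marked readonly can read attributes but cannot invoke operations.

```shell
#!/bin/bash
# Sketch: create JMX credential files for a read-only monitoring role.
# JMX_CONF_DIR, JMX_ROLE, and JMX_PASSWORD are placeholders (assumptions).
JMX_CONF_DIR="${JMX_CONF_DIR:-$(mktemp -d)}"   # e.g. /etc/kafka in production
JMX_ROLE="monitorRole"
JMX_PASSWORD="${JMX_PASSWORD:-change-me}"

# Password file: one "role password" pair per line.
printf '%s %s\n' "${JMX_ROLE}" "${JMX_PASSWORD}" > "${JMX_CONF_DIR}/jmxremote.password"

# Access file: "readonly" allows reading attributes but not invoking operations.
printf '%s readonly\n' "${JMX_ROLE}" > "${JMX_CONF_DIR}/jmxremote.access"

# Restrict the password file to its owner; the JVM checks this at startup.
chmod 600 "${JMX_CONF_DIR}/jmxremote.password"
chmod 644 "${JMX_CONF_DIR}/jmxremote.access"

echo "Wrote JMX credential files to ${JMX_CONF_DIR}"
```

Run this as the same OS user that runs the broker, so that the owner-only permission on the password file matches the broker process.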
Tools such as jconsole, jmxterm, and kafka-run-class.sh kafka.tools.JmxTool can query JMX directly once access is enabled.
Prometheus via JMX Exporter
The standard approach for Prometheus-based stacks is to run the Prometheus JMX Exporter as a Java agent alongside the broker JVM.
Running the exporter in local agent mode (rather than as a standalone HTTP server) is preferred. It avoids the serialization overhead of remote RMI polling and captures additional host-level process metrics including JVM CPU and memory utilization.
The agent translates JMX MBeans to Prometheus metrics and serves them on an HTTP endpoint, typically on port 7071. It requires a YAML configuration file that maps MBean paths to Prometheus metric names. A community-maintained configuration for Kafka is available in the JMX Exporter GitHub repository. Keep this configuration file in version control alongside your broker configuration and review it when upgrading Kafka versions.
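One common way to attach the agent is through the KAFKA_OPTS environment variable, which the Kafka start scripts pass to the JVM. The sketch below assumes hypothetical locations for the agent JAR and its YAML configuration; only the port matches the value mentioned above.

```shell
#!/bin/bash
# Sketch: attach the Prometheus JMX Exporter as a Java agent.
# The JAR and config paths are placeholders (assumptions), adjust to your install.
EXPORTER_JAR="/opt/jmx_exporter/jmx_prometheus_javaagent.jar"
EXPORTER_CONFIG="/etc/kafka/jmx_exporter.yml"
EXPORTER_PORT=7071

# -javaagent syntax used by the exporter: <jar>=<port>:<config file>
export KAFKA_OPTS="${KAFKA_OPTS:-} -javaagent:${EXPORTER_JAR}=${EXPORTER_PORT}:${EXPORTER_CONFIG}"
echo "KAFKA_OPTS=${KAFKA_OPTS}"

# After the broker starts, confirm the endpoint responds:
#   curl -s "http://localhost:${EXPORTER_PORT}/metrics" | head
```

Export the variable in the same environment (shell, systemd unit, or container spec) that launches the broker, otherwise the agent is not loaded.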
Kafka metrics reporters (pluggable)
Kafka supports custom MetricsReporter implementations via the metric.reporters setting in server.properties. This allows metrics to be pushed directly to external systems without going through JMX or Prometheus. Confluent Platform ships a reporter that sends metrics to Confluent Control Center. Vendors such as Datadog and New Relic provide their own reporter implementations as well.
KRaft mode: In KRaft mode (GA since Kafka 3.3), some ZooKeeper-related MBeans are removed and several KRaft-specific MBeans appear in their place, including FencedBrokerCount and LastAppliedRecordLagMs. If you are migrating from ZooKeeper mode, or if you have inherited dashboards built against an older cluster, audit your MBean paths before deploying them against a KRaft cluster.
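A rough way to start that audit is to search exported dashboards and alert rules for ZooKeeper-era metric names. The sketch below is illustrative only: the search pattern covers two ZooKeeper-related MBean types (ZooKeeperClientMetrics and SessionExpireListener) and is not an exhaustive list, and the dashboards directory is a hypothetical path.

```shell
#!/bin/bash
# find_zk_refs DIR: list files under DIR that mention ZooKeeper-era metric names.
# The pattern is a starting point (an assumption), not a complete inventory.
find_zk_refs() {
  local dir="$1"
  # -r recurse, -i ignore case, -l print file names only, -E extended regex
  grep -rilE 'zookeeper|sessionexpirelistener' "${dir}" 2>/dev/null | sort
}

# Example: scan a directory of exported dashboards (hypothetical path).
#   find_zk_refs ./dashboards
```

Any file this reports is a candidate for review; an empty result does not prove a dashboard is KRaft-ready, since renamed or removed metrics may not contain either keyword.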
Key metrics to monitor
The sections below cover metrics across all three layers. JMX MBean paths are given for direct JMX access; if you are using Prometheus, metric names follow the pattern produced by the JMX Exporter configuration.
Replication health metrics
These are the highest-priority metrics on any broker. A non-zero value in the first three rows means data availability is at risk.
Controller metrics
There is always exactly one active controller in a Kafka cluster. These metrics confirm that invariant holds and that the controller is performing well.
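The single-controller invariant can be checked by summing ActiveControllerCount across the nodes that report it: the sum should be exactly 1. The sketch below takes the per-node values as arguments; how you collect them (JMX, Prometheus, or another source) is left open, and the exit codes are an assumed convention.

```shell
#!/bin/bash
# check_controller_sum VALUE...: verify that exactly one node reports itself
# as the active controller. Arguments are per-node ActiveControllerCount values.
check_controller_sum() {
  local sum=0 v
  for v in "$@"; do
    sum=$(( sum + v ))
  done
  if [ "${sum}" -eq 1 ]; then
    echo "OK: one active controller"
    return 0
  elif [ "${sum}" -eq 0 ]; then
    echo "CRITICAL: no active controller"
    return 2
  else
    echo "CRITICAL: ${sum} active controllers"
    return 2
  fi
}

# Example with three nodes, the second being the active controller.
check_controller_sum 0 1 0
```

A sum of 0 or 2 can appear briefly during a controller failover, so an alert built on this check usually needs a short grace period before firing.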
Request handling metrics
These metrics indicate whether the broker is keeping up with its request load.
Latency sub-phases. TotalTimeMs is the sum of five phases: time waiting in the request queue (RequestQueueTimeMs), local processing time on the partition leader (LocalTimeMs), replication wait time for acks=all producers (RemoteTimeMs), time waiting in the response queue (ResponseQueueTimeMs), and time to transmit the response to the client (ResponseSendTimeMs). When P99 TotalTimeMs is elevated, checking each sub-phase narrows the root cause. High RemoteTimeMs points to replication pressure. High LocalTimeMs typically indicates disk write saturation or message format conversion overhead.
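The sub-phase comparison can be sketched as a small helper that reports which phase dominates. The input numbers below are made up for illustration, not measurements from a real broker. Percentiles of individual phases do not add up exactly to the percentile of the total, so treat the share as a rough attribution rather than a precise breakdown.

```shell
#!/bin/bash
# dominant_phase: reads "<phase> <p99_ms>" lines on stdin and prints the phase
# with the largest value and its share of the summed phases.
dominant_phase() {
  awk '
    { total += $2; if ($2 > max) { max = $2; name = $1 } }
    END { if (total > 0) printf "%s %.0f%%\n", name, 100 * max / total }
  '
}

# Illustrative values only (assumption).
dominant_phase <<'EOF'
RequestQueueTimeMs 2
LocalTimeMs 5
RemoteTimeMs 80
ResponseQueueTimeMs 1
ResponseSendTimeMs 2
EOF
```

With these sample values the helper points at RemoteTimeMs, which per the paragraph above would suggest replication pressure rather than local disk saturation.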
Throughput and I/O metrics
These metrics give a baseline view of broker load and are useful for capacity planning.
Log and disk metrics
JVM metrics
JVM pressure is a common root cause of ISR instability and request latency spikes. If a broker's JVM pauses long enough during garbage collection, the broker misses its heartbeats to the controller (in ZooKeeper mode, its ZooKeeper session can expire instead). The resulting timeout forces partition leadership reassignment, which looks from the outside like an ISR shrink or an elevated election rate.
Heap sizing. Kafka uses a hybrid memory model: a small JVM heap handles partition metadata, indexes, and producer state, while the OS page cache holds hot log segment data. Allocating too much to the JVM heap reduces the page cache, which forces consumer reads to disk. On a 64 GB host, a typical configuration is 6-12 GB for the JVM heap and the remainder for the OS page cache. A rough rule of thumb is 1-2 MB of heap per active partition replica hosted on the broker.
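The rule of thumb translates into simple arithmetic. The sketch below estimates a heap range from a replica count and shows one way to apply a value through KAFKA_HEAP_OPTS, which the Kafka start scripts read. The replica count and the 8g setting are example values, not recommendations.

```shell
#!/bin/bash
# estimate_heap_mb REPLICAS: print the low and high heap estimate in MB,
# using the 1-2 MB per partition replica rule of thumb.
estimate_heap_mb() {
  local replicas="$1"
  echo "$(( replicas * 1 ))-$(( replicas * 2 )) MB"
}

# Example: a broker hosting 4000 partition replicas (illustrative number).
estimate_heap_mb 4000

# Setting -Xms and -Xmx to the same value avoids heap resizing at runtime.
# 8g is an example within the 6-12 GB range mentioned above.
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"
```

Treat the estimate as a starting point and confirm it against observed heap usage and GC pause times on the running broker.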
Broker process monitoring script
The JMX metrics above require the broker to be running and reachable. A process-level check catches failures that happen before or outside the metrics stack: a broker crash, a port that is not listening, a startup failure, or a JVM that is hung during initialization. The scripts below complement JMX monitoring rather than replacing it.
Check the broker process is running (jps / ps)
jps (JVM process status, included in the JDK) lists running JVM processes by main class. For Apache Kafka, the main class is kafka.Kafka.
#!/bin/bash
BROKER_CLASS="kafka.Kafka"
if jps -l | grep -q "${BROKER_CLASS}"; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi
If jps is not available, fall back to ps:
#!/bin/bash
if ps aux | grep -q '[k]afka.Kafka'; then
  echo "Broker process is running"
  exit 0
else
  echo "ERROR: Kafka broker process not found"
  exit 1
fi
Confluent Platform: The main class for Confluent Server is io.confluent.kafka.server.ConfluentServer. Adjust the grep pattern accordingly.
A running process does not mean the broker is healthy. It may be in an error state or hung waiting on disk or network. Use this check as a first-line detector, not a health indicator.
Check the broker port is accepting connections
A process check confirms the JVM is alive, but not that it is accepting Kafka connections. Check that the listener port (default 9092) is open and accepting connections:
#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092
TIMEOUT=5
if nc -z -w "${TIMEOUT}" "${BROKER_HOST}" "${BROKER_PORT}" 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi
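In environments without netcat, use Bash's built-in /dev/tcp redirection. Unlike nc -w, it has no timeout of its own, so the version below wraps the connection attempt in the coreutils timeout command:
#!/bin/bash
BROKER_HOST="localhost"
BROKER_PORT=9092
TIMEOUT=5
# /dev/tcp/<host>/<port> is handled by Bash itself, not by the filesystem.
if timeout "${TIMEOUT}" bash -c "echo > /dev/tcp/${BROKER_HOST}/${BROKER_PORT}" 2>/dev/null; then
  echo "Broker port ${BROKER_PORT} is accepting connections"
  exit 0
else
  echo "ERROR: Broker not accepting connections on ${BROKER_HOST}:${BROKER_PORT}"
  exit 1
fi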
JMX-based health check script
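A port check validates TCP connectivity but not broker health. Use kafka-run-class.sh with kafka.tools.JmxTool to query UnderReplicatedPartitions directly. This works without additional tooling if the JDK and Kafka binaries are available. On recent Kafka releases the tool lives in the org.apache.kafka.tools package, so check the class name against the version you are running.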
#!/bin/bash
JMX_HOST="localhost"
JMX_PORT=9999
MBEAN="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
# JmxTool prints CSV; the last field of the last line is the metric value.
result=$(kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url "service:jmx:rmi:///jndi/rmi://${JMX_HOST}:${JMX_PORT}/jmxrmi" \
  --object-name "${MBEAN}" \
  --attributes Value \
  --one-time true 2>/dev/null | tail -n 1 | awk -F',' '{print $NF}')
# Treat an empty or non-numeric result as a failed query, not as a metric value.
if ! [[ "${result}" =~ ^[0-9]+$ ]]; then
  echo "ERROR: Could not retrieve JMX metric"
  exit 2
fi
if [ "${result}" -gt 0 ]; then
  echo "WARNING: UnderReplicatedPartitions = ${result}"
  exit 1
else
  echo "OK: UnderReplicatedPartitions = 0"
  exit 0
fi
This gives a more meaningful health signal than a port check. A broker can accept TCP connections while being in a degraded replication state.
Kafka Admin API check
kafka-broker-api-versions.sh is a lightweight liveness check that validates the Kafka protocol layer, not just TCP connectivity:
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
If the broker is healthy, this returns the list of supported API versions. If the protocol handshake fails or the broker is not responsive, it exits with an error. Use this as a quick manual check or wrap it in a script for automated monitoring. It is a useful final check after a rolling restart to confirm each broker has rejoined the cluster before proceeding to the next.
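For the rolling-restart case, the check can be wrapped in a retry loop that waits for a broker to respond before moving on. The sketch below is a generic helper that retries any command; the attempt count, delay, and broker address in the usage comment are illustrative assumptions.

```shell
#!/bin/bash
# wait_for_broker ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_for_broker() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      echo "OK after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
  done
  echo "ERROR: check did not succeed after ${attempts} attempt(s)"
  return 1
}

# Usage during a rolling restart (requires the Kafka CLI tools on PATH;
# broker1:9092 is a hypothetical address):
#   wait_for_broker 30 10 kafka-broker-api-versions.sh --bootstrap-server broker1:9092
```

A successful API versions response confirms the protocol layer is up; pairing it with the UnderReplicatedPartitions check above gives more confidence that replicas have caught up before the next broker is restarted.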
Broker monitoring tools
Prometheus and Grafana
The most common self-hosted stack. The Prometheus JMX Exporter scrapes broker JMX metrics, Prometheus stores them as time-series data, and Grafana renders dashboards. Community dashboards for Kafka are available on Grafana Labs, though quality and metric coverage vary. Maintaining a complete and accurate Kafka dashboard requires ongoing attention as cluster topology and Kafka versions change.
Managed monitoring platforms (Datadog, New Relic)
Agent-based collection where vendor agents handle JMX scraping and forwarding. Both Datadog and New Relic provide out-of-the-box Kafka dashboards and alert policies, which reduces setup time compared to the self-hosted Prometheus stack. Cost scales with host count and metrics volume. Default alert thresholds may need adjustment for high-throughput deployments.
Confluent Control Center
Available in Confluent Platform, not Apache Kafka. Provides deep integration with Confluent-specific metrics and a built-in interface for broker health, topic management, and consumer lag. Less useful if you are running vanilla Apache Kafka.
Kpow by Factor House
Kpow is purpose-built for Kafka observability. It surfaces broker health, partition state, ISR status, and throughput metrics without requiring a separate Prometheus stack or Grafana instance, and it runs inside your own network. Try it free for 30 days.
Alerting strategy for broker monitoring
The goal is to page on-call when there is an active data risk and route everything else to a dashboard or asynchronous channel. Static thresholds on percentage-based metrics can generate false alarms during quiet periods: a single failed request in a low-traffic window can push an error rate to 50% and trigger a high-severity alert despite having no real impact. Organize alerts by severity:
Recommended thresholds based on the Apache Kafka documentation and operational guidance:
Link every production alert to a runbook. When an alert fires, the on-call engineer should be able to identify the root cause and begin remediation without first looking up what the metric means. A useful runbook includes the metric's JMX MBean source, related broker log entries to check, and step-by-step remediation for the most common causes of that alert.
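The low-traffic false alarm described earlier in this section can be avoided by requiring a minimum request volume before evaluating an error-rate threshold. The sketch below shows the idea with integer arithmetic; the minimum of 100 requests and the 5% threshold are illustrative values, not recommendations from the Kafka documentation.

```shell
#!/bin/bash
# should_alert FAILED TOTAL MIN_REQUESTS THRESHOLD_PCT
# Returns 0 (alert) only if the window has enough traffic AND the error rate
# meets the threshold; returns 1 otherwise.
should_alert() {
  local failed="$1" total="$2" min_requests="$3" threshold_pct="$4"
  if [ "${total}" -lt "${min_requests}" ]; then
    echo "skip: only ${total} requests in window"
    return 1
  fi
  # failed/total >= threshold/100, rearranged to avoid floating point.
  if [ $(( failed * 100 )) -ge $(( total * threshold_pct )) ]; then
    echo "alert: ${failed}/${total} requests failed"
    return 0
  fi
  echo "ok: ${failed}/${total} requests failed"
  return 1
}

# One failure out of two requests is 50%, but the window is too quiet to count.
should_alert 1 2 100 5 || true
# 80 failures out of 1000 requests is 8%, above the 5% threshold.
should_alert 80 1000 100 5 || true
```

Most alerting systems can express the same gate natively by combining a rate condition with a volume condition, which is usually preferable to a standalone script.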
Common broker issues and how to diagnose them
Kafka broker monitoring best practices
- Monitor all three layers: Kafka process metrics, JVM metrics, and host OS metrics. A problem at any layer will eventually surface in the others.
- Alert on UnderReplicatedPartitions and ActiveControllerCount before everything else. These are the leading indicators of cluster health degradation.
- Set disk usage alerts at 70% and 85%. Kafka stops accepting produce requests when the log directory fills completely and does not degrade gracefully before that threshold.
- Use a lightweight process check script as a first-line detector. It can catch broker failures before your metrics pipeline does, particularly in the window between a crash and the first missed metrics scrape.
- Keep your JMX Exporter configuration in version control and treat it as part of your Kafka infrastructure. Changes to it affect what your dashboards and alerts can observe.
- Do not rely on consumer lag alone as a broker health signal. Consumer lag is a symptom; broker metrics tell you the cause.
- In KRaft mode, audit your dashboard configuration during and after the migration. ZooKeeper-related MBeans are removed; KRaft-specific MBeans such as FencedBrokerCount and LastAppliedRecordLagMs appear in their place.
- Link every production alert to a runbook that includes the metric's JMX MBean source, related log entries to check, and remediation steps for the most common causes.
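The 70% and 85% disk thresholds from the list above can be checked with a few lines of shell. This is a minimal sketch: the LOG_DIR default is a placeholder and should be replaced with the path configured in log.dirs, and the severity labels are an assumed naming convention.

```shell
#!/bin/bash
# disk_severity PCT: map a disk usage percentage to a severity label.
disk_severity() {
  local pct="$1"
  if [ "${pct}" -ge 85 ]; then
    echo "CRITICAL"
  elif [ "${pct}" -ge 70 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

LOG_DIR="${LOG_DIR:-/tmp}"   # placeholder (assumption); use your log.dirs path

# df -P prints one POSIX-format line per filesystem; column 5 is the usage percentage.
used_pct="$(df -P "${LOG_DIR}" | awk 'NR==2 { gsub("%", "", $5); print $5 }')"

case "${used_pct}" in
  ''|*[!0-9]*) echo "ERROR: could not read disk usage for ${LOG_DIR}" ;;
  *) echo "${LOG_DIR}: ${used_pct}% used -> $(disk_severity "${used_pct}")" ;;
esac
```

If log.dirs lists several directories on different volumes, run the check once per directory, since any one of them filling up affects the partitions stored there.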
Monitor Kafka brokers with Factor House
Kpow surfaces broker health, partition state, ISR status, and throughput metrics from inside your own network. You get visibility into the metrics covered in this article without running a separate Prometheus stack or maintaining Grafana dashboards. Give it a try with a free 30-day trial.