
Kafka cluster monitoring
Kafka clusters are typically monitored from the broker up: process health checks, JVM metrics, disk alerts, per-broker request latencies. That instrumentation is necessary, but it has a blind spot. A broker can report normal metrics in isolation while the cluster as a whole is in a degraded state. Partition leadership can be unbalanced. Replication can be falling behind across the fleet. The controller can fail silently. None of these conditions are visible from a single broker's perspective.
This article covers cluster-level monitoring: the aggregated, cross-broker view that reveals how the cluster is functioning as a unit. It covers the metrics that matter at this layer, how to collect them across a broker fleet, how to alert on them effectively, and how to diagnose the most common cluster-level failures.
Per-broker internals — request thread pools, JVM heap, disk I/O — are covered in the Kafka broker monitoring article. Consumer-side lag monitoring is covered in the Kafka consumer monitoring article. See the Kafka monitoring guide for the full picture.
Key takeaways
- Cluster monitoring is different from broker monitoring: it gives you an aggregated view of replication health, partition distribution, and control plane state across all brokers.
- ActiveControllerCount, OfflinePartitionsCount, and UnderReplicatedPartitions are the three cluster-level metrics that most directly signal data availability risk.
- In KRaft mode (production-ready since Kafka 3.3, the only mode from 4.0), the control plane metrics change: ZooKeeper-era metrics are replaced by KRaft-native equivalents.
- A cluster health check script that verifies five or six key metrics is usually enough to catch the most serious problems; the value is in running it consistently and alerting on deviations.
- Capacity planning at the cluster level is about trend monitoring, not single-point thresholds — watch bytes-in per broker, disk utilisation rate, and partition count per broker over time.
What is Kafka cluster monitoring?
Cluster monitoring is the practice of observing the aggregated state of all brokers rather than the internals of any single one. The distinction matters in practice.
Broker monitoring tells you whether an individual broker is healthy: its request thread utilisation, JVM heap pressure, disk throughput, and local ISR state. Cluster monitoring tells you whether the cluster as a whole is functioning: whether replication is consistent across the fleet, whether partition leadership is balanced, whether the controller is in a valid state, and whether the cluster has capacity headroom.
Both layers are necessary. Cluster health checks can miss per-broker pathologies that are not yet visible at the aggregate level. A JVM GC pause on one broker, for example, may not immediately register as a replication deficit on the cluster health dashboard. Conversely, per-broker monitoring alone will not surface a partition imbalance, a controller failure, or a sustained ISR shrink rate that only becomes meaningful when measured across the fleet.
The same JMX interface exposes data at both levels. The difference is in how you query and aggregate it.
Key cluster-level metrics
Cluster-level metrics fall into two categories: metrics that are properties of the cluster as a whole (controller state, partition distribution), and metrics that are only meaningful when aggregated across brokers (replication health, throughput). The four groups below cover both.
Control plane health
The controller is the broker responsible for partition leader elections and cluster metadata management. There is always exactly one active controller in a functioning cluster. Deviations are critical.
KRaft mode: In KRaft mode (production-ready since Kafka 3.3, the only mode from 4.0, which removed ZooKeeper), the following metrics replace ZooKeeper-era control plane telemetry. Verify MBean paths against the Kafka version in use.
A KRaft quorum of size n (typically an odd number: 3 or 5) requires a strict majority of floor(n/2) + 1 active nodes to elect a leader and accept metadata commits. If the active quorum drops below this threshold, the control plane enters a read-only mode. Brokers continue to process produce and consume requests for existing partitions, but topic creation, partition reassignment, and ISR state modifications are blocked. If a partition leader fails while the quorum is unavailable, client requests to that partition will time out.
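The majority rule above is easy to check with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical and nothing here talks to a real cluster.

```python
def quorum_majority(n: int) -> int:
    """Smallest number of voters that forms a strict majority of n."""
    if n < 1:
        raise ValueError("quorum size must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of voters that can be lost while a majority remains."""
    return n - quorum_majority(n)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum_majority(size), tolerated_failures(size))
```

A quorum of 4 tolerates the same single failure as a quorum of 3, which is why odd sizes are the usual choice.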
Replication fleet health
These are the most operationally significant cluster-level metrics. They reflect whether data is being replicated as configured across all brokers, not just on one.
Note that UnderReplicatedPartitions is technically a per-broker metric, but it only becomes useful as a cluster-level aggregate. A URP count of zero on one broker says nothing about the remaining brokers. Sum it across the fleet.
Partition distribution
These metrics indicate whether work is balanced evenly across the broker fleet. Imbalances affect both throughput and fault tolerance.
Uneven partition or leader counts after a broker restart or rolling upgrade are common and typically self-correct via the preferred leader election mechanism. Alert only on sustained imbalance.
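One way to express "alert only on sustained imbalance" is to require several consecutive samples above the threshold before raising anything. The sketch below assumes you already collect an imbalance ratio at a fixed interval; the function name, threshold, and sample count are hypothetical.

```python
from typing import Sequence


def is_sustained(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """True if the most recent min_consecutive samples all exceed threshold."""
    if min_consecutive < 1 or len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])
```

With one sample per minute, `is_sustained(ratios, 0.2, 10)` only returns True after ten minutes of continuous imbalance, so a brief skew after a restart does not trigger it.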
Throughput and capacity signals
These metrics inform both capacity planning and short-term incident response.
Cluster vs broker monitoring: where the line sits
The cluster and broker layers are not always cleanly separable. Several metrics appear in per-broker JMX output but only carry meaning when interpreted at the cluster level.
UnderReplicatedPartitions is the clearest example: it is exposed per broker, but aggregating it across the fleet gives you the total replication deficit. A single broker reporting zero URPs is not informative if another broker is lagging. ActiveControllerCount is only interpretable as a cluster-wide sum — a value of 1 on one broker is normal; a cluster-wide sum of 0 or 2 is critical. Individual broker throughput metrics (BytesInPerSec per broker) are per-broker values that feed cluster-level capacity planning by revealing which brokers are carrying disproportionate load.
For replication and controller metrics, always aggregate. For throughput, collect per-broker and compare across the fleet. The Kafka broker monitoring article covers the internals — request thread saturation, JVM heap, disk flush latency — that complement the cluster view covered here.
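The aggregation rule can be sketched as a small function over per-broker samples. This is a minimal illustration, assuming the samples have already been collected into a dictionary; the field names (`active_controller`, `urp`, `bytes_in`) and the 1.5x hot-broker factor are assumptions, not Kafka or exporter names.

```python
from statistics import mean


def aggregate_cluster_state(per_broker: dict, hot_factor: float = 1.5) -> dict:
    """Combine per-broker samples into cluster-level signals.

    per_broker maps broker id -> {"active_controller": int, "urp": int, "bytes_in": float}.
    """
    if not per_broker:
        raise ValueError("no broker samples supplied")
    # Controller and replication metrics are summed across the fleet.
    controllers = sum(m["active_controller"] for m in per_broker.values())
    total_urp = sum(m["urp"] for m in per_broker.values())
    # Throughput stays per broker and is compared against the fleet average.
    avg_in = mean(m["bytes_in"] for m in per_broker.values())
    hot = sorted(
        broker for broker, m in per_broker.items()
        if avg_in > 0 and m["bytes_in"] > hot_factor * avg_in
    )
    return {"controller_ok": controllers == 1, "total_urp": total_urp, "hot_brokers": hot}
```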
Multi-broker observability setup
The collection challenge
JMX is a per-process interface. To get a cluster-wide picture, you need to scrape every broker's JMX endpoint and aggregate the results centrally. In a three-broker cluster this is manageable manually; in a fleet of dozens, service discovery and automated aggregation become essential.
The most common production approach is Prometheus with the JMX Exporter, because it handles service discovery natively and the aggregation logic lives in PromQL.
Prometheus and JMX Exporter
Run the JMX Exporter as an in-process Java agent on each broker. Compared with a remote poller, the agent avoids setting up remote JMX authentication and the extra JVM threads that remote polling requires:
KAFKA_OPTS="-javaagent:/opt/prometheus/jmx_prometheus_javaagent-0.16.1.jar=7071:/etc/kafka/kafka-jmx-config.yml"
The agent exposes Kafka's JMX MBeans as Prometheus-formatted metrics on an HTTP endpoint (typically port 7071). Configure Prometheus with a scrape job that discovers all broker endpoints. In Kubernetes, a ServiceMonitor resource handles this automatically; in bare-metal or VM deployments, use static targets in prometheus.yml.
Aggregation across brokers happens in PromQL. To get the cluster-wide under-replicated partition count:
sum(kafka_server_replicamanager_underreplicatedpartitions)
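The same query can be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below is a minimal example using only the standard library; the helper names and the base URL in the usage note are hypothetical. An empty result is returned as `None` rather than 0, because "no series" usually means nothing was scraped, not that the cluster is healthy.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_value(payload: dict) -> Optional[float]:
    """Return the first sample value of an instant query, or None if no series matched."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    return float(result[0]["value"][1])


def query_prometheus(base_url: str, expr: str, timeout: float = 5.0) -> Optional[float]:
    """Run an instant query and return its single value."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return parse_instant_value(json.load(resp))
```

For example, `query_prometheus("http://prometheus:9090", "sum(kafka_server_replicamanager_underreplicatedpartitions)")` would return the cluster-wide URP count, assuming Prometheus is reachable at that address.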
Kafka-specific JMX Exporter YAML configuration files are maintained by the community. The Bitnami and confluentinc examples on GitHub are widely used starting points and include pre-built allowlists for the metrics covered in this article.
Other collection approaches
Datadog: Uses JMXFetch via the Agent daemon. Autodiscovery maps JMX ports to integrations automatically. The default cap of 350 metrics per instance means you will need to configure which metrics are collected carefully for larger clusters.
Kpow (Factor House): Uses the native Kafka Admin Client and Consumer APIs rather than JMX — no sidecar agent or broker-side configuration changes required.
If you're evaluating monitoring tools, the Kafka observability tools comparison article covers the trade-offs in more detail.
Cluster health check script
A short Python script that checks six cluster-level conditions gives you a pass/fail signal suitable for cron jobs, CI pipelines, or on-call runbooks. The value is in running it consistently on a schedule and routing its output to your alerting channel.
What the script checks
- ActiveControllerCount sum equals exactly 1
- OfflinePartitionsCount equals 0
- UnderReplicatedPartitions sum equals 0
- UnderMinIsrPartitionCount equals 0
- ISR shrink rate is below a configurable threshold
- Leader count distribution: the gap between the brokers holding the most and the fewest leaders stays within a configurable fraction of the maximum
Implementation
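The Kafka Admin Client (kafka-python or confluent-kafka) exposes cluster metadata directly without requiring JMX access. With kafka-python, KafkaAdminClient.describe_cluster() returns broker and controller state, and describe_topics() returns per-partition leader and replica assignments. The confluent-kafka AdminClient offers equivalent calls (describe_cluster() and list_topics()), but the return types differ, so do not mix the two libraries in one script.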
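UnderReplicatedPartitions and UnderMinIsrPartitionCount are not exposed via the Admin API; for those, query the JMX Exporter HTTP endpoint if Prometheus is already deployed, or fall back to jmxterm. The listing below implements three of the six checks (active controller, under-replicated partitions, and leader skew); the remaining checks follow the same pattern.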
import sys

import requests
from kafka import KafkaAdminClient  # kafka-python

BROKERS = ["broker1:9092", "broker2:9092", "broker3:9092"]
JMX_EXPORTER_HOSTS = ["broker1:7071", "broker2:7071", "broker3:7071"]
LEADER_SKEW_THRESHOLD = 0.20
# The metric name depends on your JMX Exporter rules; adjust it to match.
URP_METRIC = "kafka_server_replicamanager_underreplicatedpartitions"


def check_active_controller(admin):
    # kafka-python returns a dict; controller_id is -1 when no controller is known.
    controller_id = admin.describe_cluster().get("controller_id", -1)
    if controller_id is None or controller_id < 0:
        return False, "No active controller"
    return True, f"Controller: broker {controller_id}"


def check_urp_from_jmx(hosts):
    """Sum URP across all brokers; an unreachable exporter counts as a failure."""
    total_urp = 0.0
    unreachable = []
    for host in hosts:
        try:
            r = requests.get(f"http://{host}/metrics", timeout=5)
            r.raise_for_status()
        except requests.RequestException as e:
            print(f"  Warning: could not reach {host}: {e}")
            unreachable.append(host)
            continue
        for line in r.text.splitlines():
            # Comment lines start with "#", so startswith() skips HELP/TYPE lines.
            if line.startswith(URP_METRIC):
                total_urp += float(line.split()[-1])
    ok = total_urp == 0 and not unreachable
    return ok, f"URP count: {int(total_urp)}, unreachable exporters: {len(unreachable)}"


def check_leader_skew(admin):
    # Seed every known broker with zero so a broker holding no leaders is counted.
    leader_counts = {b["node_id"]: 0 for b in admin.describe_cluster()["brokers"]}
    # describe_topics() returns one dict per topic with a "partitions" list.
    for topic in admin.describe_topics():
        for partition in topic["partitions"]:
            leader = partition["leader"]
            if leader < 0:  # -1 means the partition currently has no leader
                continue
            leader_counts[leader] = leader_counts.get(leader, 0) + 1
    counts = list(leader_counts.values())
    if not counts or max(counts) == 0:
        return True, "No partitions"
    skew = (max(counts) - min(counts)) / max(counts)
    ok = skew <= LEADER_SKEW_THRESHOLD
    return ok, f"Leader skew: {skew:.1%} (threshold {LEADER_SKEW_THRESHOLD:.0%})"


def run_health_check():
    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
    try:
        results = [
            ("ActiveControllerCount", *check_active_controller(admin)),
            ("UnderReplicatedPartitions", *check_urp_from_jmx(JMX_EXPORTER_HOSTS)),
            ("LeaderSkew", *check_leader_skew(admin)),
        ]
    finally:
        admin.close()
    print("Kafka cluster health check")
    for check, passed, detail in results:
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {check}: {detail}")
    return all(passed for _, passed, _ in results)


if __name__ == "__main__":
    # Exit code 0 on success, 1 on any failed check, for cron and CI use.
    sys.exit(0 if run_health_check() else 1)
Limitations
- The script reflects point-in-time state. A transient ISR shrink during a rolling restart will register as a failure unless you add a wait-and-recheck delay for replication metrics.
- It does not replace continuous time-series monitoring — it is a runbook tool, not a substitute for alerting.
- JVM and OS metrics are not covered here; those are addressed in the broker monitoring article.
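The wait-and-recheck delay mentioned in the first limitation can be sketched as a small wrapper around any check that returns True on success. This is one possible shape, not part of the script above; the function name, attempt count, and delay are hypothetical defaults.

```python
import time
from typing import Callable


def recheck(check: Callable[[], bool], attempts: int = 3, delay_s: float = 30.0,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True as soon as check() passes; False only if every attempt fails."""
    for attempt in range(attempts):
        if check():
            return True
        # Wait between attempts, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Wrapping only the replication checks this way keeps the controller check strict while letting a transient ISR shrink clear before the script reports a failure.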
The consumer monitoring article contains a consumer lag script that complements this one. Together they cover the three main health dimensions: cluster-wide replication and control plane state (this article), per-broker internals (broker monitoring), and consumer group lag (consumer monitoring).
Capacity planning using cluster metrics
Capacity planning at the cluster level is about detecting trends early enough to act before they become incidents. The signals below are leading indicators — most warrant investigation and planning rather than immediate paging.
When to add brokers
The primary signals that a new broker is needed:
- BytesInPerSec per broker is consistently above 70% of the broker's network or disk write capacity. A useful rule of thumb is to size for 3x peak traffic to allow for replication overhead and burst headroom. Replication multiplies total write volume by the replication factor: a cluster with a replication factor of 3 and 1 Gbps of ingest writes roughly 3 Gbps in total, of which about 2 Gbps is inter-broker replication traffic fetched by the two followers.
- RequestHandlerAvgIdlePercent is below 20% on multiple brokers over a sustained period, indicating that the request handler thread pool is saturated. This is typically caused by slow disk flushes, high JVM GC pauses, or high concurrent replication load from lagging followers.
- Disk utilisation is growing at a rate that will exhaust available space within your retention window, with no further compression or retention policy levers available.
- PartitionCount per broker is becoming significantly uneven after broker additions or failures — partition reassignment may be needed before adding a new broker.
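The replication arithmetic behind the first signal can be sketched as follows. This is a rough model rather than a sizing tool: it assumes every partition uses the same replication factor and that each follower fetches one full copy of the ingest stream. The function and field names are illustrative.

```python
def replication_load(ingest_gbps: float, replication_factor: int) -> dict:
    """Split cluster write volume into producer ingest and follower replication."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Each of the (RF - 1) followers fetches one copy of the ingested data.
    follower = ingest_gbps * (replication_factor - 1)
    return {
        "leader_ingest_gbps": ingest_gbps,
        "follower_replication_gbps": follower,
        "total_written_gbps": ingest_gbps + follower,
    }


def over_capacity(bytes_in_per_sec: float, capacity_bytes_per_sec: float,
                  threshold: float = 0.70) -> bool:
    """True if a broker's ingest exceeds the given fraction of its write capacity."""
    return bytes_in_per_sec > threshold * capacity_bytes_per_sec
```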
Partition count planning
Partitions are the unit of parallelism in Kafka. Too few limits throughput; too many increases controller overhead and replication cost.
Track total partition count and partition count per broker. The controller's EventQueueSize rising during periods of high topic creation activity is a signal that the cluster is approaching the limits of what the controller can manage comfortably.
For older Kafka versions (pre-2.6), a commonly cited practical limit is approximately 4,000 partitions per broker, though this varies considerably with hardware and workload. With KRaft, the practical limit is significantly higher — the consolidated metadata management architecture removes the ZooKeeper bottleneck that was the primary constraint in earlier versions. Consult the release notes and benchmark reports for the specific version you are running.
Replication factor auditing
Topics created with replication.factor=1 in a multi-broker cluster represent a silent durability risk. A single broker failure takes those partitions offline with no replication fallback.
Audit replication factors on a schedule using kafka-topics.sh --describe or the Admin API:
from kafka import KafkaAdminClient  # kafka-python

MIN_REPLICATION_FACTOR = 3

admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])
try:
    # describe_topics() with no arguments returns metadata for every topic.
    for topic in admin.describe_topics():
        for partition in topic["partitions"]:
            rf = len(partition["replicas"])
            if rf < MIN_REPLICATION_FACTOR:
                print(f"{topic['topic']} partition {partition['partition']}: "
                      f"replication factor {rf}")
finally:
    admin.close()
Retention and storage forecasting
Monitor disk utilisation per broker and extrapolate the growth rate from the past 7 and 30 days. BytesInPerSec is the primary driver of storage growth once you account for the replication factor and compression codec.
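A simple linear extrapolation is often enough for a first forecast. The sketch below assumes you have (day, bytes used) samples for one broker from your metrics store; the function names are hypothetical, and real growth is rarely perfectly linear, so treat the result as an estimate.

```python
from typing import Optional, Sequence, Tuple


def daily_growth_bytes(samples: Sequence[Tuple[float, float]]) -> float:
    """Average growth per day between the first and last (day, used_bytes) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("samples must span a positive time range")
    return (last_used - first_used) / (last_day - first_day)


def days_until_full(used_bytes: float, capacity_bytes: float,
                    growth_per_day: float) -> Optional[float]:
    """Days until the disk fills at the current rate, or None if usage is not growing."""
    if growth_per_day <= 0:
        return None
    return (capacity_bytes - used_bytes) / growth_per_day
```

Computing the estimate separately over the 7-day and 30-day windows shows whether growth is accelerating: a 7-day forecast that is much shorter than the 30-day one is worth investigating.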
If storage is growing faster than expected, the first lever to check is log retention settings — both time-based (retention.ms) and size-based (retention.bytes). Reducing retention decreases storage pressure but can break consumers that have fallen behind. Document the trade-off before changing retention on production topics.
Alerting strategy for cluster-level monitoring
Not all metrics warrant the same response. Structuring alerts into two tiers avoids alert fatigue and ensures the right response time for each condition.
Critical alerts (page on-call immediately)
Example Prometheus alert rules for the two most critical conditions (metric names depend on your JMX Exporter rules, so adjust them to match your setup):
groups:
  - name: kafka_cluster_alerts
    rules:
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_offline_partitions_count > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Kafka offline partitions detected"
          description: "{{ $value }} offline partitions on {{ $labels.instance }}. Read and write requests are blocked."
          runbook_url: "https://wiki.internal/runbooks/kafka-offline-partitions"
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under-replicated partitions"
          description: "{{ $value }} under-replicated partitions across the cluster for more than 5 minutes."
          runbook_url: "https://wiki.internal/runbooks/kafka-under-replicated"
Warning alerts (alert team channel, investigate within the hour)
Alert inhibition
When a broker fails, it typically triggers a cascade: a critical broker-down alert alongside secondary consumer lag and replication warnings. Alertmanager's inhibition rules let you suppress those secondary warnings while the root-cause alert is active, keeping the on-call view clean:
inhibit_rules:
  - source_match:
      alertname: 'KafkaOfflinePartitions'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'instance']
Alert fatigue note: ISR shrinks are expected during rolling restarts and broker upgrades. Consider suppressing IsrShrinksPerSec warnings during maintenance windows, or requiring a minimum duration of 10 minutes before the alert fires.
Common Kafka cluster-level issues and how to resolve them
Best practices for Kafka cluster monitoring
- Monitor ActiveControllerCount as a cluster-wide sum, not per broker. A per-broker reading of 1 is normal; only the sum tells you whether the cluster has exactly one controller.
- Treat UnderReplicatedPartitions as a cluster-level aggregate. Sum it across all brokers. A URP count of zero on one broker says nothing if another broker is lagging.
- Separate replication traffic from consumer traffic in your throughput metrics. If BytesOutPerSec includes both, you cannot tell whether a spike is from a consumer or from replication catch-up.
- Set alert durations on replication metrics, not just thresholds. An ISR shrink during a rolling restart is expected; one that persists for five minutes is not.
- Audit replication factors on a schedule. Topics created with non-standard replication factors are common after high-velocity development cycles and represent silent durability risk.
- In KRaft mode, add LastAppliedRecordLagMs and MetadataErrorCount to your standard cluster health dashboard. These have no ZooKeeper equivalents and are easy to overlook when migrating an existing monitoring setup.
- Run a cluster health check script on a cron schedule (every five minutes works well) and route its output to your alerting channel. It catches conditions that continuous metric alerting can miss during gaps in scrape coverage.
Monitor Kafka clusters with Kpow
Kpow connects to any Kafka cluster using the native Admin Client and Consumer APIs — no JMX Exporter, no sidecar agent, and no broker-side configuration changes required. It surfaces the cluster-level metrics covered in this article — under-replicated partitions, controller state, partition distribution, throughput per broker — on a single dashboard, with replication health and partition distribution views that update in real time.
You can give Kpow a try with a free 30-day trial. Connect it to any Kafka cluster in minutes and deploy via Docker, Helm, or JAR.