
Kafka monitoring: a complete guide for platform engineers
Table of contents
Apache Kafka is designed to be fast and fault-tolerant, but those properties only hold when you can see what is happening inside the cluster. Without instrumentation, a broker disk filling up, a follower falling behind, or a consumer group stalling can each go unnoticed until the damage is done.
This guide covers every layer of the Kafka monitoring surface: brokers, replication, consumers, producers, JVM, storage, and tooling. It is written for platform engineers and data engineers who are responsible for running Kafka in production and who need concrete, actionable guidance rather than a catalogue of metric names.
What is Kafka monitoring?
Kafka monitoring is the practice of collecting, observing, and alerting on the operational signals that describe the health and performance of a running Kafka cluster. The monitoring surface spans three layers:
- JMX metrics exposed by the broker and client JVMs through Java Management Extensions, covering everything from replication state to request handler utilisation
- OS and infrastructure metrics that describe the host resources Kafka depends on: disk I/O, network throughput, file descriptors, and page cache pressure
- Application-level metrics that measure the end-to-end impact of Kafka on the business: consumer lag as a time delay, producer error rates, and pipeline latency
Metrics are exposed via two separate registries. Server-side internals use the Yammer Metrics library; Java-based clients use the Kafka Metrics registry. Both expose their telemetry through JMX, making performance data available to external systems such as Prometheus exporters, Datadog, and Dynatrace.
By default, remote JMX access is disabled to prevent unauthenticated access to the JVM. To enable it, set the JMX_PORT environment variable in your broker startup script, or configure the com.sun.management.jmxremote.* system properties explicitly. You can verify metric exposure with Kafka's built-in kafka-run-class.sh kafka.tools.JmxTool.
KRaft and ZooKeeper
ZooKeeper was deprecated in Kafka 3.x and removed entirely in Kafka 4.0. Modern clusters run in KRaft mode, where metadata is managed via a Raft-based consensus protocol inside the broker JVM itself. This architectural shift changes what you monitor: ZooKeeper session health, ZkClient/SessionExpiredPerSec, and the ControlPlaneNetworkProcessorAvgIdlePercent metric are no longer present. In their place, KRaft introduces consensus metrics under kafka.server:type=raft-metrics and a new set of metadata loader and controller queue metrics. The relevant KRaft-specific sections are called out throughout this guide.
Why Kafka monitoring matters
Kafka is designed to be resilient, but specific failure modes are silent until they cause data loss or downstream outages.
Replication lag becomes data loss. When a follower falls out of the In-Sync Replicas (ISR) set, the cluster's effective replication factor drops. If a second broker fails before the first recovers, you can lose data permanently. This happens in minutes; without an alert on UnderReplicatedPartitions, you will not know until a producer starts receiving NotEnoughReplicasException.
Consumer lag is an invisible SLA breach. A consumer group can be running, committing offsets, and processing records while accumulating a lag of millions of messages behind the producer. The application looks healthy. Downstream dashboards, notifications, and fraud checks are operating on data that is hours old.
Broker failure cascades. A single slow broker can trigger ISR shrinks across every partition it leads, which can then cause leader elections and controller re-elections if the GC pause lasts long enough. What starts as a noisy garbage collection cycle on one node can propagate into a cluster-wide availability event within a minute.
The business cost is concrete. A post-incident analysis of a production Kubernetes-hosted Kafka failure documented $240,000 in SLA penalties over 62 minutes, driven by a cascading broker crash that pushed consumer lag to 14 hours across 10,000 topics.
The 10 most critical Kafka metrics
The table below is the minimum monitoring footprint for any production Kafka cluster. Every metric here has caused production incidents when left unmonitored.
Kafka monitoring best practices
1. Instrument with JMX exporter and Prometheus from day one
Retrofitting observability onto a running Kafka cluster is painful. The Prometheus JMX Exporter runs as a Java agent inside the broker JVM, translating nested MBean attributes into flat, label-enriched Prometheus series. Configuring it requires a static YAML rule file and a JVM flag at startup:
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:./jmx_prometheus_javaagent.jar=8080:./kafka-metrics.yml"
Consumer group lag is not exposed through JMX directly; it requires a separate kafka-exporter that polls the AdminClient API and calculates the delta between log end offsets and committed consumer offsets. Deploy both from the start.
2. Alert on UnderReplicatedPartitions > 0 without delay
A non-zero under-replicated partition count means your replication factor is not being honoured. The cluster is operating with reduced durability. If one more broker fails before the ISR recovers, you can permanently lose data. There is no steady-state condition that produces a non-zero reading in a healthy cluster, so any positive value should trigger an immediate alert and investigation.
Pay equal attention to the ISR shrink rate (IsrShrinksPerSec). Many teams watch only for offline partitions, which means silent degradation: the ISR can shrink from 3 to 2 to 1 over hours without triggering an alert, only surfacing as a critical incident when the last replica fails.
3. Track consumer lag as a rate of change, not just an absolute value
A lag of 50,000 messages on a topic that processes 100,000 messages per second represents 500ms of delay. On a topic that processes 10 messages per second, it represents more than 80 minutes. An absolute offset threshold is meaningless without context.
Implement time-based lag thresholds wherever possible. A practical formula for topics where time-based monitoring is not directly available:
Critical threshold (in messages) = target_SLO_seconds × consumption_rate_per_second × (1 + safety_margin)
For a payment processor consuming 100 messages per second with a 120-second SLO and a 20% safety margin: 120 × 100 × 1.2 = 14,400 messages.
Alert on a growing lag trend before the lag itself becomes critical. LinkedIn's open-source Burrow evaluates lag over a sliding window of offset commits rather than at an absolute threshold, categorising consumer group health as OK, WARNING, ERROR, or STALL, catching stalled consumers even when their lag is temporarily low.
4. Set JVM heap and GC budgets before tuning anything else
Kafka brokers run on the JVM. A garbage collection pause on a follower broker suspends its ReplicaFetcherThread. If that pause exceeds replica.lag.time.max.ms (default 30 seconds), the partition leader evicts the follower from the ISR. The resulting ISR shrink can cascade into write failures for producers using acks=all if the ISR drops below min.insync.replicas.
For brokers with heap sizes under 16GB, G1GC with -XX:MaxGCPauseMillis=20 and -XX:InitiatingHeapOccupancyPercent=35 is a sensible baseline. Set -XX:G1HeapRegionSize=16M to prevent large message batches from being classified as humongous objects. For large heaps (32GB+), Generational ZGC (-XX:+UseZGC -XX:+ZGenerational) delivers sub-millisecond pause times at the cost of roughly 10-20% more CPU.
The key monitoring signal: a GC pause exceeding 200ms P99 should trigger a warning. A pause exceeding 500ms P99 should trigger a page.
5. Distinguish broker-side lag from consumer-side lag before taking action
When consumer lag is growing, the cause is either upstream (the producer is writing faster than the consumer can keep up) or downstream (the consumer is processing too slowly). These have different remediation paths.
Start by comparing BytesInPerSec (producer throughput) against consumer polling rate. If throughput has spiked, add partitions and consumer instances. If throughput is flat but lag is growing, look at consumer processing time: check records-consumed-rate versus records-lag-max, and look for slow synchronous I/O inside the poll loop, such as database writes, external HTTP calls, or CPU-bound transformations. These are the most common source of consumer lag in practice.
Also check for JVM GC pauses on the consumer side. A consumer GC pause exceeding session.timeout.ms (default 45 seconds) causes the broker to evict the consumer and trigger a rebalance, which itself pauses consumption and accelerates lag accumulation.
6. Use per-topic and per-partition views, not just cluster-level aggregates
Cluster-level aggregates mask partition-level problems. A single hot partition receiving 80% of traffic, or a single topic with a replication.factor of 1 in a three-broker cluster, will not show up in aggregate health dashboards.
Monitor LeaderCount and PartitionCount per broker to detect partition imbalance. An uneven leader distribution means some brokers are handling a disproportionate share of read and write operations. Use kafka-leader-election.sh for preferred leader elections when imbalance is detected.
For consumer lag, alert at the consumer group + topic + partition level, not just at the group level. Partition hot-spotting, where a single partition receives a disproportionate share of traffic due to key skew, concentrates lag on one consumer thread while others are idle.
7. Test your alerting pipeline regularly
An alert that fires but never reaches the on-call engineer is operationally equivalent to having no alert at all. Run periodic drills where you deliberately trigger alert conditions in a non-production environment and verify end-to-end delivery: metric fires, alert routes, notification delivers, runbook is accessible.
For KRaft clusters, also verify that your monitoring tooling handles the new metric namespaces. Several tools still default to ZooKeeper-era MBean paths. Metrics like ControlPlaneNetworkProcessorAvgIdlePercent no longer exist in Kafka 4.0 clusters; dashboards using them will show no data without erroring, creating silent gaps.
Cluster and broker health monitoring
Broker health metrics
A healthy broker has:
ActiveControllerCountequal to 1 (on exactly one broker in the cluster)OfflinePartitionsCountequal to 0UnderReplicatedPartitionsequal to 0- Request handler and network processor idle percentages above 30%
The request processing model uses two thread pools. Network processor threads handle socket connections and read incoming data frames. Request handler (I/O) threads dequeue those frames and execute the actual disk reads and writes. The two idle percentage metrics reflect the load on each pool.
When NetworkProcessorAvgIdlePercent falls below 0.3, network threads are saturated, likely from a connection storm or a sudden spike in short-lived connections. When RequestHandlerAvgIdlePercent falls below 0.2, the I/O subsystem is the bottleneck, which typically means disk write pressure or GC-induced slowdowns.
Total request latency breaks down into:
RequestQueueTimeMs— time waiting in queue for a handler threadLocalTimeMs— active processing on the partition leaderRemoteTimeMs— waiting for follower acknowledgment (whenacks=all)ResponseSendTimeMs— serialising and writing the response to the socket
A spike in RemoteTimeMs points to follower lag or network issues between brokers. A spike in LocalTimeMs points to disk pressure on the leader.
Other key broker metrics:
JVM and GC monitoring
Kafka brokers are JVM processes. GC pause behaviour is not a background concern: it directly causes ISR shrinks and consumer lag spikes.
The mechanics are straightforward. When a follower broker experiences a stop-the-world GC pause, its ReplicaFetcherThread is suspended. If the pause exceeds replica.lag.time.max.ms (default 30 seconds), the partition leader evaluates the replica as a "slow follower" using:
val laggingReplicas = candidateReplicas.filter(r =>
(time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
The leader then evicts the paused follower from the ISR, raising IsrShrinksPerSec and driving UnderReplicatedPartitions above zero. If min.insync.replicas is not met after the eviction, producers with acks=all receive NotEnoughReplicasException.
The same dynamic affects consumers. A consumer JVM GC pause that exceeds session.timeout.ms (default 45 seconds) causes the Group Coordinator to evict the consumer and trigger a rebalance. All consumers in the group pause during the rebalance, accumulating lag.
Key JVM metrics to monitor:
G1GC vs ZGC at a glance: G1GC is the sensible default for heaps under 16GB. Its main risk is humongous object allocation: any object over 50% of the G1 region size is allocated directly into the Old Generation, bypassing Young Generation copy collection and causing heap fragmentation. Setting -XX:G1HeapRegionSize=16M raises this threshold to 8MB, which covers most Kafka message batches. Generational ZGC (available in JDK 21+) delivers sub-millisecond pause times by running all collection phases concurrently with the application, but requires at least 8 vCPUs and 32GB of heap to run stably, and consumes 10-20% more CPU than G1GC. For containerised deployments with cgroup CPU limits below 4 vCPUs, G1GC is the safer choice.
KRaft metadata monitoring
Kafka 4.0 removes ZooKeeper entirely. In KRaft mode, a subset of brokers act as controllers, maintaining the cluster metadata log (__cluster_metadata) via the Raft consensus protocol.
What ZooKeeper monitoring you can remove:
zookeeper.connect,zookeeper.session.timeout.ms,ZkClient/SessionExpiredPerSecControlPlaneNetworkProcessorAvgIdlePercent(replaced by the unifiedNetworkProcessorAvgIdlePercent)- ZooKeeper
ruokport-2181 health checks
What to add for KRaft:
The broker registration state (kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X) has three states: Fenced (not participating in leadership), Controlled Shutdown (graceful exit), and Active (fully operational). Fenced brokers do not serve client traffic.
For manual diagnostics, use kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status to inspect active leader, current epoch, high watermark, and voter node IDs. This replaces zookeeper-shell.sh for quorum health checks.
Topic and partition health monitoring
Topic-level metrics reveal problems that broker-level aggregates hide.
Key metrics:
Partition imbalance is detected by comparing LeaderCount across brokers. In a healthy cluster, partition leadership is distributed proportionally. If one broker holds 60% of leaders while others hold 20% each, read and write traffic is concentrated on one node, driving up its request handler utilisation while others sit idle.
A spike in MessagesInPerSec that is not accompanied by a proportional increase in BytesInPerSec can indicate a change in average message size, which can affect whether messages are classified as humongous objects in the G1GC allocator.
Consumer lag monitoring
Consumer lag is the most operationally significant Kafka metric for most teams. It is the direct measurement of how far behind downstream processing is from real-time.
The formula is simple:
Consumer lag = Log End Offset (partition) - Consumer Committed Offset (partition)
The complexity lies in what that number means. A consumer group may be processing 9,000 messages per second while an upstream producer pushes 10,000. The consumer is running. Throughput dashboards show it as healthy. But it is accumulating 1,000 messages of lag every second.
Consumer lag patterns
Lag does not accumulate uniformly. Different accumulation patterns indicate different root causes:
What causes consumer lag in practice
Slow processing inside the poll loop. Synchronous database writes, blocking HTTP calls, and CPU-bound transformations inside the record processing loop are the most frequent cause. Each blocked thread cannot call poll(), causing the broker to eventually view the consumer as dead and trigger a rebalance.
JVM GC pauses. A consumer GC pause exceeding session.timeout.ms (default 45 seconds) causes the Group Coordinator to evict the consumer and trigger a full group rebalance. All consumers pause during rebalance. The recovered consumer then faces a large backlog, which can exceed max.poll.interval.ms if processing is slow, triggering another rebalance.
Poison pill records. A single malformed record that a consumer cannot process without throwing an exception will cause that consumer to crash-restart in an infinite loop against the same offset. Lag accumulates linearly on that partition while the consumer cycles. Every record queued behind the bad offset is blocked.
Partition hot-spotting. When a business key is highly concentrated (for example, a major tenant ID), one partition receives a disproportionate share of traffic. One consumer thread is at 100% CPU while the rest are idle. Lag accumulates on the hot partition while aggregate lag metrics look manageable.
Transactional producer stalls. Consumers in read_committed isolation mode will not receive messages from open, uncommitted transactions. If a transactional producer crashes with an open transaction, downstream consumers stall until the transaction coordinator times out the transaction, which defaults to transaction.timeout.ms = 60000 (60 seconds).
Consumer lag monitoring approach
Use time-based thresholds where possible, with different SLOs per topic type:
Fraud detection: Warning at 10s lag | Critical at 30s
Payment processor: Warning at 30s lag | Critical at 2m
Analytics ETL: Warning at 10m lag | Critical at 30m
For offset-based thresholds where time-based metrics are unavailable:
Critical threshold = target_SLO_seconds × consumption_rate_per_second × (1 + safety_margin)
For a payment processor consuming 100 messages/second with a 120-second SLO and a 20% safety margin: 120 × 100 × 1.2 = 14,400 messages.
Alert on the rate of change of lag, not just its current value. A lag that is 5,000 messages and growing at 1,000 messages per minute needs attention before it crosses any absolute threshold.
Producer monitoring
Kafka producers expose client-side metrics under kafka.producer:type=producer-metrics,client-id=(...).
Key producer metrics:
The failure cascade for a saturated producer works as follows: the broker enforces a quota and delays its response to the producer. The producer retains batches longer, buffer-available-bytes falls. Once the buffer is exhausted, application threads calling send() block, waiting-threads spikes, and the application is effectively paused.
A sudden rise in record-error-rate can indicate network problems, broker-side disk issues, ACL misconfigurations, or the producer hitting max.message.bytes. Check the corresponding broker metric BytesRejectedPerSec (kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec) to determine if the broker is actively rejecting oversized batches.
Replication and data durability monitoring
Replication metrics are the most critical class of Kafka metrics. They describe whether your data is actually protected.
The key metrics:
UncleanLeaderElectionsPerSec > 0 indicates a data loss event. An out-of-sync replica became the partition leader, which means any records that were committed to the previous leader's log but not yet replicated to this replica are permanently gone. This is not a warning state; it is an incident that requires forensic investigation.
The relationship between replication.factor, min.insync.replicas, and ISR count determines your durability envelope. With a 3-replica topic and min.insync.replicas=2:
- ISR = 3: full durability, normal operation
- ISR = 2: degraded redundancy, writes succeed, one more failure loses data
- ISR = 1:
acks=allwrites fail; the cluster is protecting you from data loss by refusing writes
Monitor IsrShrinksPerSec as a leading indicator before checking UnderReplicatedPartitions. The shrink rate tells you the cluster is in the process of degrading; the partition count tells you how far it has degraded.
Performance and throughput monitoring
Throughput metrics reveal whether the cluster is approaching a resource ceiling, and which resource is the binding constraint.
Throughput metrics:
To identify whether a throughput ceiling is a network, disk, or CPU constraint:
- Network saturation:
BytesInPerSec + BytesOutPerSecapproaching NIC bandwidth, combined with a drop inNetworkProcessorAvgIdlePercent - Disk saturation:
RequestHandlerAvgIdlePercentdropping whileLocalTimeMsincreases; elevatedLogFlushRateAndTimeMs - CPU saturation: Host CPU above 70% sustained, often correlated with message format conversions — check
MessageConversionsTimeMs
Note that BytesOutPerSec is typically two to three times BytesInPerSec in a cluster with a replication factor of 3, because follower fetch traffic is counted as outbound. A sudden rise in BytesOutPerSec without a corresponding rise in BytesInPerSec can indicate a catch-up read: a consumer group that fell behind is now reading cold log segments from disk, which also degrades page cache for all other consumers on that broker.
Storage and retention monitoring
Kafka fills disks faster than most teams expect. The calculation is: data_rate × replication_factor × retention_period. A 1 GB/s ingest rate with replication factor 3 and 7-day retention requires 1.8 TB of raw disk per broker at minimum, before accounting for segment overhead.
Storage metrics to track:
Log compaction can fall behind write rate in high-throughput compacted topics, causing the log to grow beyond what compaction alone would maintain. The leading indicator is a growing LogEndOffset on compacted topics without corresponding LogStartOffset advancement. Compaction threads are configured via log.cleaner.threads; if the disk is I/O bound, adding cleaner threads will not help — you need faster storage.
Retention policy misconfiguration is a common source of disk exhaustion. Time-based retention (log.retention.hours) and size-based retention (log.retention.bytes) interact: Kafka enforces whichever threshold is hit first. A topic configured for 7-day retention with no size limit on a high-throughput cluster can consume terabytes before the time window triggers.
Infrastructure and OS metrics monitoring
Kafka is sensitive to the underlying host in ways that most applications are not.
Disk I/O. Kafka is a disk-bound workload. It relies on the OS page cache for low-latency reads and sequential writes. Monitor iowait and disk read/write throughput at the OS level. Elevated iowait correlates directly with increased LocalTimeMs in broker request latency.
Network. With a replication factor of 3, every byte written to a broker is retransmitted twice to followers. Monitor both inbound and outbound interface utilisation. On cloud instances, also watch for traffic-shaping events (AWS CloudWatch TrafficShaping metric): cloud providers will silently throttle network traffic when a burstable instance hits its baseline, causing latency spikes that are difficult to attribute without this metric.
Page cache. The OS page cache is Kafka's primary read acceleration mechanism. Consumer groups that are actively tailing the log read from page cache, not from disk. A consumer group that falls significantly behind forces the broker into cold disk reads, which evict hot data from the page cache and degrade all other consumers on that broker. Monitor OS-level memory pressure: if the page cache is under pressure, iowait and request latency will climb together.
File descriptors. Kafka opens one file descriptor per log segment. In a cluster with many topics and partitions, file descriptor exhaustion is a real failure mode. Set nofile limits to at least 100,000 per process and monitor current FD usage.
Memory swappiness. Set vm.swappiness = 1 to keep JVM heap pages in physical RAM. Swapping JVM pages to disk during garbage collection cycles introduces severe disk I/O latency, compounding pause times.
Kafka Connect monitoring
Kafka Connect is a separate JVM process with its own monitoring surface.
Connector and task state. The primary health signal for Connect is the connector and task status. A connector can be in RUNNING, PAUSED, or FAILED state. Tasks have the same states, plus UNASSIGNED. A FAILED task that does not recover automatically requires investigation of the Connect worker log.
Key Connect metrics:
For source connectors, monitor the lag between the source system and the Kafka topic. For database CDC connectors, this is typically the replication slot lag in the source database.
For sink connectors, monitor the offset commit latency (offset-commit-success-percentage) and any increase in task restart counts. A task that restarts repeatedly is often encountering a poison-pill record in the topic, a downstream write failure that is not being handled by the dead-letter queue configuration, or a connection timeout to the downstream system.
Error records delivered to dead-letter queues should also be monitored. A growing DLQ depth indicates that the connector is encountering records it cannot process, which may signal a schema change, a data quality issue, or a configuration mismatch.
Kafka Streams monitoring
Kafka Streams applications expose metrics under kafka.streams MBeans.
Thread and task state:
For stateful topologies with RocksDB state stores:
The most common operational issue in Kafka Streams is excessive state store size causing the application to lag behind its input topic. Monitor the processing rate (process-rate) against the input topic's MessagesInPerSec. If the Streams application is consistently slower than its input, the topology needs more parallelism, or the state store operations need to be profiled.
Application and business-level monitoring
Infrastructure metrics tell you that the cluster is running. Business-level metrics tell you whether it is doing what you need it to do.
For most teams, the most useful business-level signal derived from Kafka metrics is end-to-end latency: the time between a record being written by a producer and that record being processed by a consumer. This is consumer lag expressed as wall-clock time rather than offset distance.
How to build this:
- Produce a timestamp in the record metadata (or use the Kafka producer timestamp if reliable)
- At the consumer, calculate
now - record.timestampfor each processed record - Track this as a histogram; alert on P95 and P99, not just the mean
Concrete examples of business-level Kafka observability:
- An order processing pipeline: lag in the order-confirmed topic expressed as seconds of latency between order placement and confirmation email dispatch
- A fraud detection pipeline: lag in the transactions topic as a direct multiplier on fraud window size — a 5-minute lag means 5 additional minutes of unchecked transactions
- A real-time dashboard: lag in the events topic as the staleness of displayed data
Building a "Kafka health to business outcome" dashboard that shows consumer lag alongside the downstream metric it affects gives on-call engineers the context to treat lag alerts with appropriate urgency. When it is clear that 2 minutes of lag on the payment-events topic means 2 minutes of delayed fraud detection, the alert becomes easier to prioritise correctly.
Alerting and incident detection
The following table covers the minimum required alert set for a production Kafka cluster. Use this as a starting point and adjust thresholds based on your SLOs.
Alert philosophy notes:
- Alert on
OfflinePartitionsandUncleanLeaderElectionsunconditionally — there is no benign explanation for these conditions in a healthy cluster. - Alert on
UnderReplicatedPartitionsonly after a sustained window (5-10 minutes) to avoid false positives during normal broker restarts and rolling updates. - Avoid alerting on consumer lag using a raw offset count. Express it relative to throughput (time-based) or use a trend-based approach like Burrow.
IsrShrinksPerSecis a leading indicator;UnderReplicatedPartitionsis a lagging one. Alert on both, at different severities.
Capacity planning and forecasting
Capacity planning for Kafka requires tracking how quickly the cluster is consuming its three primary resources: network, disk, and broker count.
Disk capacity projection. The formula for disk required per broker is:
(ingestion_rate × replication_factor × retention_period) / broker_count
At 500 MB/s ingestion with replication factor 3 and 7-day retention across 6 brokers: 500 MB/s × 3 × 604,800s / 6 = ~150 TB raw storage per broker.
Track the rate of disk growth via BytesInPerSec and project forward at current growth rate. Alert when you have less than 30 days of headroom at current trajectory.
Partition count and broker density. In KRaft mode, the historical ZooKeeper-era limit of roughly 4,000 partitions per broker and 200,000 partitions per cluster no longer applies. KRaft's Raft-based metadata management supports millions of partitions. However, each partition still consumes file descriptors, open segment handles, and memory for replication tracking. A practical upper bound of 10,000-20,000 partitions per broker remains sensible for most workloads.
Rebalancing cost. When adding brokers to expand capacity, Kafka does not automatically redistribute partitions. Use Cruise Control or kafka-reassign-partitions.sh to generate and execute reassignment plans. Always throttle reassignment throughput using --throttle <bytes_per_second> to prevent replication traffic from saturating broker network interfaces during business hours.
Kafka monitoring tools
The choice of monitoring tool affects both what you get out of the box and how much engineering is required to reach full coverage.
What requires custom configuration in each tool:
- Prometheus/Grafana: Consumer lag (separate kafka-exporter), custom MBean patterns in the
kafka-2_0_0.ymlYAML rule file, Grafana dashboard variable mapping. JMX histogram metrics are converted to static gauges, losing cumulative sum data needed for accurate rate calculations. - Datadog: JMX-enabled agent image (
-jmxtag), pod annotations for Autodiscovery, custommetrics.yamlfilter to stay within the 350-metric-per-instance limit. AWS MSK clusters require a separate CloudWatch integration — the standard Kafka agent check cannot query MSK directly. - Dynatrace: Topic name filters to prevent metric cardinality explosion (and the associated Davis Data Unit billing overage), and a custom
plugin.jsonschema for non-standard MBeans. The Davis AI anomaly detection engine baselines normal behaviour automatically, which is particularly valuable for consumer lag patterns that vary with traffic shape. - Confluent Control Center: Proprietary Metrics Reporter JAR on every broker classpath, rolling cluster restart to activate. Legacy versions (CP 7.x) use client-side monitoring interceptors that create roughly 50 internal topics and significant metadata overhead. CP 8.0 migrates to Prometheus-based pull collection, requiring a migration if upgrading from the interceptor model. Not compatible with non-Confluent distributions (OSS Kafka, MSK, Aiven).
- Kpow: No JMX or broker-side agent. Kpow queries the AdminClient API directly, eliminating JMX serialisation overhead on the broker. Streams topology visualisation requires embedding the
kpow-streams-agentdependency in the application code. Prometheus egress requires enablingPROMETHEUS_EGRESS=trueand configuring basic authentication.
For organisations that need to minimise broker resource overhead while maintaining full observability, running Prometheus paired with Kpow is a practical architecture: Kpow's metadata engine eliminates JMX sidecars on the brokers, and its Prometheus egress endpoint feeds clean telemetry into a centralised Prometheus TSDB. You can try Kpow free for 30 days — it connects to any Kafka cluster in minutes and can be deployed via Docker, Helm, or JAR.
Summary
Kafka monitoring is not a single dashboard. It is a set of overlapping layers: broker internals, replication state, JVM health, consumer lag, producer behaviour, storage capacity, and OS resources. Each layer has distinct failure modes, and a blind spot in any one of them can become a production incident.
The minimum viable monitoring set for any production cluster:
- Alert immediately on
OfflinePartitionsCount > 0andUncleanLeaderElectionsPerSec > 0 - Alert after a sustained window on
UnderReplicatedPartitions > 0 - Track
IsrShrinksPerSecas a leading indicator of durability degradation - Use time-based consumer lag thresholds, not raw offset counts
- Monitor JVM GC pause duration as a direct predictor of ISR instability
- Watch
RequestHandlerAvgIdlePercentandNetworkProcessorAvgIdlePercentfor broker saturation - For KRaft clusters, add quorum health metrics from
kafka.server:type=raft-metrics
The rest — per-topic metrics, Connect and Streams coverage, business-level lag thresholds — is built on top of this foundation.