Abstract digital artwork featuring smooth, overlapping curved shapes in shades of green and blue on a black background.

Kafka monitoring: a complete guide for platform engineers

Table of contents

Factor House
May 27th, 2026
xx min read

Apache Kafka is designed to be fast and fault-tolerant, but those properties only hold when you can see what is happening inside the cluster. Without instrumentation, a broker disk filling up, a follower falling behind, or a consumer group stalling can each go unnoticed until the damage is done.

This guide covers every layer of the Kafka monitoring surface: brokers, replication, consumers, producers, JVM, storage, and tooling. It is written for platform engineers and data engineers who are responsible for running Kafka in production and who need concrete, actionable guidance rather than a catalogue of metric names.

What is Kafka monitoring?

Kafka monitoring is the practice of collecting, observing, and alerting on the operational signals that describe the health and performance of a running Kafka cluster. The monitoring surface spans three layers:

  • JMX metrics exposed by the broker and client JVMs through Java Management Extensions, covering everything from replication state to request handler utilisation
  • OS and infrastructure metrics that describe the host resources Kafka depends on: disk I/O, network throughput, file descriptors, and page cache pressure
  • Application-level metrics that measure the end-to-end impact of Kafka on the business: consumer lag as a time delay, producer error rates, and pipeline latency

Metrics are exposed via two separate registries. Server-side internals use the Yammer Metrics library; Java-based clients use the Kafka Metrics registry. Both expose their telemetry through JMX, making performance data available to external systems such as Prometheus exporters, Datadog, and Dynatrace.

By default, remote JMX access is disabled to prevent unauthenticated access to the JVM. To enable it, set the JMX_PORT environment variable in your broker startup script, or configure the com.sun.management.jmxremote.* system properties explicitly. You can verify metric exposure with Kafka's built-in kafka-run-class.sh kafka.tools.JmxTool.

KRaft and ZooKeeper

ZooKeeper was deprecated in Kafka 3.x and removed entirely in Kafka 4.0. Modern clusters run in KRaft mode, where metadata is managed via a Raft-based consensus protocol inside the broker JVM itself. This architectural shift changes what you monitor: ZooKeeper session health, ZkClient/SessionExpiredPerSec, and the ControlPlaneNetworkProcessorAvgIdlePercent metric are no longer present. In their place, KRaft introduces consensus metrics under kafka.server:type=raft-metrics and a new set of metadata loader and controller queue metrics. The relevant KRaft-specific sections are called out throughout this guide.

Why Kafka monitoring matters

Kafka is designed to be resilient, but specific failure modes are silent until they cause data loss or downstream outages.

Replication lag becomes data loss. When a follower falls out of the In-Sync Replicas (ISR) set, the cluster's effective replication factor drops. If a second broker fails before the first recovers, you can lose data permanently. This happens in minutes; without an alert on UnderReplicatedPartitions, you will not know until a producer starts receiving NotEnoughReplicasException.

Consumer lag is an invisible SLA breach. A consumer group can be running, committing offsets, and processing records while accumulating a lag of millions of messages behind the producer. The application looks healthy. Downstream dashboards, notifications, and fraud checks are operating on data that is hours old.

Broker failure cascades. A single slow broker can trigger ISR shrinks across every partition it leads, which can then cause leader elections and controller re-elections if the GC pause lasts long enough. What starts as a noisy garbage collection cycle on one node can propagate into a cluster-wide availability event within a minute.

The business cost is concrete. A post-incident analysis of a production Kubernetes-hosted Kafka failure documented $240,000 in SLA penalties over 62 minutes, driven by a cascading broker crash that pushed consumer lag to 14 hours across 10,000 topics.

The 10 most critical Kafka metrics

The table below is the minimum monitoring footprint for any production Kafka cluster. Every metric here has caused production incidents when left unmonitored.

Metric JMX MBean path What it measures Healthy range Alert threshold
UnderReplicatedPartitions kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Partitions whose current ISR count is smaller than the replication factor 0 Page if > 0 sustained for 5 minutes
OfflinePartitionsCount kafka.controller:type=KafkaController,name=OfflinePartitionsCount Partitions with no active leader — reads and writes fail immediately 0 Page immediately if > 0
ActiveControllerCount kafka.controller:type=KafkaController,name=ActiveControllerCount Whether this broker is the active controller; sum across cluster must equal exactly 1 1 on leader, 0 on all others Page if cluster sum is not 1 for 5 minutes
records-lag-max kafka.consumer:type=consumer-fetch-manager-metrics,client-id=(...),name=records-lag-max Maximum offset lag across all partitions assigned to this consumer Topic-dependent Alert on rate of growth, not absolute value — see consumer section
RequestHandlerAvgIdlePercent kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent Fraction of time I/O request handler threads are idle > 0.6 (60%) Warning at < 0.3; critical at < 0.2
NetworkProcessorAvgIdlePercent kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Fraction of time network processor threads are idle > 0.3 (30%) Warning if sustained < 0.3
BytesInPerSec / BytesOutPerSec kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec / BytesOutPerSec Total inbound and outbound byte throughput Varies by cluster Alert if approaching NIC saturation; track trend not absolute
GC pause duration java.lang:type=GarbageCollector,name=G1 Young Generation (or ZGC equivalent) Stop-the-world pause time in milliseconds < 200ms (G1GC), < 5ms (ZGC) Warning at > 200ms P99; critical at > 500ms P99
Disk usage % OS metric (or kafka.log:type=Log,name=Size,topic=(...),partition=(...)) Disk capacity consumed relative to total; Kafka fills disks faster than expected due to replication < 70% Warning at > 70%; critical at > 85%
IsrShrinksPerSec kafka.server:type=ReplicaManager,name=IsrShrinksPerSec Rate at which replicas are being dropped from the ISR 0 Warn on any sustained positive rate; page if growing rapidly

Kafka monitoring best practices

1. Instrument with JMX exporter and Prometheus from day one

Retrofitting observability onto a running Kafka cluster is painful. The Prometheus JMX Exporter runs as a Java agent inside the broker JVM, translating nested MBean attributes into flat, label-enriched Prometheus series. Configuring it requires a static YAML rule file and a JVM flag at startup:

export KAFKA_OPTS="$KAFKA_OPTS -javaagent:./jmx_prometheus_javaagent.jar=8080:./kafka-metrics.yml"

Consumer group lag is not exposed through JMX directly; it requires a separate kafka-exporter that polls the AdminClient API and calculates the delta between log end offsets and committed consumer offsets. Deploy both from the start.

2. Alert on UnderReplicatedPartitions > 0 without delay

A non-zero under-replicated partition count means your replication factor is not being honoured. The cluster is operating with reduced durability. If one more broker fails before the ISR recovers, you can permanently lose data. There is no steady-state condition that produces a non-zero reading in a healthy cluster, so any positive value should trigger an immediate alert and investigation.

Pay equal attention to the ISR shrink rate (IsrShrinksPerSec). Many teams watch only for offline partitions, which means silent degradation: the ISR can shrink from 3 to 2 to 1 over hours without triggering an alert, only surfacing as a critical incident when the last replica fails.

3. Track consumer lag as a rate of change, not just an absolute value

A lag of 50,000 messages on a topic that processes 100,000 messages per second represents 500ms of delay. On a topic that processes 10 messages per second, it represents more than 80 minutes. An absolute offset threshold is meaningless without context.

Implement time-based lag thresholds wherever possible. A practical formula for topics where time-based monitoring is not directly available:

Critical threshold (in messages) = target_SLO_seconds × consumption_rate_per_second × (1 + safety_margin)

For a payment processor consuming 100 messages per second with a 120-second SLO and a 20% safety margin: 120 × 100 × 1.2 = 14,400 messages.

Alert on a growing lag trend before the lag itself becomes critical. LinkedIn's open-source Burrow evaluates lag over a sliding window of offset commits rather than at an absolute threshold, categorising consumer group health as OK, WARNING, ERROR, or STALL, catching stalled consumers even when their lag is temporarily low.

4. Set JVM heap and GC budgets before tuning anything else

Kafka brokers run on the JVM. A garbage collection pause on a follower broker suspends its ReplicaFetcherThread. If that pause exceeds replica.lag.time.max.ms (default 30 seconds), the partition leader evicts the follower from the ISR. The resulting ISR shrink can cascade into write failures for producers using acks=all if the ISR drops below min.insync.replicas.

For brokers with heap sizes under 16GB, G1GC with -XX:MaxGCPauseMillis=20 and -XX:InitiatingHeapOccupancyPercent=35 is a sensible baseline. Set -XX:G1HeapRegionSize=16M to prevent large message batches from being classified as humongous objects. For large heaps (32GB+), Generational ZGC (-XX:+UseZGC -XX:+ZGenerational) delivers sub-millisecond pause times at the cost of roughly 10-20% more CPU.

The key monitoring signal: a GC pause exceeding 200ms P99 should trigger a warning. A pause exceeding 500ms P99 should trigger a page.

5. Distinguish broker-side lag from consumer-side lag before taking action

When consumer lag is growing, the cause is either upstream (the producer is writing faster than the consumer can keep up) or downstream (the consumer is processing too slowly). These have different remediation paths.

Start by comparing BytesInPerSec (producer throughput) against consumer polling rate. If throughput has spiked, add partitions and consumer instances. If throughput is flat but lag is growing, look at consumer processing time: check records-consumed-rate versus records-lag-max, and look for slow synchronous I/O inside the poll loop, such as database writes, external HTTP calls, or CPU-bound transformations. These are the most common source of consumer lag in practice.

Also check for JVM GC pauses on the consumer side. A consumer GC pause exceeding session.timeout.ms (default 45 seconds) causes the broker to evict the consumer and trigger a rebalance, which itself pauses consumption and accelerates lag accumulation.

6. Use per-topic and per-partition views, not just cluster-level aggregates

Cluster-level aggregates mask partition-level problems. A single hot partition receiving 80% of traffic, or a single topic with a replication.factor of 1 in a three-broker cluster, will not show up in aggregate health dashboards.

Monitor LeaderCount and PartitionCount per broker to detect partition imbalance. An uneven leader distribution means some brokers are handling a disproportionate share of read and write operations. Use kafka-leader-election.sh for preferred leader elections when imbalance is detected.

For consumer lag, alert at the consumer group + topic + partition level, not just at the group level. Partition hot-spotting, where a single partition receives a disproportionate share of traffic due to key skew, concentrates lag on one consumer thread while others are idle.

7. Test your alerting pipeline regularly

An alert that fires but never reaches the on-call engineer is operationally equivalent to having no alert at all. Run periodic drills where you deliberately trigger alert conditions in a non-production environment and verify end-to-end delivery: metric fires, alert routes, notification delivers, runbook is accessible.

For KRaft clusters, also verify that your monitoring tooling handles the new metric namespaces. Several tools still default to ZooKeeper-era MBean paths. Metrics like ControlPlaneNetworkProcessorAvgIdlePercent no longer exist in Kafka 4.0 clusters; dashboards using them will show no data without erroring, creating silent gaps.

Cluster and broker health monitoring

Broker health metrics

A healthy broker has:

  • ActiveControllerCount equal to 1 (on exactly one broker in the cluster)
  • OfflinePartitionsCount equal to 0
  • UnderReplicatedPartitions equal to 0
  • Request handler and network processor idle percentages above 30%

The request processing model uses two thread pools. Network processor threads handle socket connections and read incoming data frames. Request handler (I/O) threads dequeue those frames and execute the actual disk reads and writes. The two idle percentage metrics reflect the load on each pool.

When NetworkProcessorAvgIdlePercent falls below 0.3, network threads are saturated, likely from a connection storm or a sudden spike in short-lived connections. When RequestHandlerAvgIdlePercent falls below 0.2, the I/O subsystem is the bottleneck, which typically means disk write pressure or GC-induced slowdowns.

Total request latency breaks down into:

  • RequestQueueTimeMs — time waiting in queue for a handler thread
  • LocalTimeMs — active processing on the partition leader
  • RemoteTimeMs — waiting for follower acknowledgment (when acks=all)
  • ResponseSendTimeMs — serialising and writing the response to the socket

A spike in RemoteTimeMs points to follower lag or network issues between brokers. A spike in LocalTimeMs points to disk pressure on the leader.

Other key broker metrics:

Metric JMX MBean path Alert note
LeaderElectionRateAndTimeMs kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs Any election is notable; frequent elections indicate instability
UncleanLeaderElectionsPerSec kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec Any non-zero value means data loss has occurred — page immediately
RequestQueueSize kafka.network:type=RequestChannel,name=RequestQueueSize Alert if sustained > 10
PurgatorySize (Produce) kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Growing size indicates slow followers or disk saturation

JVM and GC monitoring

Kafka brokers are JVM processes. GC pause behaviour is not a background concern: it directly causes ISR shrinks and consumer lag spikes.

The mechanics are straightforward. When a follower broker experiences a stop-the-world GC pause, its ReplicaFetcherThread is suspended. If the pause exceeds replica.lag.time.max.ms (default 30 seconds), the partition leader evaluates the replica as a "slow follower" using:

val laggingReplicas = candidateReplicas.filter(r =>
 (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)

The leader then evicts the paused follower from the ISR, raising IsrShrinksPerSec and driving UnderReplicatedPartitions above zero. If min.insync.replicas is not met after the eviction, producers with acks=all receive NotEnoughReplicasException.

The same dynamic affects consumers. A consumer JVM GC pause that exceeds session.timeout.ms (default 45 seconds) causes the Group Coordinator to evict the consumer and trigger a rebalance. All consumers in the group pause during the rebalance, accumulating lag.

Key JVM metrics to monitor:

Metric What to look for
GC pause duration (P99) Warning at > 200ms; critical at > 500ms
GC frequency Warning if major GC is occurring more than once every 30 seconds
Heap utilisation post-GC Warning at > 70% of max heap; critical at > 85%

G1GC vs ZGC at a glance: G1GC is the sensible default for heaps under 16GB. Its main risk is humongous object allocation: any object over 50% of the G1 region size is allocated directly into the Old Generation, bypassing Young Generation copy collection and causing heap fragmentation. Setting -XX:G1HeapRegionSize=16M raises this threshold to 8MB, which covers most Kafka message batches. Generational ZGC (available in JDK 21+) delivers sub-millisecond pause times by running all collection phases concurrently with the application, but requires at least 8 vCPUs and 32GB of heap to run stably, and consumes 10-20% more CPU than G1GC. For containerised deployments with cgroup CPU limits below 4 vCPUs, G1GC is the safer choice.

KRaft metadata monitoring

Kafka 4.0 removes ZooKeeper entirely. In KRaft mode, a subset of brokers act as controllers, maintaining the cluster metadata log (__cluster_metadata) via the Raft consensus protocol.

What ZooKeeper monitoring you can remove:

  • zookeeper.connect, zookeeper.session.timeout.ms, ZkClient/SessionExpiredPerSec
  • ControlPlaneNetworkProcessorAvgIdlePercent (replaced by the unified NetworkProcessorAvgIdlePercent)
  • ZooKeeper ruok port-2181 health checks

What to add for KRaft:

Metric JMX MBean path What it means
current-state kafka.server:type=raft-metrics,name=current-state Node's role: leader, follower, candidate, or observer. Sustained "candidate" indicates election instability
high-watermark kafka.server:type=raft-metrics,name=high-watermark Highest committed offset in the metadata log. Primary reference for consensus progress
log-end-offset kafka.server:type=raft-metrics,name=log-end-offset Latest offset written to local disk. Lag vs. high-watermark reveals replication delay
commit-latency-avg kafka.server:type=raft-metrics,name=commit-latency-avg Average time to commit a metadata operation. Spikes indicate disk write bottlenecks on the controller
MetadataLoaderIdleRatio kafka.server:type=MetadataLoader,name=MetadataLoaderIdleRatio Fraction of time the metadata loader thread is waiting. Near 0.0 means thread saturation
metadata-apply-error-count kafka.server:type=broker-metadata-metrics,name=metadata-apply-error-count Failed metadata operations — must always be 0
EventQueueSize kafka.controller:type=ControllerEventManager,name=EventQueueSize Pending administrative tasks. Alert if > 100
TimedOutBrokerHeartbeatCount kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount Broker heartbeat timeouts. A growing rate indicates network saturation or broker GC issues

The broker registration state (kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X) has three states: Fenced (not participating in leadership), Controlled Shutdown (graceful exit), and Active (fully operational). Fenced brokers do not serve client traffic.

For manual diagnostics, use kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status to inspect active leader, current epoch, high watermark, and voter node IDs. This replaces zookeeper-shell.sh for quorum health checks.

Topic and partition health monitoring

Topic-level metrics reveal problems that broker-level aggregates hide.

Key metrics:

Metric JMX MBean path Notes
MessagesInPerSec kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=(...) Per-topic message ingestion rate
BytesInPerSec / BytesOutPerSec kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=(...) Per-topic byte throughput
UnderReplicatedPartitions per topic kafka.cluster:type=Partition,topic=(...),name=UnderReplicated,partition=(...) Partition-level under-replication
LogEndOffset kafka.log:type=Log,name=LogEndOffset,topic=(...),partition=(...) Latest written offset per partition

Partition imbalance is detected by comparing LeaderCount across brokers. In a healthy cluster, partition leadership is distributed proportionally. If one broker holds 60% of leaders while others hold 20% each, read and write traffic is concentrated on one node, driving up its request handler utilisation while others sit idle.

A spike in MessagesInPerSec that is not accompanied by a proportional increase in BytesInPerSec can indicate a change in average message size, which can affect whether messages are classified as humongous objects in the G1GC allocator.

Consumer lag monitoring

Consumer lag is the most operationally significant Kafka metric for most teams. It is the direct measurement of how far behind downstream processing is from real-time.

The formula is simple:

Consumer lag = Log End Offset (partition) - Consumer Committed Offset (partition)

The complexity lies in what that number means. A consumer group may be processing 9,000 messages per second while an upstream producer pushes 10,000. The consumer is running. Throughput dashboards show it as healthy. But it is accumulating 1,000 messages of lag every second.

Consumer lag patterns

Lag does not accumulate uniformly. Different accumulation patterns indicate different root causes:

Pattern Behaviour Typical cause
Gradual drift Slow linear climb over hours Minor throughput mismatch; application logic slower than required
Sudden spike Near-vertical jump Consumer crash, network partition, upstream traffic burst
Flat high lag Large stable gap, neither growing nor shrinking Consumer at capacity, capped by partition count or resource limits
Sawtooth Cyclical rise and fall Batch processing, JVM GC pauses, database commit throttling

What causes consumer lag in practice

Slow processing inside the poll loop. Synchronous database writes, blocking HTTP calls, and CPU-bound transformations inside the record processing loop are the most frequent cause. Each blocked thread cannot call poll(), causing the broker to eventually view the consumer as dead and trigger a rebalance.

JVM GC pauses. A consumer GC pause exceeding session.timeout.ms (default 45 seconds) causes the Group Coordinator to evict the consumer and trigger a full group rebalance. All consumers pause during rebalance. The recovered consumer then faces a large backlog, which can exceed max.poll.interval.ms if processing is slow, triggering another rebalance.

Poison pill records. A single malformed record that a consumer cannot process without throwing an exception will cause that consumer to crash-restart in an infinite loop against the same offset. Lag accumulates linearly on that partition while the consumer cycles. Every record queued behind the bad offset is blocked.

Partition hot-spotting. When a business key is highly concentrated (for example, a major tenant ID), one partition receives a disproportionate share of traffic. One consumer thread is at 100% CPU while the rest are idle. Lag accumulates on the hot partition while aggregate lag metrics look manageable.

Transactional producer stalls. Consumers in read_committed isolation mode will not receive messages from open, uncommitted transactions. If a transactional producer crashes with an open transaction, downstream consumers stall until the transaction coordinator times out the transaction, which defaults to transaction.timeout.ms = 60000 (60 seconds).

Consumer lag monitoring approach

Use time-based thresholds where possible, with different SLOs per topic type:

Fraud detection:       Warning at 10s lag  |  Critical at 30s
Payment processor:     Warning at 30s lag  |  Critical at 2m
Analytics ETL:         Warning at 10m lag  |  Critical at 30m

For offset-based thresholds where time-based metrics are unavailable:

Critical threshold = target_SLO_seconds × consumption_rate_per_second × (1 + safety_margin)

For a payment processor consuming 100 messages/second with a 120-second SLO and a 20% safety margin: 120 × 100 × 1.2 = 14,400 messages.

Alert on the rate of change of lag, not just its current value. A lag that is 5,000 messages and growing at 1,000 messages per minute needs attention before it crosses any absolute threshold.

Producer monitoring

Kafka producers expose client-side metrics under kafka.producer:type=producer-metrics,client-id=(...).

Key producer metrics:

Metric JMX attribute What to watch
record-send-rate record-send-rate Average records sent per second; a drop indicates backpressure
record-error-rate record-error-rate Any sustained non-zero value requires investigation
request-latency-avg request-latency-avg Warning at > 100ms avg; critical at > 500ms P99
buffer-available-bytes buffer-available-bytes Alert when approaching zero — producer is blocked
waiting-threads waiting-threads Number of application threads blocked on buffer exhaustion
produce-throttle-time-max produce-throttle-time-max Non-zero means the broker is rate-limiting this producer

The failure cascade for a saturated producer works as follows: the broker enforces a quota and delays its response to the producer. The producer retains batches longer, buffer-available-bytes falls. Once the buffer is exhausted, application threads calling send() block, waiting-threads spikes, and the application is effectively paused.

A sudden rise in record-error-rate can indicate network problems, broker-side disk issues, ACL misconfigurations, or the producer hitting max.message.bytes. Check the corresponding broker metric BytesRejectedPerSec (kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec) to determine if the broker is actively rejecting oversized batches.

Replication and data durability monitoring

Replication metrics are the most critical class of Kafka metrics. They describe whether your data is actually protected.

The key metrics:

Metric JMX MBean path What a non-zero value means
UnderReplicatedPartitions kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions ISR is smaller than replication factor — durability is degraded
UnderMinIsrPartitionCount kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount ISR has dropped below min.insync.replicas — acks=all writes are failing
IsrShrinksPerSec kafka.server:type=ReplicaManager,name=IsrShrinksPerSec Replicas are being dropped from ISR; investigate immediately
IsrExpandsPerSec kafka.server:type=ReplicaManager,name=IsrExpandsPerSec Replicas are recovering; should follow a shrink event
UncleanLeaderElectionsPerSec kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec An out-of-sync replica was elected leader — data has been lost

UncleanLeaderElectionsPerSec > 0 indicates a data loss event. An out-of-sync replica became the partition leader, which means any records that were committed to the previous leader's log but not yet replicated to this replica are permanently gone. This is not a warning state; it is an incident that requires forensic investigation.

The relationship between replication.factor, min.insync.replicas, and ISR count determines your durability envelope. With a 3-replica topic and min.insync.replicas=2:

  • ISR = 3: full durability, normal operation
  • ISR = 2: degraded redundancy, writes succeed, one more failure loses data
  • ISR = 1: acks=all writes fail; the cluster is protecting you from data loss by refusing writes

Monitor IsrShrinksPerSec as a leading indicator before checking UnderReplicatedPartitions. The shrink rate tells you the cluster is in the process of degrading; the partition count tells you how far it has degraded.

Performance and throughput monitoring

Throughput metrics reveal whether the cluster is approaching a resource ceiling, and which resource is the binding constraint.

Throughput metrics:

Metric JMX MBean path
BytesInPerSec kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
BytesOutPerSec kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
MessagesInPerSec kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
TotalTimeMs (Produce P99) kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
TotalTimeMs (FetchConsumer P99) kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer

To identify whether a throughput ceiling is a network, disk, or CPU constraint:

  • Network saturation: BytesInPerSec + BytesOutPerSec approaching NIC bandwidth, combined with a drop in NetworkProcessorAvgIdlePercent
  • Disk saturation: RequestHandlerAvgIdlePercent dropping while LocalTimeMs increases; elevated LogFlushRateAndTimeMs
  • CPU saturation: Host CPU above 70% sustained, often correlated with message format conversions — check MessageConversionsTimeMs

Note that BytesOutPerSec is typically two to three times BytesInPerSec in a cluster with a replication factor of 3, because follower fetch traffic is counted as outbound. A sudden rise in BytesOutPerSec without a corresponding rise in BytesInPerSec can indicate a catch-up read: a consumer group that fell behind is now reading cold log segments from disk, which also degrades page cache for all other consumers on that broker.

Storage and retention monitoring

Kafka fills disks faster than most teams expect. The calculation is: data_rate × replication_factor × retention_period. A 1 GB/s ingest rate with replication factor 3 and 7-day retention requires 1.8 TB of raw disk per broker at minimum, before accounting for segment overhead.

Storage metrics to track:

Metric Notes
Disk bytes used (OS metric) Alert at > 70% capacity; critical at > 85%
kafka.log:type=Log,name=Size Log size per topic-partition
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs Log flush P99 latency — warning at > 500ms
kafka.log:type=Log,name=LogEndOffset Track for unexpected jumps or stalls

Log compaction can fall behind write rate in high-throughput compacted topics, causing the log to grow beyond what compaction alone would maintain. The leading indicator is a growing LogEndOffset on compacted topics without corresponding LogStartOffset advancement. Compaction threads are configured via log.cleaner.threads; if the disk is I/O bound, adding cleaner threads will not help — you need faster storage.

Retention policy misconfiguration is a common source of disk exhaustion. Time-based retention (log.retention.hours) and size-based retention (log.retention.bytes) interact: Kafka enforces whichever threshold is hit first. A topic configured for 7-day retention with no size limit on a high-throughput cluster can consume terabytes before the time window triggers.

Infrastructure and OS metrics monitoring

Kafka is sensitive to the underlying host in ways that most applications are not.

Disk I/O. Kafka is a disk-bound workload. It relies on the OS page cache for low-latency reads and sequential writes. Monitor iowait and disk read/write throughput at the OS level. Elevated iowait correlates directly with increased LocalTimeMs in broker request latency.

Network. With a replication factor of 3, every byte written to a broker is retransmitted twice to followers. Monitor both inbound and outbound interface utilisation. On cloud instances, also watch for traffic-shaping events (AWS CloudWatch TrafficShaping metric): cloud providers will silently throttle network traffic when a burstable instance hits its baseline, causing latency spikes that are difficult to attribute without this metric.

Page cache. The OS page cache is Kafka's primary read acceleration mechanism. Consumer groups that are actively tailing the log read from page cache, not from disk. A consumer group that falls significantly behind forces the broker into cold disk reads, which evict hot data from the page cache and degrade all other consumers on that broker. Monitor OS-level memory pressure: if the page cache is under pressure, iowait and request latency will climb together.

File descriptors. Kafka opens one file descriptor per log segment. In a cluster with many topics and partitions, file descriptor exhaustion is a real failure mode. Set nofile limits to at least 100,000 per process and monitor current FD usage.

Memory swappiness. Set vm.swappiness = 1 to keep JVM heap pages in physical RAM. Swapping JVM pages to disk during garbage collection cycles introduces severe disk I/O latency, compounding pause times.

Kafka Connect monitoring

Kafka Connect is a separate JVM process with its own monitoring surface.

Connector and task state. The primary health signal for Connect is the connector and task status. A connector can be in RUNNING, PAUSED, or FAILED state. Tasks have the same states, plus UNASSIGNED. A FAILED task that does not recover automatically requires investigation of the Connect worker log.

Key Connect metrics:

Metric JMX MBean path Notes
connector-status kafka.connect:type=connector-metrics,connector=(...) Check for FAILED state
task-status kafka.connect:type=connector-task-metrics,connector=(...),task=(...) Track FAILED and UNASSIGNED tasks
connector-total-record-errors kafka.connect:type=connector-metrics,connector=(...) Error count per connector
source-record-write-rate kafka.connect:type=source-task-metrics,connector=(...),task=(...) Write rate to Kafka topics
sink-record-send-rate kafka.connect:type=sink-task-metrics,connector=(...),task=(...) Delivery rate to downstream system
put-batch-max-time-ms kafka.connect:type=sink-task-metrics,connector=(...),task=(...) P99 write latency to the sink

For source connectors, monitor the lag between the source system and the Kafka topic. For database CDC connectors, this is typically the replication slot lag in the source database.

For sink connectors, monitor the offset commit latency (offset-commit-success-percentage) and any increase in task restart counts. A task that restarts repeatedly is often encountering a poison-pill record in the topic, a downstream write failure that is not being handled by the dead-letter queue configuration, or a connection timeout to the downstream system.

Error records delivered to dead-letter queues should also be monitored. A growing DLQ depth indicates that the connector is encountering records it cannot process, which may signal a schema change, a data quality issue, or a configuration mismatch.

Kafka Streams monitoring

Kafka Streams applications expose metrics under kafka.streams MBeans.

Thread and task state:

Metric JMX MBean path What to watch
thread-state kafka.streams:type=stream-thread-metrics,thread-id=(...) Alert if any thread is in DEAD or PARTITIONS_REVOKED state
process-rate kafka.streams:type=stream-thread-metrics,thread-id=(...) Processing rate; a drop indicates a downstream bottleneck
commit-latency-avg kafka.streams:type=stream-thread-metrics,thread-id=(...) Latency to commit processed records; a spike often indicates state store pressure

For stateful topologies with RocksDB state stores:

Metric Notes
rocksdb-total-block-cache-usage Track memory usage; exhaustion causes state store reads to hit disk
rocksdb-estimate-num-keys Useful for capacity planning and detecting key accumulation
rocksdb-compaction-pending Pending compaction indicates write pressure on the state store

The most common operational issue in Kafka Streams is excessive state store size causing the application to lag behind its input topic. Monitor the processing rate (process-rate) against the input topic's MessagesInPerSec. If the Streams application is consistently slower than its input, the topology needs more parallelism, or the state store operations need to be profiled.

Application and business-level monitoring

Infrastructure metrics tell you that the cluster is running. Business-level metrics tell you whether it is doing what you need it to do.

For most teams, the most useful business-level signal derived from Kafka metrics is end-to-end latency: the time between a record being written by a producer and that record being processed by a consumer. This is consumer lag expressed as wall-clock time rather than offset distance.

How to build this:

  1. Produce a timestamp in the record metadata (or use the Kafka producer timestamp if reliable)
  2. At the consumer, calculate now - record.timestamp for each processed record
  3. Track this as a histogram; alert on P95 and P99, not just the mean

Concrete examples of business-level Kafka observability:

  • An order processing pipeline: lag in the order-confirmed topic expressed as seconds of latency between order placement and confirmation email dispatch
  • A fraud detection pipeline: lag in the transactions topic as a direct multiplier on fraud window size — a 5-minute lag means 5 additional minutes of unchecked transactions
  • A real-time dashboard: lag in the events topic as the staleness of displayed data

Building a "Kafka health to business outcome" dashboard that shows consumer lag alongside the downstream metric it affects gives on-call engineers the context to treat lag alerts with appropriate urgency. When it is clear that 2 minutes of lag on the payment-events topic means 2 minutes of delayed fraud detection, the alert becomes easier to prioritise correctly.

Alerting and incident detection

The following table covers the minimum required alert set for a production Kafka cluster. Use this as a starting point and adjust thresholds based on your SLOs.

Metric Condition Severity Rationale
OfflinePartitionsCount > 0 Page Partitions are completely unreachable; client reads and writes fail immediately
ActiveControllerCount (cluster sum) Not equal to 1 for 5 minutes Page Split-brain or total controller loss; metadata operations halted
UnderReplicatedPartitions > 0 sustained for 10+ minutes Page Durability degraded; one more failure risks data loss
UncleanLeaderElectionsPerSec > 0 Page Data loss has occurred; investigation required
UnderMinIsrPartitionCount > 0 Page Producers with acks=all are failing; client-visible write errors
IsrShrinksPerSec Sustained positive rate Warning Leading indicator of under-replication; investigate cause
RequestHandlerAvgIdlePercent < 0.3 (30%) sustained Warning Request queue building up; latency degradation imminent
RequestHandlerAvgIdlePercent < 0.2 (20%) Page Severe I/O saturation; client timeouts likely
NetworkProcessorAvgIdlePercent < 0.3 (30%) Warning Network thread saturation; connection drops possible
GC pause P99 > 200ms Warning Risk of ISR shrink on affected broker
GC pause P99 > 500ms Page ISR shrink likely; acks=all writes at risk
JVM heap post-GC > 70% of max heap Warning Increasing GC pressure; Full GC risk
JVM heap post-GC > 85% of max heap Page Imminent OutOfMemoryError or stop-the-world Full GC
Disk capacity > 70% Warning Review retention configuration; add capacity
Disk capacity > 85% Page Disk exhaustion will crash the broker
Consumer lag rate of change Growing trend for 5+ minutes Warning Consumer falling behind; investigate root cause
Consumer time-lag Exceeds per-topic SLO threshold Page Downstream SLA breach in progress
metadata-apply-error-count (KRaft) > 0 Page Metadata corruption on broker; restart required
EventQueueSize (KRaft) > 100 Warning Controller overloaded with administrative tasks

Alert philosophy notes:

  • Alert on OfflinePartitions and UncleanLeaderElections unconditionally — there is no benign explanation for these conditions in a healthy cluster.
  • Alert on UnderReplicatedPartitions only after a sustained window (5-10 minutes) to avoid false positives during normal broker restarts and rolling updates.
  • Avoid alerting on consumer lag using a raw offset count. Express it relative to throughput (time-based) or use a trend-based approach like Burrow.
  • IsrShrinksPerSec is a leading indicator; UnderReplicatedPartitions is a lagging one. Alert on both, at different severities.

Capacity planning and forecasting

Capacity planning for Kafka requires tracking how quickly the cluster is consuming its three primary resources: network, disk, and broker count.

Disk capacity projection. The formula for disk required per broker is:

(ingestion_rate × replication_factor × retention_period) / broker_count

At 500 MB/s ingestion with replication factor 3 and 7-day retention across 6 brokers: 500 MB/s × 3 × 604,800s / 6 = ~150 TB raw storage per broker.

Track the rate of disk growth via BytesInPerSec and project forward at current growth rate. Alert when you have less than 30 days of headroom at current trajectory.

Partition count and broker density. In KRaft mode, the historical ZooKeeper-era limit of roughly 4,000 partitions per broker and 200,000 partitions per cluster no longer applies. KRaft's Raft-based metadata management supports millions of partitions. However, each partition still consumes file descriptors, open segment handles, and memory for replication tracking. A practical upper bound of 10,000-20,000 partitions per broker remains sensible for most workloads.

Rebalancing cost. When adding brokers to expand capacity, Kafka does not automatically redistribute partitions. Use Cruise Control or kafka-reassign-partitions.sh to generate and execute reassignment plans. Always throttle reassignment throughput using --throttle <bytes_per_second> to prevent replication traffic from saturating broker network interfaces during business hours.

Kafka monitoring tools

The choice of monitoring tool affects both what you get out of the box and how much engineering is required to reach full coverage.

Tool Ingestion method Consumer lag Dashboard setup Broker footprint
Prometheus + Grafana Pull via JMX Exporter sidecar agent Requires separate kafka-exporter deployment Manual JSON dashboard import Low
Datadog Pull via JMXFetch daemon (350 metric limit) Native via Data Streams Monitoring Pre-packaged SaaS dashboards Moderate
Dynatrace Process-bound OneAgent injection Native via OneAgent introspection Out-of-the-box Kafka Overview dashboard Low
Confluent Control Center Stream-based; requires Metrics Reporter JAR on every broker Native in Normal mode Proprietary; no external integrations Very high (8-16 GB heap)
Kpow Native AdminClient API polling — no JMX required Native via AdminClient offset polling Pre-packaged native UI Near-zero (no broker-side agent)

What requires custom configuration in each tool:

  • Prometheus/Grafana: Consumer lag (separate kafka-exporter), custom MBean patterns in the kafka-2_0_0.yml YAML rule file, Grafana dashboard variable mapping. JMX histogram metrics are converted to static gauges, losing cumulative sum data needed for accurate rate calculations.
  • Datadog: JMX-enabled agent image (-jmx tag), pod annotations for Autodiscovery, custom metrics.yaml filter to stay within the 350-metric-per-instance limit. AWS MSK clusters require a separate CloudWatch integration — the standard Kafka agent check cannot query MSK directly.
  • Dynatrace: Topic name filters to prevent metric cardinality explosion (and the associated Davis Data Unit billing overage), and a custom plugin.json schema for non-standard MBeans. The Davis AI anomaly detection engine baselines normal behaviour automatically, which is particularly valuable for consumer lag patterns that vary with traffic shape.
  • Confluent Control Center: Proprietary Metrics Reporter JAR on every broker classpath, rolling cluster restart to activate. Legacy versions (CP 7.x) use client-side monitoring interceptors that create roughly 50 internal topics and significant metadata overhead. CP 8.0 migrates to Prometheus-based pull collection, requiring a migration if upgrading from the interceptor model. Not compatible with non-Confluent distributions (OSS Kafka, MSK, Aiven).
  • Kpow: No JMX or broker-side agent. Kpow queries the AdminClient API directly, eliminating JMX serialisation overhead on the broker. Streams topology visualisation requires embedding the kpow-streams-agent dependency in the application code. Prometheus egress requires enabling PROMETHEUS_EGRESS=true and configuring basic authentication.

For organisations that need to minimise broker resource overhead while maintaining full observability, running Prometheus paired with Kpow is a practical architecture: Kpow's metadata engine eliminates JMX sidecars on the brokers, and its Prometheus egress endpoint feeds clean telemetry into a centralised Prometheus TSDB. You can try Kpow free for 30 days — it connects to any Kafka cluster in minutes and can be deployed via Docker, Helm, or JAR.

Summary

Kafka monitoring is not a single dashboard. It is a set of overlapping layers: broker internals, replication state, JVM health, consumer lag, producer behaviour, storage capacity, and OS resources. Each layer has distinct failure modes, and a blind spot in any one of them can become a production incident.

The minimum viable monitoring set for any production cluster:

  1. Alert immediately on OfflinePartitionsCount > 0 and UncleanLeaderElectionsPerSec > 0
  2. Alert after a sustained window on UnderReplicatedPartitions > 0
  3. Track IsrShrinksPerSec as a leading indicator of durability degradation
  4. Use time-based consumer lag thresholds, not raw offset counts
  5. Monitor JVM GC pause duration as a direct predictor of ISR instability
  6. Watch RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent for broker saturation
  7. For KRaft clusters, add quorum health metrics from kafka.server:type=raft-metrics

The rest — per-topic metrics, Connect and Streams coverage, business-level lag thresholds — is built on top of this foundation.