Abstract digital artwork featuring smooth, overlapping curved shapes in shades of green and blue on a black background.

A detailed guide to Kafka producer monitoring

Table of contents

Factor House
June 3rd, 2026
xx min read

Key takeaways

  • The Kafka Java producer client exposes metrics through JMX under the kafka.producer:type=producer-metrics MBean, covering throughput, latency, buffering, error rates, and connection state.
  • record-error-rate and bufferpool-wait-time are the two most important metrics to alert on. The first signals delivery failures; the second signals back-pressure that can stall application threads.
  • Producer metrics should be correlated with broker-side metrics. Elevated request-latency-avg on the client often originates from broker disk saturation, replication lag, or quota throttling rather than from the producer itself.
  • Configuration choices — acks, linger.ms, batch.size, max.in.flight.requests.per.connection, and compression.type — directly affect observable metrics and should inform how you interpret them.
  • This article covers the Apache Kafka Java producer client, targeting Kafka 3.x (with notes on 2.x where relevant). Clients based on librdkafka, used by Python, Go, and C producers, expose a different metric surface and are not covered here.

What is Kafka producer monitoring?

Kafka producer monitoring is the practice of observing the runtime state of a Kafka producer client: how many records it is sending, how quickly, whether any are failing, and whether back-pressure from the broker is affecting the application.

The Apache Kafka Java producer client uses an asynchronous two-thread architecture. When your application calls producer.send(), records are serialized and written to the RecordAccumulator, a memory buffer that groups records into batches by partition. A background Sender thread then drains those batches over persistent TCP connections to the appropriate broker leaders.

This design decouples your application thread from network I/O, but it also means failures and bottlenecks can be indirect. A broker that is slow to acknowledge writes does not immediately cause send() to fail. Instead, batches accumulate in the buffer until buffer.memory is exhausted, at which point your application thread blocks. Understanding this cascade is central to interpreting producer metrics correctly.

Key producer metrics to monitor

All Java producer metrics are exposed through JMX. Producer-level metrics use the MBean pattern:

kafka.producer:type=producer-metrics,client-id={client-id}

Topic-scoped metrics are available separately under:

kafka.producer:type=producer-topic-metrics,client-id={client-id},topic={topic}

The metric names below apply to Kafka 3.x. Most were introduced in Kafka 2.x and are consistent across both versions.

Throughput metrics

Metric JMX attribute Description Why it matters
`record-send-rate` `record-send-rate` Records sent per second to a specific topic Baseline for producer health; a drop to zero on a running producer indicates a stall
`byte-rate` `byte-rate` Bytes sent per second to a specific topic Measures network load; correlate with broker `BytesInPerSec` to confirm delivery

Both of these are topic-scoped metrics available under the producer-topic-metrics MBean and require a topic label in your query.

Latency metrics

Metric JMX attribute Description Why it matters
`request-latency-avg` `request-latency-avg` Average round-trip time in ms for produce requests Rising values indicate broker degradation, replication lag, or network issues
`request-latency-max` `request-latency-max` Maximum observed produce request latency in ms Useful for detecting outlier latency, particularly under `acks=all`
`record-queue-time-avg` `record-queue-time-avg` Average time in ms that a batch spends in the RecordAccumulator before being sent Indicates whether the Sender thread is keeping up with the application; target below 50ms under normal conditions

Error and retry metrics

Metric JMX attribute Description Why it matters
`record-error-rate` `record-error-rate` Failed record sends per second Should be 0.0 at all times; any sustained non-zero value means records are being dropped
`record-retry-rate` `record-retry-rate` Record send retries per second Brief spikes are expected during leader elections; sustained elevation suggests persistent instability

Batching efficiency metrics

Metric JMX attribute Description Why it matters
`batch-size-avg` `batch-size-avg` Average batch size in bytes at the time of send Compare to your `batch.size` config; significantly lower values suggest the producer is flushing batches too early
`records-per-request-avg` `records-per-request-avg` Average number of records per produce request High values indicate efficient batching; low values under heavy load suggest `linger.ms` is too short
`compression-rate-avg` `compression-rate-avg` Average compression ratio across sent batches Lower values mean better compression; useful for comparing before and after a `compression.type` change

Buffer and back-pressure metrics

Metric JMX attribute Description Why it matters
`buffer-available-bytes` `buffer-available-bytes` Remaining send buffer capacity in bytes Should stay close to the configured `buffer.memory`; dropping toward zero means the producer is accumulating faster than the Sender thread can drain
`bufferpool-wait-time` `bufferpool-wait-time` Fraction of time the appender thread is waiting for buffer space Target: 0.0; any sustained non-zero value indicates back-pressure; values above 0.10 warrant investigation
`waiting-threads` `waiting-threads` Number of application threads currently blocked in `send()` Target: 0; any non-zero value means application threads are stalling, which can cascade into upstream service timeouts

Connection metrics

Metric JMX attribute Description Why it matters
`connection-count` `connection-count` Number of active TCP connections to brokers Should equal the number of brokers the producer is actively writing to
`io-wait-time-ns-avg` `io-wait-time-ns-avg` Average nanoseconds the Sender thread spent waiting for ready sockets A very high value may indicate low throughput; a near-zero value means the thread is continuously busy
`connection-creation-rate` `connection-creation-rate` New TCP connections established per second Elevated rates suggest connections are being dropped and re-established, typically due to network instability
`connection-close-rate` `connection-close-rate` Sockets closed per second Elevated alongside `connection-creation-rate` indicates connection flapping

Producer configuration effects on metrics

The configuration you pass to the producer at startup has a direct and measurable effect on the metrics you observe at runtime. Understanding these relationships helps you interpret metric values correctly and tune configuration to meet your performance objectives.

acks setting

The acks parameter controls how many broker acknowledgments the producer waits for before considering a record successfully delivered.

  • acks=0: the producer does not wait for any acknowledgment. request-latency-avg will be minimal, but record-error-rate will not reflect broker-side failures — records can be lost silently without the client being aware.
  • acks=1: the leader broker acknowledges after writing to its local log. This produces moderate request-latency-avg values and captures leader-side failures.
  • acks=all (or acks=-1): the leader waits for all in-sync replicas to acknowledge before responding. This produces the highest request-latency-avg values — typically two to five times higher than acks=1 in a healthy cluster — but provides the strongest durability guarantee.

Under acks=all, any replication lag on follower replicas will appear directly in request-latency-avg. Correlate elevated latency values with the broker-side PurgatorySize metric to determine whether the delay originates from the replication layer.

Idempotence and transactions

Setting enable.idempotence=true provides exactly-once delivery semantics per partition. The broker assigns each producer a producer ID and tracks sequence numbers per partition, discarding any duplicates that result from retries.

Idempotent producers tend to show higher baseline record-retry-rate values than non-idempotent producers, because the client retries aggressively on transient failures. This is expected behavior; the broker deduplicates on its side.

When enable.idempotence=true is configured, max.in.flight.requests.per.connection must be 5 or lower. Exceeding this limit prevents the broker from guaranteeing ordering of sequence numbers across concurrent in-flight batches, which can produce OutOfOrderSequenceException errors that surface as elevated record-error-rate.

For transactional producers, two additional metrics are worth monitoring: flush-time-ns-total, which records cumulative nanoseconds spent blocked inside .flush() during transactional boundary commits, and txn-send-offsets-time-ns-total, which tracks time spent publishing offset maps within transactions.

linger.ms and batch.size

linger.ms controls how long the Sender thread waits before dispatching an incomplete batch. batch.size sets the maximum batch size in bytes.

A higher linger.ms value — for example, 50 to 100ms — allows more records to accumulate per batch. You will observe higher batch-size-avg and records-per-request-avg, and lower network overhead per record. The trade-off is increased record-queue-time-avg, since records wait longer in the accumulator before being sent.

A lower linger.ms value — 0 to 5ms — minimises record-queue-time-avg, which is appropriate for latency-sensitive workloads, but typically produces smaller batches and lower throughput per network request.

If batch-size-avg is consistently close to your configured batch.size, the producer is filling batches before the linger timer fires. This is a sign of a high-throughput workload where linger.ms has less influence on batch efficiency.

max.in.flight.requests.per.connection

This setting controls how many unacknowledged produce requests the Sender thread can pipeline on a single broker connection before blocking.

Higher values increase throughput by allowing more requests in flight simultaneously, but they introduce ordering risks. If a broker returns an error for request N and the producer retries, request N+1 may arrive before the retry completes, resulting in out-of-order delivery on that partition.

For idempotent producers, the broker's sequence number tracking handles this risk — but only up to 5 concurrent in-flight requests per connection. Beyond that, idempotence guarantees do not hold.

Monitor requests-in-flight alongside record-retry-rate. If retries are frequent and requests-in-flight is consistently at the configured maximum, reducing the in-flight limit may improve delivery stability at some cost to throughput.

compression.type

Compression is applied per batch before the batch is dispatched by the Sender thread. The available options are none, gzip, snappy, lz4, and zstd.

compression-rate-avg reflects the effectiveness of the chosen algorithm: a value of 0.4 means the compressed batch is 40% of its original size. Lower values indicate better compression, which reduces network bandwidth and broker disk usage.

The CPU cost varies significantly by algorithm. lz4 and snappy offer fast compression with moderate ratios, which suits latency-sensitive workloads. zstd and gzip achieve better compression ratios but are more CPU-intensive. If you observe higher request-latency-avg after enabling compression, check producer host CPU utilisation — the Sender thread may be CPU-bound.

Compression shifts CPU work from the broker to the producer at write time, and to the consumer at read time. Weigh the network and storage savings against the CPU cost on both sides.

buffer.memory and max.block.ms

buffer.memory sets the total size of the RecordAccumulator pool in bytes (default: 32 MB). max.block.ms sets how long an application thread will block in send() waiting for buffer space before throwing a TimeoutException.

When broker throughput drops — due to disk saturation, replication lag, or quota throttling — the Sender thread drains batches more slowly. Batches accumulate in the accumulator, buffer-available-bytes drops toward zero, and application threads begin to block. bufferpool-wait-time rises above zero. If a thread remains blocked for longer than max.block.ms, the call fails with a TimeoutException.

waiting-threads rising above zero is always a signal that the producer is experiencing back-pressure. Increasing buffer.memory gives more headroom before application threads block, but does not address the underlying cause of the back-pressure.

Collecting producer metrics

JMX collection from the producer process

The Kafka Java producer client registers JMX MBeans with the platform MBean server of its JVM automatically. No additional configuration is required for local JMX access.

To enable remote JMX access for external tooling, pass the following JVM flags when starting your producer application:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

For production environments, replace the authenticate=false and ssl=false flags with appropriate authentication and TLS configuration.

Once the JMX port is open, you can connect using tools like jconsole, jmxterm, or any JMX-compatible client to inspect MBeans under kafka.producer:type=producer-metrics,client-id={client-id}.

Prometheus JMX exporter

The most common production approach is to run the Prometheus JMX Exporter as a Java agent inside the producer JVM. It polls the local MBean server and exposes the metrics as a Prometheus-compatible HTTP endpoint, without requiring remote JMX access.

Add the agent to your JVM startup arguments:

export JAVA_OPTS="-javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent-0.20.0.jar=9404:/etc/prometheus/kafka-producer-jmx-rules.yml"

A rules configuration file that captures both producer-level and topic-level metrics:

lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
 # Producer-level metrics (throughput, latency, buffer, errors)
 - pattern: 'kafka.producer<type=producer-metrics, client-id=(.+)><>([^:]+):'
   name: kafka_producer_$2
   type: GAUGE
   labels:
     client_id: "$1"

 # Topic-level metrics (per-topic record-send-rate, byte-rate)
 - pattern: 'kafka.producer<type=producer-topic-metrics, client-id=(.+), topic=(.+)><>([^:]+):'
   name: kafka_producer_topic_$3
   type: GAUGE
   labels:
     client_id: "$1"
     topic: "$2"

This configuration exposes metrics like kafka_producer_record_error_rate{client_id="orders-service"} and kafka_producer_topic_record_send_rate{client_id="orders-service",topic="orders"} at http://localhost:9404/metrics.

In Strimzi-managed Kubernetes deployments, you can keep the JMX port (typically 9999) open for tooling while dedicating port 9404 to Prometheus scraping.

Multi-producer aggregation

When running multiple producer instances — across replicated services or multiple application pods — you need a strategy for identifying individual producers in your metrics pipeline.

Each producer instance is identified by its client.id configuration. Set this to something that includes both the service name and the instance identifier, for example orders-service-pod-0. This value becomes the client_id label in Prometheus, allowing you to filter and aggregate metrics per producer.

If client.id is left at its default value of producer-1, all instances in a fleet will appear identical in your metrics store. When a single instance begins experiencing elevated record-error-rate or bufferpool-wait-time, you will have no way to isolate it.

Tag producer metrics with at least:

  • client_id: the producer instance identifier
  • topic: the target topic (available on topic-scoped metrics)
  • Any additional labels your platform uses for environment, region, or service ownership

Micrometer and OpenTelemetry integration

Micrometer

If you are using Spring Kafka or another framework that integrates with Micrometer, the KafkaClientMetrics binder forwards JMX metrics to any Micrometer-supported registry. Metric names follow the same naming convention as the JMX attributes but are prefixed with kafka.producer. and use dots rather than underscores — for example, kafka.producer.record-error-rate instead of kafka_producer_record_error_rate.

OpenTelemetry

The OpenTelemetry JMX Metric Gatherer supports Kafka producers natively through its kafka-producer target system. The gatherer runs as a sidecar or separate process, queries the producer JVM over JMX/RMI, and pushes metrics to an OpenTelemetry Collector:

receivers:
 jmx/producer:
   jar_path: /opt/opentelemetry-jmx-metrics.jar
   endpoint: localhost:9999
   target_system: kafka-producer
   collection_interval: 10s

exporters:
 otlp:
   endpoint: otel-collector.monitoring.svc.cluster.local:4317
   tls:
     insecure: true

service:
 pipelines:
   metrics:
     receivers: [jmx/producer]
     exporters: [otlp]

The OTel target system exposes metric names prefixed with kafka.producer. — for example, kafka.producer.record-retry-rate and kafka.producer.request-latency-avg.

Producer health check script

The script below provides a point-in-time diagnostic for a running Kafka producer. It queries a Prometheus JMX Exporter endpoint and evaluates the key indicators of producer health against fixed thresholds.

Companion scripts covering broker-side and consumer-side checks are described in the Kafka broker monitoring article and the Kafka consumer monitoring article. The three scripts share the same pattern: query the metrics endpoint, evaluate against thresholds, and exit with a status code compatible with standard monitoring integrations.

Prerequisites:

  • The Prometheus JMX Exporter is running on the producer JVM (default port: 9404)
  • curl and awk are available in the shell environment

Thresholds:

Metric Severity Threshold
`record-error-rate` Critical > 0.0 per second
`bufferpool-wait-time` Critical > 0.10 (10% of appender time)
`waiting-threads` Critical > 0
`record-queue-time-avg` Warning > 500ms
`request-latency-avg` Warning > 200ms
`record-retry-rate` Warning > 10 per second
#!/usr/bin/env bash
# kafka-producer-health-check.sh
#
# Checks the health of a running Kafka producer by querying its
# Prometheus JMX Exporter endpoint.
#
# Usage:
#   ./kafka-producer-health-check.sh [client_id] [metrics_port]
#
# Arguments:
#   client_id    (optional) Filter output to a specific producer client.id
#   metrics_port (optional) JMX Exporter HTTP port. Default: 9404
#
# Exit codes:
#   0 = OK
#   1 = WARNING
#   2 = CRITICAL
#   3 = UNKNOWN

CLIENT_ID="${1:-}"
METRICS_PORT="${2:-9404}"
METRICS_URL="http://localhost:${METRICS_PORT}/metrics"

# Thresholds
CRIT_RECORD_ERROR_RATE="0.0"
CRIT_BUFFERPOOL_WAIT="0.10"
CRIT_WAITING_THREADS="0"
WARN_QUEUE_TIME_MS="500"
WARN_REQUEST_LATENCY_MS="200"
WARN_RETRY_RATE="10.0"

EXIT_CODE=0

fetch_metric() {
  local name="$1"
  local label_filter="$2"
  if [ -n "$label_filter" ]; then
    curl -sf "$METRICS_URL" | grep "^${name}{" | grep "${label_filter}" | awk '{print $2}' | head -1
  else
    curl -sf "$METRICS_URL" | grep "^${name}" | awk '{print $2}' | head -1
  fi
}

check() {
  local label="$1"
  local value="$2"
  local threshold="$3"
  local op="$4"
  local severity="$5"

  if [ -z "$value" ]; then
    printf "%-10s %s (metric not found)\n" "UNKNOWN" "$label"
    [ "$EXIT_CODE" -lt 3 ] && EXIT_CODE=3
    return
  fi

  local triggered
  triggered=$(awk "BEGIN { print (${value} ${op} ${threshold}) ? 1 : 0 }")

  if [ "$triggered" -eq 1 ]; then
    printf "%-10s %s = %s  [threshold %s %s]\n" "$severity" "$label" "$value" "$op" "$threshold"
    [ "$severity" = "CRITICAL" ] && EXIT_CODE=2
    [ "$severity" = "WARNING"  ] && [ "$EXIT_CODE" -lt 2 ] && EXIT_CODE=1
  else
    printf "%-10s %s = %s\n" "OK" "$label" "$value"
  fi
}

LABEL_FILTER=""
[ -n "$CLIENT_ID" ] && LABEL_FILTER="client_id=\"${CLIENT_ID}\""

echo "Kafka producer health check"
echo "Endpoint:  ${METRICS_URL}"
echo "Client ID: ${CLIENT_ID:-all}"
echo "---"

check "record-error-rate" \
  "$(fetch_metric kafka_producer_record_error_rate    "$LABEL_FILTER")" \
  "$CRIT_RECORD_ERROR_RATE"   ">" "CRITICAL"

check "bufferpool-wait-time" \
  "$(fetch_metric kafka_producer_bufferpool_wait_time "$LABEL_FILTER")" \
  "$CRIT_BUFFERPOOL_WAIT"     ">" "CRITICAL"

check "waiting-threads" \
  "$(fetch_metric kafka_producer_waiting_threads      "$LABEL_FILTER")" \
  "$CRIT_WAITING_THREADS"     ">" "CRITICAL"

check "record-queue-time-avg (ms)" \
  "$(fetch_metric kafka_producer_record_queue_time_avg "$LABEL_FILTER")" \
  "$WARN_QUEUE_TIME_MS"       ">" "WARNING"

check "request-latency-avg (ms)" \
  "$(fetch_metric kafka_producer_request_latency_avg  "$LABEL_FILTER")" \
  "$WARN_REQUEST_LATENCY_MS"  ">" "WARNING"

check "record-retry-rate" \
  "$(fetch_metric kafka_producer_record_retry_rate    "$LABEL_FILTER")" \
  "$WARN_RETRY_RATE"          ">" "WARNING"

AVAIL=$(fetch_metric kafka_producer_buffer_available_bytes "$LABEL_FILTER")
TOTAL=$(fetch_metric kafka_producer_buffer_total_bytes     "$LABEL_FILTER")
if [ -n "$AVAIL" ] && [ -n "$TOTAL" ] && [ "$TOTAL" != "0" ]; then
  USED_PCT=$(awk "BEGIN { printf \"%.1f\", (1 - ${AVAIL} / ${TOTAL}) * 100 }")
  printf "%-10s %s = %s%%  (%s of %s bytes available)\n" \
    "INFO" "buffer-utilisation" "$USED_PCT" "$AVAIL" "$TOTAL"
fi

echo "---"
case "$EXIT_CODE" in
  0) echo "Result: OK"       ;;
  1) echo "Result: WARNING"  ;;
  2) echo "Result: CRITICAL" ;;
  3) echo "Result: UNKNOWN"  ;;
esac

exit "$EXIT_CODE"

Limitations: This script evaluates one point in time. It does not detect trends, multi-instance comparisons, or intermittent spikes. It also does not query broker-side metrics — a clean result here does not rule out a broker problem contributing to producer behaviour. Use it for incident investigation or pre-deployment checks, not as a substitute for continuous alerting.

Alerting strategy for Kafka producer monitoring

The table below describes a baseline alert set for Kafka producer monitoring. For each alert, the condition reflects sustained behaviour rather than instantaneous spikes, which reduces false positives during normal transient events like leader elections.

Alert Metric Condition Likely cause Suggested response
Delivery failures `record-error-rate` > 0 sustained for 2 minutes Broker unavailable, serialization error, authorization failure Check producer logs for exception type; verify broker health and topic ACLs
Producer back-pressure `bufferpool-wait-time` > 0.10 for 1 minute Broker too slow to accept records; send buffer exhausted Correlate with broker `TotalTimeMs` and `PurgatorySize`; consider temporarily increasing `buffer.memory` while diagnosing
Application threads blocked `waiting-threads` > 0 for 30 seconds Buffer memory depleted under sustained back-pressure Same as above; blocked application threads can cascade into upstream service timeouts
Rising request latency `request-latency-avg` > 200ms for 5 minutes Broker disk degradation, replication lag, or quota throttling Check broker `TotalTimeMs` component breakdown; inspect ISR count and `PurgatorySize`
Poor batching efficiency `batch-size-avg` Consistently below 25% of configured `batch.size` Producer sending too frequently relative to `linger.ms` Increase `linger.ms`; verify that the producer is handling meaningful load
Silent producer `record-send-rate` = 0 with producer process running Application-level stall, topic ACL block, or misconfiguration Check application logs; verify the topic exists and the producer has write permission

For use with Prometheus Alertmanager, the following rule definitions cover the most critical conditions:

groups:
 - name: kafka-producer-alerts
   rules:
     - alert: ProducerDeliveryFailures
       expr: rate(kafka_producer_record_error_total[5m]) > 0.1
       for: 2m
       labels:
         severity: critical
       annotations:
         summary: "Producer {{ $labels.client_id }} is dropping records"
         description: >
           The record error rate on client {{ $labels.client_id }} has exceeded 0.1 errors/second
           for 2 minutes. Records are failing to deliver and may be lost.

     - alert: ProducerBackPressure
       expr: rate(kafka_producer_bufferpool_wait_time_total[2m]) > 0.15
       for: 1m
       labels:
         severity: critical
       annotations:
         summary: "Producer {{ $labels.client_id }} buffer exhaustion"
         description: >
           Memory allocation wait rate for client {{ $labels.client_id }} has exceeded 15% over
           2 minutes. Application threads are blocking on .send() calls.

     - alert: ProducerQueueDelay
       expr: kafka_producer_record_queue_time_avg > 500
       for: 5m
       labels:
         severity: warning
       annotations:
         summary: "Producer {{ $labels.client_id }} accumulator queue delay"
         description: >
           Average record queue wait time has exceeded 500ms for 5 minutes. This may indicate
           broker degradation, network bottlenecks, or quota throttling.

     - alert: ProducerHighRetryRate
       expr: rate(kafka_producer_record_retry_total[5m]) > 10.0
       for: 3m
       labels:
         severity: warning
       annotations:
         summary: "High retry rate on producer {{ $labels.client_id }}"
         description: >
           The record retry rate is exceeding 10 retries/second for 3 minutes. This suggests
           transient network instability or a cluster rebalancing event.

The thresholds above are starting points rather than universal standards. A request-latency-avg of 200ms may be acceptable for a batch analytics pipeline but not for a payment processing service. Adjust thresholds to match your SLOs and your cluster's normal operating range.

Common Kafka producer issues and how to diagnose

Producer stalling due to full send buffer

Symptom: Application threads block on send(), eventually throwing TimeoutException after max.block.ms elapses.

Metric signals: bufferpool-wait-time approaching or above 0.10; buffer-available-bytes near zero; waiting-threads above 0; record-queue-time-avg elevated.

Diagnostic steps:

  1. Confirm that buffer-available-bytes is depleted and bufferpool-wait-time is non-zero. This pattern confirms back-pressure rather than a serialization issue — serialization failures drop records before they reach the buffer, leaving buffer-available-bytes near its maximum.
  2. Check broker TotalTimeMs and its component latencies. If the local log append component (t_local) is elevated, the broker disk may be saturated. If the replication wait component (t_remote) is elevated, there may be ISR lag.
  3. Check broker RequestQueueSize. A growing queue indicates that broker network threads are saturated and cannot process incoming requests fast enough.
  4. If the broker is healthy, review batch-size-avg relative to batch.size. An oversized batch.size can cause the buffer to fill faster than the Sender thread can drain it, particularly at low throughput where batches take longer to fill.

Retry storm from leader election

Symptom: A spike in record-retry-rate and request-latency-avg, typically coinciding with a broker restart or failure event.

Metric signals: record-retry-rate spikes; request-latency-avg increases; record-error-rate may briefly rise before returning to zero.

Cause: During a partition leader election, brokers return NotLeaderForPartitionException for produce requests targeting the old leader. The Kafka client retries these requests automatically once metadata is refreshed and a new leader is elected.

Brief retry spikes during failover are expected and do not indicate data loss, as long as record-error-rate returns to zero after the election completes and record-retry-rate subsides. If retries persist, check that retries and retry.backoff.ms are set appropriately for your cluster's typical election duration, and verify that the affected topic's ISR is healthy.

Idempotent producer failure under high in-flight requests

Symptom: The broker returns OutOfOrderSequenceException; the producer surfaces this as a delivery failure.

Metric signals: record-error-rate spikes; record-retry-rate may be elevated.

Cause: When enable.idempotence=true is set, the broker tracks sequence numbers per partition to detect and discard duplicates. If max.in.flight.requests.per.connection exceeds 5, the broker cannot reliably order sequence numbers across concurrent in-flight batches, which produces this exception.

To resolve this, set max.in.flight.requests.per.connection to 5 or lower when enable.idempotence=true is configured. From Kafka 3.0 onward, the client enforces this constraint automatically when idempotence is explicitly enabled; earlier versions do not.

Serialization errors surfacing as send failures

Symptom: record-error-rate spikes, but broker metrics and network metrics are unchanged.

Metric signals: record-error-rate elevated; request-latency-avg, record-queue-time-avg, and broker-side metrics all remain at normal levels; buffer-available-bytes stays near its maximum.

Cause: Serialization failures occur before a record is written to the RecordAccumulator. If the serializer throws an exception — for example, when a value does not conform to an Avro schema — the record is never queued and no network activity occurs. This is the key distinction from network or broker failures, where buffer utilisation also rises.

Check producer logs for SerializationException or schema registry errors. If you are using Confluent Schema Registry, a schema compatibility violation will produce this pattern: record-error-rate rises while all broker and buffer metrics remain stable.

Compression CPU overhead causing latency regression

Symptom: request-latency-avg increases after enabling or changing compression, even though broker health is unchanged.

Metric signals: request-latency-avg elevated; compression-rate-avg low (indicating good compression, but at a CPU cost); producer host CPU utilisation elevated.

Cause: Compression is performed by the Sender thread before each batch is dispatched. On a CPU-constrained host, heavyweight algorithms such as gzip or zstd can slow the Sender thread enough to increase observed request latency, even though the broker is responding quickly.

To diagnose, compare request-latency-avg before and after enabling compression, and monitor producer host CPU. If compression is causing a latency regression, consider switching to lz4, which typically offers a better throughput-to-CPU ratio than gzip or zstd for latency-sensitive workloads. Weigh the network bandwidth and broker disk savings against the producer CPU cost before deciding.

Best practices for Kafka producer monitoring

  • Monitor record-error-rate and bufferpool-wait-time as the minimum baseline for every producer. These two metrics surface the most operationally significant failure conditions — delivery failures and upstream stalling — and should be alerting targets in any environment.
  • Set acks=all and enable.idempotence=true for any workload where data loss is unacceptable, and account for the resulting increase in request-latency-avg when setting alert thresholds. A latency threshold calibrated for acks=1 will produce false positives under acks=all.
  • Use linger.ms greater than 0 in throughput-optimised workloads, and verify the effect by monitoring batch-size-avg and records-per-request-avg. If batch sizes do not increase after raising linger.ms, the producer may already be limited by batch.size or by low topic throughput.
  • Set client.id explicitly on every producer instance. In environments running multiple producer instances, undifferentiated client IDs make it impossible to isolate a misbehaving instance when record-error-rate or bufferpool-wait-time rises in your aggregated metrics.
  • Align alert thresholds with your service-level objectives rather than generic defaults. The threshold values in this article are reasonable starting points, but what constitutes a normal or degraded baseline varies by workload type and cluster configuration.
  • Correlate producer metrics with broker and consumer metrics before concluding that a producer is the source of a problem. Elevated request-latency-avg is frequently a symptom of broker disk saturation, replication lag, or quota throttling. For a complete picture, refer to the Kafka broker monitoring article and the Kafka consumer monitoring article.
  • For transactional producers, extend your monitoring to include flush-time-ns-total and txn-send-offsets-time-ns-total. Spikes in these values indicate bottlenecks in the transactional commit path that would not appear in the standard metric set but can cause application threads to stall.

Monitor Kafka producers with Factor House

Kpow provides real-time visibility into your Kafka producers without requiring a separate metrics pipeline. You can inspect producer metrics, configuration, and per-topic throughput directly from the UI, alongside broker and consumer monitoring in the same tool.

Give Kpow a try for yourself with a free 30-day trial. You can connect it to any Kafka cluster in minutes and deploy it via Docker, Helm, or JAR.