
A detailed guide to Kafka producer monitoring
Key takeaways
- The Kafka Java producer client exposes metrics through JMX under the kafka.producer:type=producer-metrics MBean, covering throughput, latency, buffering, error rates, and connection state.
- record-error-rate and bufferpool-wait-time are the two most important metrics to alert on. The first signals delivery failures; the second signals back-pressure that can stall application threads.
- Producer metrics should be correlated with broker-side metrics. Elevated request-latency-avg on the client often originates from broker disk saturation, replication lag, or quota throttling rather than from the producer itself.
- Configuration choices (acks, linger.ms, batch.size, max.in.flight.requests.per.connection, and compression.type) directly affect observable metrics and should inform how you interpret them.
- This article covers the Apache Kafka Java producer client, targeting Kafka 3.x (with notes on 2.x where relevant). Clients based on librdkafka, used by Python, Go, and C producers, expose a different metric surface and are not covered here.
What is Kafka producer monitoring?
Kafka producer monitoring is the practice of observing the runtime state of a Kafka producer client: how many records it is sending, how quickly, whether any are failing, and whether back-pressure from the broker is affecting the application.
The Apache Kafka Java producer client uses an asynchronous two-thread architecture. When your application calls producer.send(), records are serialized and written to the RecordAccumulator, a memory buffer that groups records into batches by partition. A background Sender thread then drains those batches over persistent TCP connections to the appropriate broker leaders.
This design decouples your application thread from network I/O, but it also means failures and bottlenecks can be indirect. A broker that is slow to acknowledge writes does not immediately cause send() to fail. Instead, batches accumulate in the buffer until buffer.memory is exhausted, at which point your application thread blocks. Understanding this cascade is central to interpreting producer metrics correctly.
Key producer metrics to monitor
All Java producer metrics are exposed through JMX. Producer-level metrics use the MBean pattern:
kafka.producer:type=producer-metrics,client-id={client-id}
Topic-scoped metrics are available separately under:
kafka.producer:type=producer-topic-metrics,client-id={client-id},topic={topic}
The metric names below apply to Kafka 3.x. Most were introduced in Kafka 2.x and are consistent across both versions.
Throughput metrics
Both of these are topic-scoped metrics available under the producer-topic-metrics MBean and require a topic label in your query.
Latency metrics
Error and retry metrics
Batching efficiency metrics
Buffer and back-pressure metrics
Connection metrics
Producer configuration effects on metrics
The configuration you pass to the producer at startup has a direct and measurable effect on the metrics you observe at runtime. Understanding these relationships helps you interpret metric values correctly and tune configuration to meet your performance objectives.
acks setting
The acks parameter controls how many broker acknowledgments the producer waits for before considering a record successfully delivered.
- acks=0: the producer does not wait for any acknowledgment. request-latency-avg will be minimal, but record-error-rate will not reflect broker-side failures, so records can be lost silently without the client being aware.
- acks=1: the leader broker acknowledges after writing to its local log. This produces moderate request-latency-avg values and captures leader-side failures.
- acks=all (or acks=-1): the leader waits for all in-sync replicas to acknowledge before responding. This produces the highest request-latency-avg values (typically two to five times higher than acks=1 in a healthy cluster) but provides the strongest durability guarantee.
Under acks=all, any replication lag on follower replicas will appear directly in request-latency-avg. Correlate elevated latency values with the broker-side PurgatorySize metric to determine whether the delay originates from the replication layer.
Idempotence and transactions
Setting enable.idempotence=true provides exactly-once delivery semantics per partition. The broker assigns each producer a producer ID and tracks sequence numbers per partition, discarding any duplicates that result from retries.
Idempotent producers tend to show higher baseline record-retry-rate values than non-idempotent producers, because the client retries aggressively on transient failures. This is expected behavior; the broker deduplicates on its side.
When enable.idempotence=true is configured, max.in.flight.requests.per.connection must be 5 or lower. Exceeding this limit prevents the broker from guaranteeing ordering of sequence numbers across concurrent in-flight batches, which can produce OutOfOrderSequenceException errors that surface as elevated record-error-rate.
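A pre-flight check along the following lines can catch this mismatch before a producer is constructed. This is a minimal sketch using only java.util.Properties; the class IdempotenceConfigCheck and its problems method are hypothetical helpers, not part of the Kafka client API, and the sketch treats an absent enable.idempotence key as "not explicitly enabled".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check; not part of the Kafka client API. */
final class IdempotenceConfigCheck {

    /** Returns readable problems; an empty list means the settings are compatible. */
    static List<String> problems(Properties props) {
        List<String> problems = new ArrayList<>();
        // Assumption: only an explicit enable.idempotence=true is validated here.
        boolean idempotent =
                Boolean.parseBoolean(props.getProperty("enable.idempotence", "false"));
        if (!idempotent) {
            return problems;
        }
        // 5 is the client default for max.in.flight.requests.per.connection.
        int inFlight = Integer.parseInt(
                props.getProperty("max.in.flight.requests.per.connection", "5").trim());
        if (inFlight > 5) {
            problems.add("max.in.flight.requests.per.connection=" + inFlight
                    + " exceeds the limit of 5 for idempotent producers");
        }
        return problems;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        props.setProperty("max.in.flight.requests.per.connection", "10");
        System.out.println(problems(props));
    }
}
```

Running a check like this in a unit test or at application startup surfaces the conflict as a configuration problem rather than as a runtime spike in record-error-rate.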
For transactional producers, two additional metrics are worth monitoring: flush-time-ns-total, which records cumulative nanoseconds spent blocked inside .flush() during transactional boundary commits, and txn-send-offsets-time-ns-total, which tracks time spent publishing offset maps within transactions.
linger.ms and batch.size
linger.ms controls how long the Sender thread waits before dispatching an incomplete batch. batch.size sets the maximum batch size in bytes.
A higher linger.ms value — for example, 50 to 100ms — allows more records to accumulate per batch. You will observe higher batch-size-avg and records-per-request-avg, and lower network overhead per record. The trade-off is increased record-queue-time-avg, since records wait longer in the accumulator before being sent.
A lower linger.ms value — 0 to 5ms — minimises record-queue-time-avg, which is appropriate for latency-sensitive workloads, but typically produces smaller batches and lower throughput per network request.
If batch-size-avg is consistently close to your configured batch.size, the producer is filling batches before the linger timer fires. This is a sign of a high-throughput workload where linger.ms has less influence on batch efficiency.
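Whether a batch is closed by batch.size or by linger.ms can be estimated from the per-partition byte rate. The sketch below is a back-of-envelope calculation, not client behaviour: BatchFillEstimate is a hypothetical helper, the 1 MB/s rate is an assumed input, and the estimate ignores compression and per-record overhead.

```java
/** Back-of-envelope estimate; not part of the Kafka client API. */
final class BatchFillEstimate {

    /** Milliseconds needed to fill one batch at a steady per-partition byte rate. */
    static double fillTimeMs(int batchSizeBytes, double bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            return Double.POSITIVE_INFINITY; // an idle partition never fills a batch
        }
        return batchSizeBytes / bytesPerSecond * 1000.0;
    }

    /** True when a batch is expected to fill before linger.ms expires. */
    static boolean fillsBeforeLinger(int batchSizeBytes, double bytesPerSecond, long lingerMs) {
        return fillTimeMs(batchSizeBytes, bytesPerSecond) <= lingerMs;
    }

    public static void main(String[] args) {
        int batchSize = 16_384;     // batch.size default, in bytes
        double rate = 1_000_000.0;  // assumed: 1 MB/s written to one partition
        System.out.printf("fill time: %.3f ms%n", fillTimeMs(batchSize, rate));
        System.out.println("linger.ms=5  fills first: " + fillsBeforeLinger(batchSize, rate, 5));
        System.out.println("linger.ms=50 fills first: " + fillsBeforeLinger(batchSize, rate, 50));
    }
}
```

With these assumed numbers a batch takes roughly 16 ms to fill, so linger.ms=5 closes batches early while linger.ms=50 lets batch.size become the limiting factor.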
max.in.flight.requests.per.connection
This setting controls how many unacknowledged produce requests the Sender thread can pipeline on a single broker connection before blocking.
Higher values increase throughput by allowing more requests in flight simultaneously, but they introduce ordering risks. If a broker returns an error for request N and the producer retries, request N+1 may arrive before the retry completes, resulting in out-of-order delivery on that partition.
For idempotent producers, the broker's sequence number tracking handles this risk — but only up to 5 concurrent in-flight requests per connection. Beyond that, idempotence guarantees do not hold.
Monitor requests-in-flight alongside record-retry-rate. If retries are frequent and requests-in-flight is consistently at the configured maximum, reducing the in-flight limit may improve delivery stability at some cost to throughput.
compression.type
Compression is applied per batch before the batch is dispatched by the Sender thread. The available options are none, gzip, snappy, lz4, and zstd.
compression-rate-avg reflects the effectiveness of the chosen algorithm: a value of 0.4 means the compressed batch is 40% of its original size. Lower values indicate better compression, which reduces network bandwidth and broker disk usage.
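The arithmetic behind that ratio can be written out as a short sketch. CompressionSavings is a hypothetical helper and the 1,000,000-byte batch volume is an assumed figure chosen only to make the numbers easy to read.

```java
/** Illustrative arithmetic for compression-rate-avg; not part of the Kafka client API. */
final class CompressionSavings {

    /** Bytes on the wire for a given uncompressed volume and compression-rate-avg. */
    static long compressedBytes(long uncompressedBytes, double compressionRateAvg) {
        return Math.round(uncompressedBytes * compressionRateAvg);
    }

    /** Percentage of bytes saved; a rate of 0.4 corresponds to a 60% saving. */
    static double savedPercent(double compressionRateAvg) {
        return (1.0 - compressionRateAvg) * 100.0;
    }

    public static void main(String[] args) {
        long uncompressed = 1_000_000L; // assumed volume for illustration
        double rate = 0.4;              // example value of compression-rate-avg
        System.out.println("bytes sent:  " + compressedBytes(uncompressed, rate));
        System.out.println("saving in %: " + savedPercent(rate));
    }
}
```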
The CPU cost varies significantly by algorithm. lz4 and snappy offer fast compression with moderate ratios, which suits latency-sensitive workloads. zstd and gzip achieve better compression ratios but are more CPU-intensive. If you observe higher request-latency-avg after enabling compression, check producer host CPU utilisation — the Sender thread may be CPU-bound.
Compression shifts CPU work from the broker to the producer at write time, and to the consumer at read time. Weigh the network and storage savings against the CPU cost on both sides.
buffer.memory and max.block.ms
buffer.memory sets the total size of the RecordAccumulator pool in bytes (default: 32 MB). max.block.ms sets how long an application thread will block in send() waiting for buffer space before throwing a TimeoutException.
When broker throughput drops — due to disk saturation, replication lag, or quota throttling — the Sender thread drains batches more slowly. Batches accumulate in the accumulator, buffer-available-bytes drops toward zero, and application threads begin to block. bufferpool-wait-time rises above zero. If a thread remains blocked for longer than max.block.ms, the call fails with a TimeoutException.
waiting-threads rising above zero is always a signal that the producer is experiencing back-pressure. Increasing buffer.memory gives more headroom before application threads block, but does not address the underlying cause of the back-pressure.
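How much headroom a given buffer.memory provides can be approximated from the gap between the rate at which the application produces bytes and the rate at which the Sender thread drains them. The sketch below is a simplified model under stated assumptions: BufferHeadroom is a hypothetical helper, both rates are assumed constant, and the buffer is assumed to start empty.

```java
/** Simplified headroom model; not part of the Kafka client API. */
final class BufferHeadroom {

    /** Seconds until buffer.memory is exhausted, or infinity if the Sender keeps up. */
    static double secondsUntilFull(long bufferMemoryBytes,
                                   double produceBytesPerSec,
                                   double drainBytesPerSec) {
        double netFillRate = produceBytesPerSec - drainBytesPerSec;
        if (netFillRate <= 0) {
            return Double.POSITIVE_INFINITY; // the buffer never fills
        }
        return bufferMemoryBytes / netFillRate;
    }

    public static void main(String[] args) {
        long bufferMemory = 32L * 1024 * 1024;  // buffer.memory default (32 MB)
        double produce = 10.0 * 1024 * 1024;    // assumed: application writes 10 MB/s
        double drain = 2.0 * 1024 * 1024;       // assumed: degraded broker accepts 2 MB/s
        double seconds = secondsUntilFull(bufferMemory, produce, drain);
        System.out.println("send() starts blocking after about " + seconds + " s");
    }
}
```

In this assumed scenario the default buffer absorbs about four seconds of broker slowdown before send() begins to block; doubling buffer.memory doubles that window but leaves the drain rate unchanged.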
Collecting producer metrics
JMX collection from the producer process
The Kafka Java producer client registers JMX MBeans with the platform MBean server of its JVM automatically. No additional configuration is required for local JMX access.
To enable remote JMX access for external tooling, pass the following JVM flags when starting your producer application:
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
For production environments, replace the authenticate=false and ssl=false flags with appropriate authentication and TLS configuration.
Once the JMX port is open, you can connect using tools like jconsole, jmxterm, or any JMX-compatible client to inspect MBeans under kafka.producer:type=producer-metrics,client-id={client-id}.
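Because the MBeans live in the platform MBean server, they can also be read from code running inside the producer JVM, with no remote port at all. The following is a minimal sketch using only the JDK's javax.management API; ProducerJmxReader is a hypothetical helper, and the attribute name passed to it is assumed to exist on the MBean.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical in-process reader for producer-metrics MBeans. */
final class ProducerJmxReader {

    // Matches the producer-level MBean of every client.id in this JVM.
    static final String PATTERN = "kafka.producer:type=producer-metrics,client-id=*";

    /** Reads one attribute from each matching MBean, keyed by client-id. */
    static Map<String, Object> readAttribute(String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            Set<ObjectName> names = server.queryNames(new ObjectName(PATTERN), null);
            for (ObjectName name : names) {
                values.put(name.getKeyProperty("client-id"),
                           server.getAttribute(name, attribute));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + attribute, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints an empty map when no producer has been created in this JVM.
        System.out.println(readAttribute("record-error-rate"));
    }
}
```

This approach suits in-application health endpoints; for fleet-wide collection an exporter is usually the more practical route.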
Prometheus JMX exporter
The most common production approach is to run the Prometheus JMX Exporter as a Java agent inside the producer JVM. It polls the local MBean server and exposes the metrics as a Prometheus-compatible HTTP endpoint, without requiring remote JMX access.
Add the agent to your JVM startup arguments:
export JAVA_OPTS="-javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent-0.20.0.jar=9404:/etc/prometheus/kafka-producer-jmx-rules.yml"
A rules configuration file that captures both producer-level and topic-level metrics:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Producer-level metrics (throughput, latency, buffer, errors)
  - pattern: 'kafka.producer<type=producer-metrics, client-id=(.+)><>([^:]+):'
    name: kafka_producer_$2
    type: GAUGE
    labels:
      client_id: "$1"
  # Topic-level metrics (per-topic record-send-rate, byte-rate)
  - pattern: 'kafka.producer<type=producer-topic-metrics, client-id=(.+), topic=(.+)><>([^:]+):'
    name: kafka_producer_topic_$3
    type: GAUGE
    labels:
      client_id: "$1"
      topic: "$2"
This configuration exposes metrics like kafka_producer_record_error_rate{client_id="orders-service"} and kafka_producer_topic_record_send_rate{client_id="orders-service",topic="orders"} at http://localhost:9404/metrics.
In Strimzi-managed Kubernetes deployments, you can keep the JMX port (typically 9999) open for tooling while dedicating port 9404 to Prometheus scraping.
Multi-producer aggregation
When running multiple producer instances — across replicated services or multiple application pods — you need a strategy for identifying individual producers in your metrics pipeline.
Each producer instance is identified by its client.id configuration. Set this to something that includes both the service name and the instance identifier, for example orders-service-pod-0. This value becomes the client_id label in Prometheus, allowing you to filter and aggregate metrics per producer.
If client.id is left unset, the client generates an ID of the form producer-1, numbered within each JVM, so the first producer in every instance of a fleet will appear identical in your metrics store. When a single instance begins experiencing elevated record-error-rate or bufferpool-wait-time, you will have no way to isolate it.
Tag producer metrics with at least:
- client_id: the producer instance identifier
- topic: the target topic (available on topic-scoped metrics)
- Any additional labels your platform uses for environment, region, or service ownership
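One way to derive a distinct client.id per instance is to combine the service name with an instance identifier taken from the environment. The sketch below assumes the HOSTNAME environment variable carries the pod name, as it does by default in Kubernetes; ClientIds is a hypothetical helper and the "unknown" fallback is an arbitrary choice.

```java
/** Hypothetical helper for building a per-instance client.id. */
final class ClientIds {

    /** Joins service and instance names, falling back to "unknown" when no instance is known. */
    static String build(String service, String instance) {
        String suffix = (instance == null || instance.trim().isEmpty())
                ? "unknown"
                : instance.trim();
        return service + "-" + suffix;
    }

    public static void main(String[] args) {
        // Assumption: HOSTNAME holds the pod or host name of this instance.
        String clientId = build("orders-service", System.getenv("HOSTNAME"));
        System.out.println("client.id=" + clientId);
    }
}
```

The resulting value would be passed as the client.id property when the producer is configured.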
Micrometer and OpenTelemetry integration
Micrometer
If you are using Spring Kafka or another framework that integrates with Micrometer, the KafkaClientMetrics binder publishes the producer's client metrics to any Micrometer-supported registry. Metric names follow the same naming convention as the JMX attributes but are prefixed with kafka.producer. and use dots rather than hyphens or underscores: for example, kafka.producer.record.error.rate instead of kafka_producer_record_error_rate.
OpenTelemetry
The OpenTelemetry JMX Metric Gatherer supports Kafka producers natively through its kafka-producer target system. The gatherer runs as a sidecar or separate process, queries the producer JVM over JMX/RMI, and pushes metrics to an OpenTelemetry Collector:
receivers:
  jmx/producer:
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    endpoint: localhost:9999
    target_system: kafka-producer
    collection_interval: 10s
exporters:
  otlp:
    endpoint: otel-collector.monitoring.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [jmx/producer]
      exporters: [otlp]
The OTel target system exposes metric names prefixed with kafka.producer. — for example, kafka.producer.record-retry-rate and kafka.producer.request-latency-avg.
Producer health check script
The script below provides a point-in-time diagnostic for a running Kafka producer. It queries a Prometheus JMX Exporter endpoint and evaluates the key indicators of producer health against fixed thresholds.
Companion scripts covering broker-side and consumer-side checks are described in the Kafka broker monitoring article and the Kafka consumer monitoring article. The three scripts share the same pattern: query the metrics endpoint, evaluate against thresholds, and exit with a status code compatible with standard monitoring integrations.
Prerequisites:
- The Prometheus JMX Exporter is running on the producer JVM (default port: 9404)
- curl and awk are available in the shell environment
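Thresholds (as set in the script):
- record-error-rate: CRITICAL above 0.0
- bufferpool-wait-time: CRITICAL above 0.10
- waiting-threads: CRITICAL above 0
- record-queue-time-avg: WARNING above 500 ms
- request-latency-avg: WARNING above 200 ms
- record-retry-rate: WARNING above 10.0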
#!/usr/bin/env bash
# kafka-producer-health-check.sh
#
# Checks the health of a running Kafka producer by querying its
# Prometheus JMX Exporter endpoint.
#
# Usage:
#   ./kafka-producer-health-check.sh [client_id] [metrics_port]
#
# Arguments:
#   client_id     (optional) Filter output to a specific producer client.id
#   metrics_port  (optional) JMX Exporter HTTP port. Default: 9404
#
# Exit codes:
#   0 = OK
#   1 = WARNING
#   2 = CRITICAL
#   3 = UNKNOWN
CLIENT_ID="${1:-}"
METRICS_PORT="${2:-9404}"
METRICS_URL="http://localhost:${METRICS_PORT}/metrics"
# Thresholds
CRIT_RECORD_ERROR_RATE="0.0"
CRIT_BUFFERPOOL_WAIT="0.10"
CRIT_WAITING_THREADS="0"
WARN_QUEUE_TIME_MS="500"
WARN_REQUEST_LATENCY_MS="200"
WARN_RETRY_RATE="10.0"
EXIT_CODE=0
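# Print the first sample value for a metric, optionally filtered by label.
fetch_metric() {
  local name="$1"
  local label_filter="$2"
  if [ -n "$label_filter" ]; then
    curl -sf "$METRICS_URL" | grep "^${name}{" | grep -F "${label_filter}" | awk '{print $2}' | head -1
  else
    # Require "{" or a space after the name so that longer metric names
    # sharing the same prefix (for example a *_total variant) do not match.
    curl -sf "$METRICS_URL" | grep -E "^${name}[{ ]" | awk '{print $2}' | head -1
  fi
}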
# Compare a metric value against a threshold and record the worst severity seen.
check() {
  local label="$1"
  local value="$2"
  local threshold="$3"
  local op="$4"
  local severity="$5"
  if [ -z "$value" ]; then
    printf "%-10s %s (metric not found)\n" "UNKNOWN" "$label"
    # CRITICAL (2) outranks UNKNOWN (3) regardless of the order of checks.
    [ "$EXIT_CODE" -ne 2 ] && EXIT_CODE=3
    return
  fi
  local triggered
  triggered=$(awk "BEGIN { print (${value} ${op} ${threshold}) ? 1 : 0 }")
  if [ "$triggered" -eq 1 ]; then
    printf "%-10s %s = %s [threshold %s %s]\n" "$severity" "$label" "$value" "$op" "$threshold"
    [ "$severity" = "CRITICAL" ] && EXIT_CODE=2
    [ "$severity" = "WARNING" ] && [ "$EXIT_CODE" -lt 2 ] && EXIT_CODE=1
  else
    printf "%-10s %s = %s\n" "OK" "$label" "$value"
  fi
}
LABEL_FILTER=""
[ -n "$CLIENT_ID" ] && LABEL_FILTER="client_id=\"${CLIENT_ID}\""
echo "Kafka producer health check"
echo "Endpoint: ${METRICS_URL}"
echo "Client ID: ${CLIENT_ID:-all}"
echo "---"
check "record-error-rate" \
  "$(fetch_metric kafka_producer_record_error_rate "$LABEL_FILTER")" \
  "$CRIT_RECORD_ERROR_RATE" ">" "CRITICAL"
check "bufferpool-wait-time" \
  "$(fetch_metric kafka_producer_bufferpool_wait_time "$LABEL_FILTER")" \
  "$CRIT_BUFFERPOOL_WAIT" ">" "CRITICAL"
check "waiting-threads" \
  "$(fetch_metric kafka_producer_waiting_threads "$LABEL_FILTER")" \
  "$CRIT_WAITING_THREADS" ">" "CRITICAL"
check "record-queue-time-avg (ms)" \
  "$(fetch_metric kafka_producer_record_queue_time_avg "$LABEL_FILTER")" \
  "$WARN_QUEUE_TIME_MS" ">" "WARNING"
check "request-latency-avg (ms)" \
  "$(fetch_metric kafka_producer_request_latency_avg "$LABEL_FILTER")" \
  "$WARN_REQUEST_LATENCY_MS" ">" "WARNING"
check "record-retry-rate" \
  "$(fetch_metric kafka_producer_record_retry_rate "$LABEL_FILTER")" \
  "$WARN_RETRY_RATE" ">" "WARNING"
# Informational: buffer utilisation derived from available and total bytes.
AVAIL=$(fetch_metric kafka_producer_buffer_available_bytes "$LABEL_FILTER")
TOTAL=$(fetch_metric kafka_producer_buffer_total_bytes "$LABEL_FILTER")
if [ -n "$AVAIL" ] && [ -n "$TOTAL" ] && [ "$TOTAL" != "0" ]; then
  USED_PCT=$(awk "BEGIN { printf \"%.1f\", (1 - ${AVAIL} / ${TOTAL}) * 100 }")
  printf "%-10s %s = %s%% (%s of %s bytes available)\n" \
    "INFO" "buffer-utilisation" "$USED_PCT" "$AVAIL" "$TOTAL"
fi
echo "---"
case "$EXIT_CODE" in
  0) echo "Result: OK" ;;
  1) echo "Result: WARNING" ;;
  2) echo "Result: CRITICAL" ;;
  3) echo "Result: UNKNOWN" ;;
esac
exit "$EXIT_CODE"

Limitations: This script evaluates one point in time. It does not detect trends, multi-instance comparisons, or intermittent spikes. It also does not query broker-side metrics, so a clean result here does not rule out a broker problem contributing to producer behaviour. Use it for incident investigation or pre-deployment checks, not as a substitute for continuous alerting.
Alerting strategy for Kafka producer monitoring
The table below describes a baseline alert set for Kafka producer monitoring. For each alert, the condition reflects sustained behaviour rather than instantaneous spikes, which reduces false positives during normal transient events like leader elections.
The following Prometheus alerting rule definitions, for use with Alertmanager, cover the most critical conditions:
groups:
  - name: kafka-producer-alerts
    rules:
      - alert: ProducerDeliveryFailures
        expr: rate(kafka_producer_record_error_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Producer {{ $labels.client_id }} is dropping records"
          description: >
            The record error rate on client {{ $labels.client_id }} has exceeded 0.1 errors/second
            for 2 minutes. Records are failing to deliver and may be lost.
      - alert: ProducerBackPressure
        # bufferpool-wait-time-total is cumulative nanoseconds, so the per-second
        # rate is divided by 1e9 to obtain the fraction of time spent waiting.
        expr: rate(kafka_producer_bufferpool_wait_time_total[2m]) / 1e9 > 0.15
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Producer {{ $labels.client_id }} buffer exhaustion"
          description: >
            Memory allocation wait rate for client {{ $labels.client_id }} has exceeded 15% over
            2 minutes. Application threads are blocking on .send() calls.
      - alert: ProducerQueueDelay
        expr: kafka_producer_record_queue_time_avg > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Producer {{ $labels.client_id }} accumulator queue delay"
          description: >
            Average record queue wait time has exceeded 500ms for 5 minutes. This may indicate
            broker degradation, network bottlenecks, or quota throttling.
      - alert: ProducerHighRetryRate
        expr: rate(kafka_producer_record_retry_total[5m]) > 10.0
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate on producer {{ $labels.client_id }}"
          description: >
            The record retry rate is exceeding 10 retries/second for 3 minutes. This suggests
            transient network instability or a cluster rebalancing event.
The thresholds above are starting points rather than universal standards. A request-latency-avg of 200ms may be acceptable for a batch analytics pipeline but not for a payment processing service. Adjust thresholds to match your SLOs and your cluster's normal operating range.
Common Kafka producer issues and how to diagnose
Producer stalling due to full send buffer
Symptom: Application threads block on send(), eventually throwing TimeoutException after max.block.ms elapses.
Metric signals: bufferpool-wait-time approaching or above 0.10; buffer-available-bytes near zero; waiting-threads above 0; record-queue-time-avg elevated.
Diagnostic steps:
- Confirm that buffer-available-bytes is depleted and bufferpool-wait-time is non-zero. This pattern confirms back-pressure rather than a serialization issue: serialization failures drop records before they reach the buffer, leaving buffer-available-bytes near its maximum.
- Check broker TotalTimeMs and its component latencies. If the local log append component (t_local) is elevated, the broker disk may be saturated. If the replication wait component (t_remote) is elevated, there may be ISR lag.
- Check broker RequestQueueSize. A growing queue indicates that the broker's request handler threads are saturated and cannot process incoming requests fast enough.
- If the broker is healthy, review batch-size-avg relative to batch.size. An oversized batch.size can cause the buffer to fill faster than the Sender thread can drain it, particularly at low throughput where batches take longer to fill.
Retry storm from leader election
Symptom: A spike in record-retry-rate and request-latency-avg, typically coinciding with a broker restart or failure event.
Metric signals: record-retry-rate spikes; request-latency-avg increases; record-error-rate may briefly rise before returning to zero.
Cause: During a partition leader election, brokers return NotLeaderOrFollowerException (named NotLeaderForPartitionException before Kafka 2.6) for produce requests targeting the old leader. The Kafka client retries these requests automatically once metadata is refreshed and a new leader is elected.
Brief retry spikes during failover are expected and do not indicate data loss, as long as record-error-rate returns to zero after the election completes and record-retry-rate subsides. If retries persist, check that retries and retry.backoff.ms are set appropriately for your cluster's typical election duration, and verify that the affected topic's ISR is healthy.
Idempotent producer failure under high in-flight requests
Symptom: The broker returns OutOfOrderSequenceException; the producer surfaces this as a delivery failure.
Metric signals: record-error-rate spikes; record-retry-rate may be elevated.
Cause: When enable.idempotence=true is set, the broker tracks sequence numbers per partition to detect and discard duplicates. If max.in.flight.requests.per.connection exceeds 5, the broker cannot reliably order sequence numbers across concurrent in-flight batches, which produces this exception.
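To resolve this, set max.in.flight.requests.per.connection to 5 or lower when enable.idempotence=true is configured. In Kafka 3.x, the client throws a ConfigException at construction if idempotence is explicitly enabled alongside a conflicting value; if idempotence is not set explicitly, a conflicting value disables it instead, so check the effective configuration logged at producer startup.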
Serialization errors surfacing as send failures
Symptom: record-error-rate spikes, but broker metrics and network metrics are unchanged.
Metric signals: record-error-rate elevated; request-latency-avg, record-queue-time-avg, and broker-side metrics all remain at normal levels; buffer-available-bytes stays near its maximum.
Cause: Serialization failures occur before a record is written to the RecordAccumulator. If the serializer throws an exception — for example, when a value does not conform to an Avro schema — the record is never queued and no network activity occurs. This is the key distinction from network or broker failures, where buffer utilisation also rises.
Check producer logs for SerializationException or schema registry errors. If you are using Confluent Schema Registry, a schema compatibility violation will produce this pattern: record-error-rate rises while all broker and buffer metrics remain stable.
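The distinction between a serialization problem and back-pressure can be expressed as a small triage rule over the metrics named above. This is a sketch, not a diagnostic standard: ErrorSpikeTriage is a hypothetical helper, and the 10% and 90% buffer fractions are illustrative cut-offs that would need tuning against a real baseline.

```java
/** Hypothetical triage rule for a record-error-rate spike. */
final class ErrorSpikeTriage {

    enum Cause { BACK_PRESSURE, LIKELY_SERIALIZATION, INCONCLUSIVE }

    // Illustrative cut-offs (assumptions, not Kafka defaults).
    static final double LOW_AVAILABLE = 0.10;
    static final double HIGH_AVAILABLE = 0.90;

    static Cause classify(double recordErrorRate, double bufferAvailableBytes,
                          double bufferTotalBytes, double waitingThreads) {
        if (bufferTotalBytes <= 0) {
            return Cause.INCONCLUSIVE; // buffer metrics missing or not yet reported
        }
        double available = bufferAvailableBytes / bufferTotalBytes;
        // Depleted buffer or blocked threads: records reached the accumulator.
        if (waitingThreads > 0 || available < LOW_AVAILABLE) {
            return Cause.BACK_PRESSURE;
        }
        // Errors while the buffer stays nearly empty: records never got queued.
        if (recordErrorRate > 0 && available > HIGH_AVAILABLE) {
            return Cause.LIKELY_SERIALIZATION;
        }
        return Cause.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify(5.0, 33_000_000, 33_554_432, 0));
        System.out.println(classify(5.0, 1_000_000, 33_554_432, 3));
    }
}
```

A rule like this only narrows the search; the producer logs remain the place to confirm the actual exception.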
Compression CPU overhead causing latency regression
Symptom: request-latency-avg increases after enabling or changing compression, even though broker health is unchanged.
Metric signals: request-latency-avg elevated; compression-rate-avg low (indicating good compression, but at a CPU cost); producer host CPU utilisation elevated.
Cause: Compression is performed by the Sender thread before each batch is dispatched. On a CPU-constrained host, heavyweight algorithms such as gzip or zstd can slow the Sender thread enough to increase observed request latency, even though the broker is responding quickly.
To diagnose, compare request-latency-avg before and after enabling compression, and monitor producer host CPU. If compression is causing a latency regression, consider switching to lz4, which typically offers a better throughput-to-CPU ratio than gzip or zstd for latency-sensitive workloads. Weigh the network bandwidth and broker disk savings against the producer CPU cost before deciding.
Best practices for Kafka producer monitoring
- Monitor record-error-rate and bufferpool-wait-time as the minimum baseline for every producer. These two metrics surface the most operationally significant failure conditions (delivery failures and upstream stalling) and should be alerting targets in any environment.
- Set acks=all and enable.idempotence=true for any workload where data loss is unacceptable, and account for the resulting increase in request-latency-avg when setting alert thresholds. A latency threshold calibrated for acks=1 will produce false positives under acks=all.
- Use linger.ms greater than 0 in throughput-optimised workloads, and verify the effect by monitoring batch-size-avg and records-per-request-avg. If batch sizes do not increase after raising linger.ms, the producer may already be limited by batch.size or by low topic throughput.
- Set client.id explicitly on every producer instance. In environments running multiple producer instances, undifferentiated client IDs make it impossible to isolate a misbehaving instance when record-error-rate or bufferpool-wait-time rises in your aggregated metrics.
- Align alert thresholds with your service-level objectives rather than generic defaults. The threshold values in this article are reasonable starting points, but what constitutes a normal or degraded baseline varies by workload type and cluster configuration.
- Correlate producer metrics with broker and consumer metrics before concluding that a producer is the source of a problem. Elevated request-latency-avg is frequently a symptom of broker disk saturation, replication lag, or quota throttling. For a complete picture, refer to the Kafka broker monitoring article and the Kafka consumer monitoring article.
- For transactional producers, extend your monitoring to include flush-time-ns-total and txn-send-offsets-time-ns-total. Spikes in these values indicate bottlenecks in the transactional commit path that would not appear in the standard metric set but can cause application threads to stall.
Monitor Kafka producers with Factor House
Kpow provides real-time visibility into your Kafka producers without requiring a separate metrics pipeline. You can inspect producer metrics, configuration, and per-topic throughput directly from the UI, alongside broker and consumer monitoring in the same tool.
Give Kpow a try for yourself with a free 30-day trial. You can connect it to any Kafka cluster in minutes and deploy it via Docker, Helm, or JAR.