
Best practices for Kafka data observability
Key takeaways
- Kafka data observability is distinct from broker monitoring. Broker metrics tell you the cluster is healthy; data observability tells you whether the data flowing through it is correct, fresh, complete, and structurally sound.
- Consumer lag is your north-star metric, but raw offset thresholds are an unreliable way to alert on it. Trend-based evaluation per consumer group, per partition, is more accurate.
- Schema Registry with enforced compatibility in CI is non-negotiable above a handful of services. Schema drift causes silent consumer failures that broker dashboards will never surface.
- End-to-end auditing, counting messages at every tier from producer to consumer, catches data loss that infrastructure monitoring alone cannot detect.
- Tools like Kpow consolidate broker health, consumer lag, schema management, and Kafka Streams visibility into a single interface, reducing the operational overhead of maintaining separate observability tooling.
Introduction
Kafka's operational surface area has two distinct layers. The first is the cluster itself: broker health, partition replication, JVM pressure, disk utilisation. The second is the data: whether messages are arriving on time, whether the schema is what consumers expect, whether any messages were lost in transit.
Most teams instrument the first layer thoroughly and underinvest in the second. The consequence is a pattern that practitioners have converged on calling "green dashboards, broken data." The cluster looks healthy. CPU is normal. ISR is stable. And yet, somewhere downstream, a consumer is silently deserialising nulls because a field was renamed three hours ago.
PagerDuty's August 2025 outage is the most thoroughly documented recent example of this failure mode. Brokers looked fine throughout the incident, but a producer-instance leak in a pekko-connectors-kafka integration caused approximately 4.2 million new producers to register per hour, roughly 84 times the normal rate. Brokers exhausted JVM heap tracking producer metadata, the cluster cascaded, and approximately 95% of incoming events were rejected over 38 minutes. PagerDuty's own post-mortem explicitly names two causes: "previously minimal alerting on Kafka" and an "observability gap in Kafka producer and consumer telemetry including anomaly detection for unexpected workloads."
Infrastructure monitoring alone could not have caught it. This guide covers the practices that close that gap.
What Kafka data observability actually means
Data observability, as applied to Kafka, asks five questions about data in motion:
- Freshness: are messages arriving when they should, or is a consumer falling behind?
- Completeness: did every message that was produced reach its consumers, or was something lost in transit?
- Correctness: do the values inside messages pass the quality rules the business expects?
- Structure: is the schema what consumers expect, and do changes to it remain compatible?
- Lineage: which producers, topics, and jobs did the data flow through on its way downstream?
The distinction from cluster monitoring is important. Monitoring tells you consumer lag spiked at 3:15 PM. Observability tells you consumer lag spiked at 3:15 PM, correlating with schema version 47 deployed at 3:12 PM, which introduced a backward-incompatible field removal causing consumer deserialization failures. The shift is from cluster-centric JMX metrics to data-centric signals.
12 best practices for Kafka data observability
1. Use trend-based consumer lag monitoring, not threshold alerts
Raw offset thresholds are a losing proposition for consumer lag alerting. A traffic spike will trip a fixed threshold even when the consumer is keeping up. Aggregating lag across partitions hides per-partition stalls, meaning one stuck partition out of ten will disappear into the average.
LinkedIn's SRE team built Burrow specifically to address this. Rather than alerting on absolute lag values, Burrow evaluates each consumer group over a sliding window of committed offsets and classifies it as OK, WARN, or ERROR based on lag trend. Uber extended the same idea with uGroup, which decodes the __consumer_offsets topic directly because consumer-side metrics cannot fully account for all group activity, particularly when consumers are experiencing problems.
The practical implementation is to scrape Burrow's REST API into Prometheus and alert on ERROR state persisting for five or more minutes. Conduktor reported going from 47 lag alerts per month (2 real) to 3 alerts per month (all real) after switching from threshold-based to rate-of-change alerting.
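If you want to sanity-check a group's trend status outside of Prometheus, Burrow's HTTP API can be polled directly. A minimal sketch in Java, assuming Burrow's v3 endpoint layout and illustrative cluster and group names (verify the path and status values against your Burrow version; a real check would parse the JSON rather than string-match):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BurrowCheck {
    public static void main(String[] args) throws Exception {
        // Burrow v3 endpoint layout; cluster and group names are placeholders.
        String url = "http://burrow:8000/v3/kafka/prod-cluster/consumer/fraud-detection/status";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString()).body();
        // Burrow classifies the group from the lag trend, not an absolute threshold.
        // Only escalate on the error classification.
        if (body.contains("\"status\":\"ERR\"")) {
            System.out.println("Consumer group trending into error: page the on-call.");
        }
    }
}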
For more on this topic, see our consumer lag monitoring guide.
2. Track consumer lag as both a count and a time
Offset lag and time lag measure different things. Whether "a consumer is 50,000 messages behind" is a problem depends entirely on the traffic shape of that topic. "A fraud-detection consumer is 30 seconds behind" is immediately actionable.
Most major platforms expose both. Datadog Data Streams Monitoring, the Confluent Metrics API, and AWS MSK CloudWatch all provide both OffsetLag and EstimatedTimeLag. Time-based lag is the SLO metric that matters to the business. Define your SLOs in time units, alert in time units.
A useful PromQL pattern for lag alerting that accounts for both lag and lack of progress:
max by(group, topic)(kafka_consumer_group_partition_lag) > 10000
and
rate(kafka_consumer_fetch_manager_records_consumed_total[5m]) == 0
This fires on high lag combined with zero consumption progress, which filters out deployment blips and traffic spikes where the consumer is genuinely keeping up.
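If your platform does not expose time lag directly, you can approximate it in the consumer itself by comparing wall-clock time against each record's broker timestamp. A minimal sketch with the standard Java client (bootstrap servers, group, and topic names are illustrative):
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TimeLagProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "fraud-detection");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Approximate time lag: wall-clock now minus the record's timestamp
                    // (CreateTime by default, LogAppendTime if the topic is configured that way).
                    long timeLagMs = System.currentTimeMillis() - record.timestamp();
                    // Export timeLagMs as a gauge or histogram and alert in time units.
                }
            }
        }
    }
}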
3. Enforce schema compatibility in CI before any producer deploys
Above roughly five services producing to Kafka, Schema Registry with enforced compatibility transitions from a nice-to-have to a requirement. The failure mode without it is well-documented: a producer deploys a new schema version that removes a field, consumers expect that field and crash when it is missing, messages accumulate unprocessed, lag grows into the millions, and alerts fire hours after the schema change that caused it.
The defaults that most practitioners converge on: BACKWARD for Avro and JSON Schema, BACKWARD_TRANSITIVE for Protobuf. The distinction matters for Protobuf because adding new message types is not forward-compatible, so transitive checking across all previous versions is required.
The enforcement point is at least as important as the compatibility mode. Community Schema Registry enforces only in the producer SDK, which any client speaking the Kafka wire protocol directly can bypass. Gate schema changes in CI with mvn schema-registry:test-compatibility or the Gradle equivalent, and fail the build on incompatibility. Disable auto.register.schemas in production so rogue producers cannot silently introduce new schemas.
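If your CI runs a small program rather than the Maven plugin, the Confluent Schema Registry Java client exposes the same check. A minimal sketch, assuming the io.confluent:kafka-schema-registry-client dependency, an Avro schema file passed as the first argument, and an illustrative registry URL and subject name:
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompatibilityGate {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        // The candidate schema a producer wants to deploy, read from the repo.
        AvroSchema candidate = new AvroSchema(Files.readString(Path.of(args[0])));
        // Checks the candidate against the subject's configured compatibility mode.
        boolean compatible = client.testCompatibility("payments.payment_completed-value", candidate);
        if (!compatible) {
            System.err.println("Schema change is incompatible; failing the build.");
            System.exit(1);
        }
    }
}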
One important operational note: Confluent Schema Registry defaults to BACKWARD, but AWS Glue Schema Registry defaults to DISABLED and Apicurio defaults to NONE. Always verify and set compatibility modes explicitly rather than relying on defaults.
4. Go beyond schema validation with data contracts
Schema compatibility validates structure. Data contracts extend that to semantics and SLAs.
The Open Data Contract Standard (ODCS), originally open-sourced by PayPal and now maintained under the Bitol project, extends schemas with data-quality rules, SLAs, security classifications, and ownership metadata. Confluent's Data Contracts feature implements field-level validation using Common Expression Language (CEL) rules at serialization time, including validators like isEmail and isUuid, field transformations, and DLQ routing on rule failure.
The boundary on enforcement is worth understanding clearly. Community Schema Registry validates client-side only. Anything that speaks the Kafka wire protocol directly sidesteps the entire enforcement chain. Broker-side validation is available as a paid Confluent feature; Redpanda's Wasm Data Transforms provide broker-side validation in open source. If you cannot use broker-side enforcement, supplement contract validation with audit-tier message counting (see practice 7).
5. Implement dead letter queues with rich metadata and active monitoring
A dead letter queue with no observability is a silent failure sink. Messages arrive, nobody sees them, and what should have been a detectable error becomes an invisible data hole.
The pattern that most teams have converged on, originally published by Uber Insurance Engineering: a tiered retry structure with a main topic, a series of count-based retry topics with exponential backoff (topic.retry.1 at one minute, topic.retry.2 at five minutes), and a final DLQ for manual review. Each DLQ message should carry headers recording the original topic, partition, offset, exception class, attempt count, and stack trace.
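The header set is the part teams most often skimp on. A minimal sketch of routing a failed record to the DLQ with that provenance attached (the DLQ topic name, header keys, and producer wiring are illustrative):
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DlqRouter {
    private final KafkaProducer<byte[], byte[]> producer;

    public DlqRouter(KafkaProducer<byte[], byte[]> producer) {
        this.producer = producer;
    }

    // Forward a poison record to the DLQ, preserving provenance in headers.
    public void sendToDlq(ConsumerRecord<byte[], byte[]> failed, Exception cause, int attempt) {
        ProducerRecord<byte[], byte[]> dlqRecord =
            new ProducerRecord<>("payments.dlq", failed.key(), failed.value());
        dlqRecord.headers()
            .add("dlq.original.topic", failed.topic().getBytes(StandardCharsets.UTF_8))
            .add("dlq.original.partition", Integer.toString(failed.partition()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.original.offset", Long.toString(failed.offset()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.exception.class", cause.getClass().getName().getBytes(StandardCharsets.UTF_8))
            .add("dlq.attempt", Integer.toString(attempt).getBytes(StandardCharsets.UTF_8));
        producer.send(dlqRecord);
    }
}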
Kafka Connect has had native DLQ support for sink connectors since version 2.0 (errors.deadletterqueue.topic.name). Kafka Streams requires custom DeserializationExceptionHandler and ProductionExceptionHandler implementations.
The monitoring rules: alert on DLQ message rate above 10 messages per second for five minutes as a warning, and DLQ backlog above 1,000 messages for 15 minutes as critical. Feed DLQ message rates into ksqlDB to drive per-sink breach alerts.
6. Track end-to-end latency with OpenTelemetry trace context in message headers
Broker and consumer metrics cannot tell you how long a specific message spent in a topic waiting to be consumed. OpenTelemetry's messaging semantic conventions solve this at the per-message level.
Producers inject a W3C traceparent header into each Kafka message. Consumers extract it and create a child span linked to the producer span. The gap between the producer span ending and the consumer span starting is the topic dwell time for that message. Relevant span attributes per the OpenTelemetry messaging semantic conventions include messaging.system=kafka, messaging.destination.name, messaging.kafka.partition, messaging.kafka.message.offset, and messaging.message.id.
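A minimal sketch of the producer-side injection with the OpenTelemetry Java API (the consumer side mirrors it with the propagator's extract); the TextMapSetter writes the traceparent and tracestate values into the record's Kafka headers:
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TraceContextInjector {
    // Writes W3C trace context (traceparent/tracestate) into Kafka record headers.
    private static final TextMapSetter<ProducerRecord<String, String>> SETTER =
        (record, key, value) -> record.headers().add(key, value.getBytes(StandardCharsets.UTF_8));

    public static void inject(ProducerRecord<String, String> record) {
        GlobalOpenTelemetry.getPropagators()
            .getTextMapPropagator()
            .inject(Context.current(), record, SETTER);
    }
}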
For high-throughput topics, use tail-based sampling. Keep all error spans and all spans exceeding p99 latency; sample the rest. Strimzi shipped OpenTelemetry support for Kafka Connect, MirrorMaker 2, and the Kafka Bridge in 2023.
A practical note: OpenTelemetry messaging semantic conventions are still evolving as of mid-2026. Some span attributes including messaging.kafka.consumer.group and messaging.message.body.size are stable; others remain experimental. Pin your instrumentation library versions in production.
7. Build end-to-end auditing for true data-loss detection
This is the practice that separates mature streaming organisations from those that discover data loss during incident response. Broker metrics, consumer lag, and distributed traces do not tell you whether a message was dropped in transit between tiers. End-to-end auditing does.
The pattern, implemented by Uber as Chaperone and Netflix as Inca: every tier (proxy, broker, mirror, aggregate, consumer) emits an audit message to a dedicated audit topic with a tuple of message ID, tier name, count, and time window. A stream-processing job (Flink at Netflix, custom at Uber) keys by message ID, fans across tiers, and emits a missing-trace signal when an expected tier did not report within the time window.
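The audit messages themselves are simple; the aggregation job downstream is where the real work happens. A sketch of what a per-tier emitter might publish at the close of each window (the audit topic name, tier labels, and JSON shape are illustrative, not a published format):
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AuditEmitter {
    private final KafkaProducer<String, String> producer;
    private final String tier;  // e.g. "proxy", "mirror", "consumer"

    public AuditEmitter(KafkaProducer<String, String> producer, String tier) {
        this.producer = producer;
        this.tier = tier;
    }

    // Emit one audit record per (source topic, time window): how many messages this tier saw.
    public void emit(String sourceTopic, long windowStartEpochMs, long count) {
        String key = sourceTopic + "|" + windowStartEpochMs;
        String value = String.format(
            "{\"tier\":\"%s\",\"topic\":\"%s\",\"windowStart\":%d,\"count\":%d}",
            tier, sourceTopic, windowStartEpochMs, count);
        producer.send(new ProducerRecord<>("audit.tier-counts", key, value));
    }
}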
Netflix's Inca uses a Flink GlobalWindow with a custom trigger and reduces state via consumer offsets. It produces loss-rate, duplicate-rate, and end-to-end latency as SLO metrics, plus the IDs of lost messages to a Kafka topic for automated re-fetch.
Uber's published example of why this matters: a dead-loop bug in uReplicator caused silent data loss. Neither uReplicator nor the Kafka brokers triggered any alerts. Only the end-to-end message counting job detected the loss. A pure broker-and-replicator-metric observability stack would never have surfaced it.
This level of investment is appropriate once you have more than 100 topics or more than 10 teams producing to Kafka. For earlier-stage deployments, the priority is practices 1 through 4.
8. Enforce topic naming conventions through CI
Topic naming is cheap to get right and expensive to get wrong. Kafka topics cannot be renamed after creation, so naming decisions made during initial development persist indefinitely.
The community-converged convention is hierarchical:
<env>.<visibility>.<type>.<domain>.<entity>-by-<key>-v<n>
# Example
prod.public.fct.payments.payment_completed-by-order_id-v2
Hard rules: lowercase only, choose either . or _ as the separator and use it consistently (mixing both creates collisions on JMX metric names), exclude any fields that might change over time such as team name, service name, or owner, and reserve the version suffix for backward-incompatible breaks only.
Disable auto.create.topics.enable on all production clusters. Provision topics through GitOps using Strimzi KafkaTopic custom resources, Confluent for Kubernetes, or an in-house provisioning API. Validate names against a regex in a pre-commit hook. DoorDash replaced its Terraform-based topic-creation flow with an in-house API and reduced real-time pipeline onboarding time by 95%.
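The regex itself is short enough to live in a pre-commit hook or CI step. A sketch of the check in Java, where the allowed environment, visibility, and type tokens are illustrative placeholders to replace with your own taxonomy:
import java.util.regex.Pattern;

public class TopicNameCheck {
    // <env>.<visibility>.<type>.<domain>.<entity>-by-<key>-v<n>, lowercase only.
    private static final Pattern TOPIC_NAME = Pattern.compile(
        "^(dev|staging|prod)\\.(public|private)\\.(fct|cdc|cmd|sys)\\."
        + "[a-z0-9_]+\\.[a-z0-9_]+-by-[a-z0-9_]+-v[0-9]+$");

    public static void main(String[] args) {
        for (String name : args) {
            if (!TOPIC_NAME.matcher(name).matches()) {
                System.err.println("Invalid topic name: " + name);
                System.exit(1);
            }
        }
    }
}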
9. Set retention and compaction policies with observability in mind
Compacted topics (cleanup.policy=compact) only retain the most recent value per key. Intermediate values may be deleted by compaction before a consumer reads them. This breaks audit-tier counting and any observability approach that relies on replaying message history.
Reserve compaction for state changelogs, which is what Kafka Streams uses internally. Use cleanup.policy=delete for event topics, with retention.ms set to your maximum replay window. If you need long retention without the cost of persistent SSD storage, tiered storage (available in Confluent Cloud and as an open-source implementation from Pinterest via KIP-405) lets you extend retention to object storage.
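Setting these explicitly, rather than inheriting broker defaults, is worth automating. A sketch using the AdminClient to pin an event topic to delete cleanup with a seven-day replay window (the topic name and retention value are illustrative):
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(
                ConfigResource.Type.TOPIC, "prod.public.fct.payments.payment_completed-by-order_id-v2");
            Collection<AlterConfigOp> ops = List.of(
                // Event topic: keep full history for the replay window, do not compact.
                new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.ms",
                    String.valueOf(7L * 24 * 60 * 60 * 1000)), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}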
Track LogSegmentBytes per topic and alert on disk-pressure-driven retention truncation, which silently shortens your replay window without any explicit configuration change.
10. Add heartbeats and offset-translation verification for multi-cluster setups
In multi-cluster deployments, MirrorMaker 2's replication of data is generally reliable. The harder problems are detecting replication lag, verifying offset translation, and coordinating client failover.
MirrorMaker 2's MirrorHeartbeatConnector produces a heartbeat on every source cluster every five seconds. Alert if the heartbeat does not appear on the destination within a small multiple of that interval. MirrorCheckpointConnector translates consumer offsets across clusters; verify it by spot-checking that sync.group.offsets.enabled=true and that a test consumer can fail over cleanly.
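A freshness check on the destination cluster can be as simple as tailing the replicated heartbeats topic and comparing the latest record timestamp to now. A sketch, where the bootstrap servers, the replicated topic name (typically the source alias plus ".heartbeats"), and the staleness threshold are illustrative:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HeartbeatFreshnessCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "destination-cluster:9092");
        props.put("group.id", "mm2-heartbeat-check");
        props.put("auto.offset.reset", "latest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        long maxStalenessMs = 30_000;  // tune to a few multiples of the heartbeat interval

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("source.heartbeats"));  // replicated heartbeats topic
            long lastSeen = System.currentTimeMillis();
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(5))) {
                    lastSeen = record.timestamp();
                }
                if (System.currentTimeMillis() - lastSeen > maxStalenessMs) {
                    System.err.println("No replicated heartbeat for " + maxStalenessMs
                        + " ms: replication may be stalled.");
                }
            }
        }
    }
}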
Every published Kafka DR retrospective converges on the same lesson: replication is the solved part. Coordinating client failover during an actual incident, while also managing schema-registry consistency and consumer-group state across clusters, is where DR plans actually fail. Test your DR plan with current tooling and find the coordination gaps before investing in additional cross-cluster observability tooling.
11. Alert on SLOs and symptoms, not on infrastructure causes
URP (Under-Replicated Partitions) is the most commonly over-alerted Kafka metric. Todd Palino, who led SRE for Kafka at LinkedIn, has explicitly stated that URP does not map to an SLO and is often not actionable; it should be collected for forensics but should not page anyone.
The metrics that should page are those that map directly to user-visible impact: under-min-ISR partitions (genuine data-loss risk), consumer-group ERROR state from trend-based lag monitoring, DLQ rate breach, schema compatibility failure rate, and audit-tier loss-rate breach.
The practical approach:
- Define SLOs first: "p99 end-to-end latency under 5 seconds for the payments topic" or "fraud-detection consumer lag under 30 seconds."
- Alert on SLO burn rate using multi-window burn-rate alerting.
- Demote URP, ISR shrink, broker CPU, and request-handler idle ratio to dashboards available for forensics, not to PagerDuty.
Aggregate consumer group lag with max() per partition, not avg(). One stalled partition out of ten will be invisible in the average.
12. Treat data lineage as a runtime artifact
Static documentation of data lineage becomes inaccurate as soon as a pipeline changes. Runtime lineage, emitted as events as jobs execute, stays current.
OpenLineage's specification uses Job, Run, and Dataset entities with Facets for schema, quality, and other metadata. It emits events natively from Airflow, dbt, Spark, and Flink to Marquez (the reference implementation) or DataHub (originally LinkedIn, now maintained by Acryl Data). DataHub treats Kafka topics as first-class datasets, parses Kafka Connect SMT configurations including RegexRouter and EventRouter to resolve the actual destination topic, and supports column-level lineage where schema-registry data is available.
Emit OpenLineage events from every job that touches Kafka. The producer-to-topic-to-consumer graph becomes queryable, auditable, and accurate in near-real time rather than a diagram that someone last updated six months ago.
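Where a job's framework does not emit OpenLineage natively, events can be posted to the lineage backend directly. A rough sketch of posting a hand-built run event to Marquez's HTTP API; the endpoint path, minimal event shape, and Kafka dataset naming are assumptions to verify against the OpenLineage spec and your Marquez version:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.UUID;

public class LineageEmitter {
    public static void main(String[] args) throws Exception {
        // Minimal COMPLETE event: one Kafka topic in, one topic out, for a named job run.
        String event = """
            {
              "eventType": "COMPLETE",
              "eventTime": "%s",
              "run": {"runId": "%s"},
              "job": {"namespace": "streaming", "name": "payments-enricher"},
              "inputs": [{"namespace": "kafka://prod", "name": "prod.public.fct.payments.payment_completed-by-order_id-v2"}],
              "outputs": [{"namespace": "kafka://prod", "name": "prod.public.fct.payments.payment_enriched-by-order_id-v1"}],
              "producer": "https://example.com/payments-enricher"
            }""".formatted(Instant.now().toString(), UUID.randomUUID());

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://marquez:5000/api/v1/lineage"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(event))
            .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}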
Tooling stack
No single tool covers the full observability surface, but a comprehensive Kafka management platform like Kpow by Factor House gets close, combining broker monitoring, consumer lag, Schema Registry and DLQ management, and a full Kafka UI in one place. For teams that need additional capabilities, the ecosystem includes purpose-built tools for cluster metrics, consumer lag, schema management, tracing, lineage, and ad-hoc topic inspection.
A reasonable starting point for a team with no existing investment: Prometheus + JMX Exporter + Kafka Exporter for cluster metrics, Burrow for consumer lag, Apicurio or Confluent Schema Registry community edition for schemas, OpenTelemetry Collector with Jaeger or Tempo for tracing, and OpenLineage with Marquez or DataHub for lineage. Add a Kafka UI for ad-hoc inspection and correlating data-level signals with cluster state. This broadly matches Cloudflare's observability stack and the direction Shopify moved toward when consolidating off third-party tooling.
For teams evaluating their options more broadly, see our guide to the best Kafka monitoring tools.
How Kpow helps with Kafka data observability
Kpow by Factor House covers several of the observability concerns described in this article within a single deployable tool.

Consumer lag visibility. Kpow surfaces consumer group lag at the group, broker, topic, and partition level. It handles both active groups and EMPTY consumer groups, calculating lag for empty groups directly from start and end offsets via the AdminClient rather than relying on a cached snapshot. This matters in scenarios where a poison message has caused all instances of a consumer group to go offline; Kpow can still read the offsets and allow you to reset them without requiring the group to be running. Kpow also identifies simple consumers (those using manual partition assignment without group coordination), which appear in their own tab and are common in some Flink and Spark deployments.
Schema Registry management. Kpow integrates with Schema Registry, allowing you to inspect schema versions, view compatibility settings, and manage schemas directly from the UI. This is useful when investigating deserialization errors or verifying that compatibility modes are set correctly across environments.
Prometheus egress for alerting integration. Kpow exposes Prometheus endpoints following the OpenMetrics standard, making it straightforward to pipe Kafka metrics into your existing Grafana dashboards or AlertManager setup. Available endpoints include /metrics/v1 for all cluster metrics, /group-offsets/v1 for per-assignment group offset data, /offsets/v1 for topic partition offsets, and /streams/v1 for Kafka Streams metrics from connected agents. The group offset endpoint exposes group_assignment_delta, group_assignment_last_read, and group_assignment_offset at the partition assignment level, which gives you the granularity needed for accurate lag alerting.
Kafka Streams observability. Through the kpow-streams-agent, Kpow collects Kafka Streams application metrics and exposes them via the /streams/v1 and /streams/v1/state Prometheus endpoints. This provides visibility into Kafka Streams topology state alongside broker and consumer metrics.
Kafka Connect management. Kpow allows you to manage Kafka Connect connectors from the same interface, which reduces the context-switching involved when investigating pipeline issues that span brokers, consumers, and connectors.
Kpow connects to any Kafka cluster and deploys via Docker, Helm, or JAR. If you want to evaluate it against your current observability setup, you can try Kpow free for 30 days.
Implementation roadmap
The practices in this guide are not all equal in complexity or return on investment. A phased approach:
Within 30 days (foundation):
- Deploy trend-based consumer lag monitoring (Burrow or equivalent) and replace any threshold-based lag alerts with ERROR/WARN classification per consumer group.
- Stand up Schema Registry with compatibility set explicitly: BACKWARD for Avro/JSON Schema, BACKWARD_TRANSITIVE for Protobuf. Disable auto.register.schemas in production. Add compatibility checking to CI as a hard gate.
- Disable auto.create.topics.enable on all production clusters. Move topic provisioning into GitOps.
- Audit your current alerts. Demote URP, ISR shrink, and broker CPU to dashboards. Keep paging for under-min-ISR partitions, consumer-group ERROR, DLQ rate breach, and schema compatibility failure rate.
Within 90 days (data-level observability):
- Instrument all producers and consumers with OpenTelemetry. Ensure trace context propagates in message headers. Export to Jaeger or Tempo with tail-based sampling.
- Deploy tiered DLQ patterns for any consumer where message loss is unacceptable. Instrument DLQs with rich headers and alert on rate and backlog.
- Define an SLO per business-critical topic. Track burn rate and alert on multi-window burn.
- Emit OpenLineage events from any Kafka Connect, Flink, Spark, or Airflow job touching Kafka. Ingest into Marquez or DataHub.
Within 6 months (mature pattern, applicable above 100 topics or 10 producing teams):
- Build or adopt end-to-end auditing: count messages at each tier and use a Flink or ksqlDB job to detect loss and duplication per message ID.
- Adopt data contracts beyond schemas: field-level rules, SLAs, and ownership. Choose between Confluent Data Contracts (paid, broker-side enforcement), ODCS with Redpanda Data Transforms (open source, broker-side), or producer-side CI validation.
- Consolidate dashboards so broker health, consumer lag, schema events, DLQ rates, and end-to-end traces are visible in a single view.
If you are experiencing more than three production incidents per quarter caused by schema changes, accelerate the schema enforcement work. If your mean time to detection for streaming pipeline incidents exceeds five minutes, end-to-end tracing and SLO-based alerting are the highest-leverage next steps.