
Dead letter queues in Kafka: patterns and pitfalls
Table of contents
A single malformed record can block an entire Kafka partition. When your consumer encounters a message it cannot deserialize - a corrupted Avro payload, a JSON field with an unexpected type, a schema version that was deprecated months ago - it throws an exception, retries, and throws again. Consumer group lag climbs. Every healthy message behind that one record waits. This is a poison pill, and the standard solution is a dead letter queue.
Kafka has no native dead letter queue primitive outside of Kafka Connect. The broker has no concept of a failed message; failure is entirely a consumer-side concern. As a result, the DLQ is a pattern you implement in your consumer application, your Connect connector configuration, or your Kafka Streams topology. This article covers all three paths with working examples, explains the retry-topic pattern that should precede any DLQ, and walks through the production failure modes that are frequently omitted from shorter treatments of the topic.
By the end, you should be able to implement a DLQ in your own stack and know which failure modes to test for before shipping to production.
Key takeaways
- Kafka has no native dead letter queue outside of Kafka Connect. In all other cases, the DLQ pattern is implemented at the consumer application, Kafka Connect connector, or Kafka Streams topology layer.
- The production-grade shape is retry-topic plus DLQ: transient failures should cycle through a bounded retry pass before a message is parked, keeping the DLQ as a terminal state rather than another retry tier.
- Spring Kafka, Kafka Connect, and Kafka Streams each handle the DLQ differently. Connect is the only path that requires no custom code; Kafka Streams requires the most.
- The failure modes that most often break a DLQ in production are ordering loss on replay, schema mismatches in the DLQ topic itself, infinite retry loops between the source and retry topics, and side effects being re-applied during blind replay.
- Monitoring the age of the oldest unprocessed DLQ record is more actionable than monitoring topic depth alone, because it is independent of traffic volume.
- Kpow, the Kafka UI, includes bulk actions for managing dead letter queues, letting you clone and replay records from the DLQ to a retry or source topic with RBAC controls and a full audit trail.
What is a dead letter queue, and why doesn't Kafka have one?
In message brokers like RabbitMQ and AWS SQS, a dead letter queue is a broker-level feature. When a message exceeds its delivery attempts or cannot be routed, the broker moves it to a designated DLQ. Application code does not need to handle the routing.
Kafka's architecture works differently. The broker stores ordered byte sequences in partitioned logs and advances consumer offsets when the consumer commits them. Whether a message is valid or processable is a question the consumer answers, not the broker. If a consumer determines a message cannot be processed, it is the consumer's responsibility to publish that record elsewhere and commit the source offset so the partition can advance.
This means the DLQ pattern in Kafka is implemented at one of three layers:
- In the consumer application, using a framework like Spring Kafka with
DefaultErrorHandlerandDeadLetterPublishingRecoverer - In a Kafka Connect connector, using the native
errors.deadletterqueue.*configuration - In a Kafka Streams topology, using custom
DeserializationExceptionHandlerandProductionExceptionHandlerimplementations
Each path covers a different failure surface and has different operational trade-offs.
When you actually need a DLQ
Not every processing failure belongs in a DLQ. The pattern is appropriate in three situations.
Deserialization failures. A message arrives with an incompatible Avro schema, a field with the wrong type, or corrupted bytes. The consumer cannot read the record at all. No amount of retrying will fix a malformed message, so it needs to be parked for inspection.
Business-rule rejections. The message is well-formed and deserializes correctly, but the application logic cannot handle it: a referenced entity no longer exists, a required downstream system returned a permanent error, or a validation rule fails unconditionally. These are not transient failures. The application has determined the message cannot be processed as received.
Retry budget exhaustion. A message fails due to a transient cause - a downstream API timeout, a temporary database outage - but the consumer has retried it enough times that continuing to retry would be more disruptive than parking it. Once the retry budget is exhausted, the message should move to the DLQ rather than blocking the consumer.
Transient failures that are likely to resolve on their own should go through a retry topic first, not directly to the DLQ. The DLQ should be the terminal state: messages there require human or automated intervention to assess and replay.
The retry-topic plus DLQ pattern
The production-grade shape for Kafka consumer error handling combines a retry-topic tier with a terminal DLQ, rather than routing directly to the DLQ on first failure.
The flow works like this:
- A message fails processing on the main consumer.
- If the failure is transient, the consumer publishes the message to a retry topic (preserving the original key) and commits the source offset. The original consumer continues to the next record.
- A retry-topic consumer picks the message up after a delay. If it fails again, it either cascades to a second retry topic with a longer delay, or moves to the DLQ once the retry budget is exhausted.
- If the failure is immediately classified as non-transient (deserialization failure, schema mismatch), the consumer skips the retry tier entirely and publishes directly to the DLQ.
Uber's Insurance Engineering team documented a tiered approach of this kind in their engineering post on reliable reprocessing by Ning Xia and Phani Marupaka. Their implementation uses one Kafka topic per retry level (payments.retry-1, payments.retry-2, and so on), each with a consumer that applies a processing delay before re-attempting. Messages cascade through these levels before reaching the final DLQ topic. Their framing is worth adopting: consumer success means reaching a conclusive outcome for every message, either processed or parked, never silently dropped.
Two distinctions are worth being precise about.
A retry topic is for transient failures: the message may succeed on a second or third attempt once the transient condition resolves. A DLQ is the terminal park: the message has exhausted its remediation options and requires intervention before it can be processed. Treating the DLQ as another retry tier by routing messages back to the source automatically after parking them creates infinite loops, which is covered in the production failure modes section below.
Blocking versus non-blocking retry is a related choice. Blocking retry stalls the consumer on the failed message until it succeeds or a timeout elapses. This preserves strict per-partition ordering but prevents progress on the partition while the retry is running. Non-blocking retry publishes the message to a retry topic immediately and continues processing, trading per-key ordering for throughput. For most production workloads, non-blocking retry is the better default, with the exception of cases where strict per-key ordering is a hard requirement of the downstream system.
Implementing a DLQ with Spring Kafka
Spring Kafka's DefaultErrorHandler combined with DeadLetterPublishingRecoverer is the most common implementation path for JVM-based consumers. Here is a working configuration:
@Configuration
public class KafkaErrorHandlingConfig {
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<?, ?> kafkaTemplate) {
// Routes failed records to <source-topic>.DLT, preserving the source partition.
DeadLetterPublishingRecoverer recoverer =
new DeadLetterPublishingRecoverer(kafkaTemplate);
// Retry up to 3 times at 1-second intervals before invoking the recoverer.
FixedBackOff backOff = new FixedBackOff(1000L, 3L);
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// These exception types bypass retries entirely and go straight to the DLT.
handler.addNotRetryableExceptions(
DeserializationException.class,
ValidationException.class
);
return handler;
}
}
By default, DeadLetterPublishingRecoverer routes to a topic named <source-topic>.DLT and preserves the source partition number. Your DLT topic must have at least as many partitions as the source topic, and the original message key is retained. Both conditions are necessary for replay ordering to work correctly.
One implementation detail worth understanding: the recoverer publishes the failed record to the DLT before the consumer commits the offset on the source topic. This is the at-least-once guarantee in practice. If the consumer crashes after publishing to the DLT but before committing, the source message will be reprocessed on restart, potentially resulting in a duplicate DLT entry. If you are writing to a DLT from a transactional consumer, configure enable.idempotence=true on the DLT producer to guard against this.
Spring Kafka adds a set of diagnostic headers to every DLT record:
HeaderContentskafka_dlt-original-topicSource topic namekafka_dlt-original-partitionPartition the record came fromkafka_dlt-original-offsetOffset within that partitionkafka_dlt-exception-messageException message as a stringkafka_dlt-exception-fqcnFully qualified exception class name
These headers are what allow you to locate the original record, understand what failed, and replay it correctly. Without them, a DLT is a topic full of byte arrays with no diagnostic context.
DLQ in Kafka Connect
Kafka Connect has first-class DLQ support through three connector properties:
errors.tolerance=all
errors.deadletterqueue.topic.name=my-connector.dlq
errors.deadletterqueue.context.headers.enable=true
errors.deadletterqueue.topic.replication.factor=3
Setting errors.tolerance=all without specifying a deadletterqueue.topic.name causes Connect to silently skip failed records. Always pair errors.tolerance=all with an explicit DLQ topic name to ensure failed records are routed rather than dropped.
With errors.deadletterqueue.context.headers.enable=true, Connect adds headers to each DLQ record describing the failure: the exception class, the stack trace, the source connector name, and the source topic, partition, and offset. Robin Moffatt's deep-dive on Kafka Connect error handling covers a useful recovery pattern: chain a second sink connector that reads from the DLQ topic and attempts an alternative deserialization strategy, for example falling back to raw JSON if the primary Avro conversion fails.
There is a significant limitation that is frequently misunderstood: Connect's DLQ catches converter and transform errors only, not task-level errors. When a converter throws during deserialization or a Single Message Transform fails, the record goes to the DLQ. When a Kafka Connect task itself fails due to a misconfigured sink, an unreachable endpoint, or an unhandled exception in the task code, the task halts entirely and no records are routed to the DLQ. Task-level failures require a different remediation path: investigate the task logs, fix the root cause, and restart the connector task.
DLQ in Kafka Streams
Kafka Streams routes failed records through two exception handler interfaces that address different failure surfaces.
DeserializationExceptionHandler fires when a record cannot be deserialized in the source topology. The built-in LogAndContinueExceptionHandler logs a warning and drops the record. This is not a DLQ. The record is gone. If your topology is using LogAndContinueExceptionHandler without a custom handler alongside it, you may be losing records without knowing it. A custom handler that routes to a DLQ topic looks like this:
public class DlqDeserializationHandler implements DeserializationExceptionHandler {
private static final String DLQ_TOPIC = "my-app.deserialization-errors";
private final Producer<byte[], byte[]> dlqProducer;
@Override
public DeserializationHandlerResponse handle(
ProcessorContext context,
ConsumerRecord<byte[], byte[]> record,
Exception exception) {
Headers headers = new RecordHeaders(record.headers());
headers.add("dlq-source-topic",
record.topic().getBytes(UTF_8));
headers.add("dlq-source-partition",
ByteBuffer.allocate(4).putInt(record.partition()).array());
headers.add("dlq-source-offset",
ByteBuffer.allocate(8).putLong(record.offset()).array());
headers.add("dlq-exception-class",
exception.getClass().getName().getBytes(UTF_8));
dlqProducer.send(new ProducerRecord<>(
DLQ_TOPIC, record.partition(),
record.key(), record.value(), headers));
return DeserializationHandlerResponse.CONTINUE;
}
}
Register the handler in your Streams configuration:
props.put(
StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
DlqDeserializationHandler.class
);
ProductionExceptionHandler covers exceptions that occur when Kafka Streams attempts to write output records: serialization failures on the output side, or produce errors to destination topics. A custom ProductionExceptionHandler can route these to a separate DLQ topic using the same pattern.
Comparing the three implementation paths
The mistakes that break a DLQ in production
Losing ordering on replay
When you publish to a DLQ topic, the key you use determines which partition the record lands on. If you change the key or publish with no key, Kafka assigns the record to a partition based on the new (or null) key. When you later replay from the DLQ back to the source topic, the record arrives on a different partition than the original, which breaks per-key ordering guarantees for any downstream consumer that depends on them.
The correct approach is to preserve the original message key when writing to the DLQ. Spring Kafka's DeadLetterPublishingRecoverer does this by default. In custom implementations, explicitly copy the original key. For replay to maintain ordering correctly, your DLQ topic should have the same partition count as the source topic, or you should have a deliberate strategy for re-keying on replay and an understanding of the ordering consequences.
Schema mismatch on the DLQ itself
One common reason a message ends up in a DLQ is a schema mismatch: the consumer cannot deserialize the record against the expected Schema Registry subject. If your DLQ topic is configured to use the same Schema Registry subject as the source topic, you have a circular problem: you still cannot deserialize the record in order to write it to the DLQ, because the deserialization failure is precisely what you are trying to park.
The solution is to write raw bytes to the DLQ rather than using Schema Registry-managed serialization for the DLQ value. Store failure context as record headers: the original schema ID from the Confluent wire format prefix, the source topic and offset, and the exception type. When a reader later inspects the DLQ, the headers tell it what schema was expected. As the source topic evolves through new schema versions, a raw-bytes DLQ remains readable regardless of schema state.
Infinite retry loops between source and retry topic
If your retry-topic consumer routes exhausted messages back to the source topic, and the source consumer routes failed messages back to the retry topic, you have a loop. Both hops may seem individually reasonable, but together they create a circuit that can spin indefinitely on a message that will never succeed.
The mitigation is a retry-count header. Before publishing a message to a retry topic, read the current retry count from the headers, increment it, and write it back. When the count reaches the retry budget, route to the DLQ instead of the retry topic. Spring Kafka's DefaultErrorHandler manages this internally. In custom implementations, you need to track it explicitly and test that the budget is actually enforced before deploying to production.
Idempotency on replay
Replaying a record from the DLQ re-runs your consumer logic. If your consumer executed side effects before the failure occurred - inserting a database row, sending an external notification, initiating a payment - replaying the record will execute those side effects again. Blind replay is safe only for consumers whose operations are idempotent.
Before building replay tooling, audit your consumer for side effects that cannot be safely re-applied. Non-idempotent operations need either a deduplication check keyed on the original offset and topic (available in the DLQ record headers) or a transactional pattern that ties the side effect to the Kafka offset commit.
Silent failure when the DLQ producer itself fails
If the DLQ producer cannot connect, times out, or the DLQ topic does not exist, your consumer needs an explicit policy. The options are: pause and alert without committing the offset, commit the offset and accept the loss, or crash the consumer process.
Committing the offset when the DLQ write has failed means the record is gone permanently. For most applications, this is the wrong trade-off. The better default is to not commit, allowing the consumer to restart and reprocess from the last committed offset, combined with an alert that fires when DLQ write failures begin occurring. This requires that your consumer is idempotent for records it may reprocess, which is another reason to audit for idempotency before the first production incident.
Monitoring and replaying the DLQ
Two metrics provide meaningful signal for a DLQ in production.
Consumer lag on the DLQ topic (topic depth) tells you how many records have been parked and not yet consumed by a replay consumer. An increasing depth signals upstream failures. A depth of zero means the DLQ is clear.
Age of the oldest unprocessed record is more operationally useful than depth alone. A DLQ with 200 records that arrived in the last 10 minutes is a different situation from 200 records sitting there for 8 hours. Most monitoring systems expose the timestamp of the earliest unread record in a topic. An alert that fires when any record has been in the DLQ for more than a configured time threshold is more actionable than a pure depth threshold, because it is independent of traffic volume. It stays relevant at both low and high throughput, and it fires whether lag accumulated slowly or all at once.
For replay, the most common approaches are:
- kcat (formerly kafkacat): consume from the DLQ topic and produce to the source or retry topic via stdin/stdout pipe. Sufficient for small-volume one-off replays.
- Kafka MirrorMaker 2: can mirror a DLQ topic to another cluster or reroute it to a different topic, with filtering via record header predicates. Higher operational overhead for straightforward cases.
- Custom replay service: a dedicated consumer that reads from the DLQ and produces to the source, optionally with a transform step to correct invalid records before re-submission. Most production teams build some version of this eventually.
Off-the-shelf replay tooling for Kafka is thinner than what exists in RabbitMQ or SQS ecosystems, and most teams end up scripting or building their own.
A related architectural choice is whether to maintain one DLQ topic per source topic or a single shared DLQ. One-per-source makes depth monitoring unambiguous and simplifies replay routing: you always know which source the records came from, and a depth alert on payments.DLT points directly at the payments consumer. A shared DLQ is simpler to operate at small scale but becomes harder to attribute as the number of upstream consumers grows, since topic depth is no longer associated with a single system. The replay consumer for a shared DLQ also needs to consult the source topic header on each record to determine where to route it.
Managing Kafka DLQ in Kpow
Kafka has no native DLQ outside of Kafka Connect. In all other cases, the pattern is implemented at the application or framework level, and its quality reflects the care taken in the implementation.
The production-grade shape is retry-topic plus DLQ: transient failures should have a bounded retry pass before a message is parked, and the DLQ itself should be the terminal state rather than another retry tier. The operational surface - monitoring for record age, alerting promptly, and replaying records with confidence about idempotency - is what separates a DLQ that exists from a DLQ that is useful during an incident.
If you want a UI for inspecting DLQ topics, monitoring record age, and replaying records back to a source or retry topic with RBAC controls and a full audit trail, take a look at Kpow.
The Clone to Topic feature introduced in Kpow 96.2 performs a byte-level clone of DLQ records to any permitted topic directly from the Data Inspect interface, with per-topic RBAC policies restricting which topics can serve as clone sources and destinations, and a complete audit log of every replay operation. You can try it free for 30 days against any Kafka cluster.