How Datadog uses Apache Kafka in production

Kafka

Factor House·June 2, 2026·19 min read

Datadog processes hundreds of trillions of observability events per day across logs, metrics, and traces, and Apache Kafka sits at the centre of that pipeline. The scale of the problem meant off-the-shelf Kafka tooling and clients eventually hit limits, and Datadog’s engineering teams have built a significant amount of infrastructure around Kafka to keep up: a custom Rust client, a control plane that routes traffic across hundreds of clusters without redeployment, and open-source SRE tooling that has been in production since 2018.

Company overview

Datadog is a cloud-based observability and monitoring platform used by engineering and operations teams to monitor infrastructure, applications, and logs. Its core product ingests telemetry data from customer environments, runs it through real-time processing pipelines, and makes it queryable and alertable at low latency.

The scale of that data ingestion is what drives most of Datadog’s Kafka decisions. As of February 2025, Datadog runs hundreds of Kafka clusters, thousands of topics, millions of partitions, and hundreds of consumer groups, handling terabytes of data per second.

Kafka has been documented in Datadog’s engineering blog since at least 2018, when the SRE team open-sourced kafka-kit to handle the operational overhead of 40+ clusters processing trillions of datapoints per day. The architecture has evolved substantially since then, with the most recent major development being the Streaming Platform: a custom abstraction layer that decouples Kafka clients from physical cluster topology.

Key Kafka milestones:

August 2018: kafka-kit open-sourced, automating partition rebalancing, broker replacements, and replication throttle management across 40+ clusters
June 2019: First detailed public documentation of Datadog’s Kafka operations, covering multi-region deployments and configuration practices
May 2019 (Kafka Summit London): Balthazar Rouberol presents on running Kafka clusters in Kubernetes
May 2022: Husky event store introduced, with Kafka as its primary ingestion front-end
February 2023: Exactly-once ingestion semantics and shard routing in Husky documented publicly
February 2025: Streaming Platform architecture published, revealing the full scope of Datadog’s Kafka infrastructure and the custom Rust client
2025 (multiple posts): CDC replication platform, configuration distribution, and live process metrics pipeline all documented, showing the breadth of Kafka’s role across teams

Datadog’s Kafka use cases

Observability event ingestion

The primary role of Kafka at Datadog is as the durable buffer for the Data Platform. All incoming observability events – logs, spans, and metrics – pass through Kafka before reaching downstream storage and processing systems. Guillaume Bort, a Staff Software Engineer at Datadog, described this as the foundation of the entire observability pipeline in a February 2025 engineering blog post.

Metrics storage ingestion (Husky)

Datadog’s third-generation event store, Husky, uses Kafka as its ingestion layer. Writers in Husky read from Kafka, buffer events briefly in memory, then upload them to blob storage (AWS S3). A Shard Router service reads from one Kafka cluster and writes to a separate output cluster with data organised into logical shards – groups of partitions that map to physical storage units. This design allows Kafka scaling and storage scaling to proceed independently.

Exactly-once ingestion routing

Husky’s second generation introduced a locality concept embedded in the Kafka pipeline: a given event’s tenant ID and timestamp deterministically route it to the same Kafka shard on every ingestion attempt. Combined with FoundationDB for metadata transactions, this eliminates duplicates without requiring stateful writers. Daniel Intskirveli and Cecilia Watt documented this in a February 2023 post.

CDC-based data replication

Kafka is also the backbone of Datadog’s internal change data capture (CDC) replication platform, which handles Postgres-to-Postgres replication, Postgres-to-Apache Iceberg pipelines for analytics, Cassandra replication, and cross-region Kafka replication. Sanketh Balakrishna and Andrew Zhang published the architecture in November 2025, documenting approximately 500 ms end-to-end replication lag for the initial use case.

Live process metrics pipeline

The live process metrics feature uses Kafka for host subscription signalling. The live data server publishes to a host subscriptions topic at 1-second intervals; the intake service consumes from that topic and maintains an in-memory cache of which hosts are currently being viewed, activating high-frequency data collection only for those hosts. Kai Zong Khor and William Yu described this architecture in an August 2025 post.

Configuration distribution

Kafka carries cache-invalidation notifications when tenant configuration changes are written to the database, propagating those changes to downstream services. Gabriel Reid documented this use case in a June 2025 post, noting that a newer v2 design complements Kafka with cloud object storage for this role.

Scale and throughput

As documented by Guillaume Bort in February 2025:

Clusters: Hundreds of Kafka clusters across Kubernetes and cloud environments
Topics: Thousands of topics
Partitions: Millions of partitions
Consumer groups: Hundreds of consumer groups
Daily event volume: Hundreds of trillions of observability events per day
Throughput: Terabytes of data per second

For context, the open-source kafka-kit page describes “double-digit gigabytes per second” bandwidth requiring petabytes of high-performance storage even with short retention windows.

In 2019, Emily Chang documented 40+ Kafka and ZooKeeper clusters processing trillions of datapoints daily across multiple data centres and regions, which gives a sense of how much the fleet has grown since then.

On a smaller scale, one indicative example is the live process metrics pipeline: before an architecture change in 2025, the host subscriptions topic was handling 500,000 messages per second. After redesigning when and what data was collected, that fell to 5,000 messages per second – a 100x reduction – while freeing over 600 CPU cores and 1 TB of memory.

Datadog’s Kafka architecture

The Streaming Platform

The most significant architectural layer above raw Kafka is the Streaming Platform, introduced publicly in February 2025. Rather than having clients connect to physical Kafka clusters, the Streaming Platform provides three logical abstractions:

Streams are multi-cluster logical topics. A Stream has a stable identifier that is independent of the physical cluster topology. Producers and consumers reference a Stream name; the control plane resolves that to one or more physical topics on one or more clusters. This decoupling means the underlying cluster can change without application redeployment.

Stream Lanes provide quality-of-service tiers within a Stream. A single Stream can have multiple lanes: a real-time lane for live traffic, a batch lane for lower-priority or late-arriving data, and a dead-letter queue lane for poison pills. When a malformed message blocks a consumer, the DLQ lane copies it out of the main partition, allowing the partition to make progress.

The Assigner is a custom multi-cluster coordinator that replaces Kafka’s default consumer group coordinator. It monitors cluster health and workload distribution across the entire fleet and can redirect traffic to a different cluster, decommission a cluster, or reallocate partitions – all without consumer redeployment.

libstreaming: the custom Kafka client

All client interaction with the Streaming Platform goes through libstreaming, a unified Kafka client library written in Rust with bindings for Java, Go, and Python. It initially wrapped rdkafka, but rdkafka’s handling of compressed batches (see Challenges below) led Datadog to replace it with a fully custom Kafka client in Rust. libstreaming manages cluster discovery, failover, and the advanced commit log (described below).

Topic organisation (kafka-kit model)

Rather than running one or a few large clusters, Datadog treats topics as first-class capacity units. Topics are grouped into shared pools for smaller workloads or promoted to dedicated pools when they outgrow shared resources. The kafka-kit tooling automates the placement logic, using either even partition distribution (“count” strategy) or storage-based bin-packing (“storage” strategy).

Multi-region redundancy

For critical pipelines, Datadog uses duplicate writes to primary and secondary Kafka clusters. If an unclean leader election occurs on one cluster, the data is recoverable from the other. Emily Chang documented this pattern in 2019 as a response to data loss events caused by out-of-sync replicas being elected leader before Kafka 0.11.0 made exactly-once semantics more broadly available.

Advanced commit log

The standard Kafka consumer offset model tracks a single pointer per partition, which means a consumer must choose between processing live traffic and reprocessing a backlog. Datadog’s Streaming Platform extends the commit metadata to track multiple offsets or offset ranges within a single partition simultaneously. This allows consumers to handle both live traffic and concurrent backlog reprocessing without head-of-line blocking.

Kubernetes deployment

All Kafka clusters run self-managed on Kubernetes, using StatefulSets and persistent volumes. Martin Dickson, a Senior Software Engineer on the Datadog Kafka team, covered this in a March 2024 talk, noting that Datadog operates dozens of self-managed Kubernetes clusters across a multi-cloud environment.

CDC pipeline architecture

The CDC replication platform uses Debezium as the change capture connector for Postgres and Cassandra sources. Debezium serialises captured changes into Avro format and publishes them to Kafka topics, along with schema updates, to a multi-tenant Kafka Schema Registry. Kafka Connect (with single message transforms for field manipulation, topic renaming, and column filtering) handles the fault-tolerant data movement between systems.

Producer architecture

Producers interact with the Streaming Platform through libstreaming, which manages batching and routing transparently. The system uses Zstandard (zstd) compression across pipelines. At the scale Datadog operates, even compression can introduce problems – an incoming compressed batch may expand to gigabytes on decompression, which the custom client addresses by enforcing limits on decompressed batch size rather than compressed size.

Consumer architecture

Consumer group management is handled by The Assigner rather than Kafka’s native group coordinator. Lag monitoring uses wall-clock time rather than pure message-count offsets: the advanced commit log enriches metadata with ingestion timestamps, and the Streaming Platform computes time lag by consuming __consumer_offsets topics across all clusters. This allows automated rebalancing to trigger based on actual data age.

For the live process metrics pipeline, each entry in the Kafka-backed cache includes a TTL, so hosts expire automatically when collection stops without requiring an explicit deletion event.

Special techniques and engineering innovations

Live traffic failover without redeployment

The Streaming Platform’s Assigner can redirect all traffic for a Stream from one cluster to another in seconds. The control plane creates a new topic on the target cluster, redirects producers and consumers, then drains the old topic. The same mechanism handles proactive load redistribution and cluster decommissioning without any application-side changes.

Relaxed ordering for parallelism

Datadog moved away from strict per-partition ordering for the main observability pipeline. Rather than relying on Kafka partition order for correctness, events are processed with at-least-once delivery and ordering is deferred to the downstream Husky storage layer. This unlocks substantially greater parallelism at petabyte scale.

Deterministic shard routing for deduplication

Husky’s shard routing assigns events to Kafka shards based on tenant ID and timestamp. Because the mapping is deterministic, the same event will always land in the same shard regardless of how many times it is re-ingested. Combined with FoundationDB for transactional metadata, this provides exactly-once semantics without stateful writers or distributed locking.

Storage-based partition rebalancing

kafka-kit’s topicmappr tool supports two placement strategies: count-based (even partition distribution across brokers) and storage-based (bin-packing partitions according to their current storage metrics). The storage strategy integrates with Datadog’s own metrics API to pull live partition size data, and includes rack-aware replica placement. This reduces the manual effort of broker replacements and rebalances at a fleet of hundreds of clusters.

TTL-based subscription expiry

In the live process metrics pipeline, host subscription state in the Kafka-backed cache includes a TTL. This handles ungraceful terminations: if a service crashes without sending a deletion event, its subscription entries expire naturally, preventing the intake service from continuing to collect data for a host no one is viewing.

Operating Kafka at scale

Deployment model: Entirely self-managed on Kubernetes, across dozens of Kubernetes clusters in a multi-cloud environment. Datadog does not use Confluent Cloud, MSK, or other managed Kafka services for its production clusters.

Monitoring: Datadog monitors its Kafka infrastructure with its own platform, tracking MBean metrics including MessagesInPerSec, BytesInPerSec, and ISR shrinks/expands. The kafka-kit autothrottle tool integrates directly with the Datadog metrics API to manage replication throttle rates dynamically during broker replacements.

Segment configuration tuning: For low-throughput topics, default Kafka segment settings can cause log retention to exceed the configured topic-level retention. Emily Chang documented reducing segment.ms to 43,200,000 ms (12 hours) and segment.bytes to approximately 100 MB to align segment lifecycle with topic-level retention. Open file handles are monitored to ensure frequent segment rollouts do not exhaust OS limits.

Offset retention: Consumer offset retention was extended from the default 1 day to 7 days, preventing data reprocessing on topics where consumer groups are temporarily offline.

Configuration change testing: Configuration changes are validated on replicated or mirrored clusters before being applied to production. ISR shrinks/expands and segment sizes are monitored continuously during rollouts.

Developer tooling: The kafka-kit suite (open-sourced August 2018, maintained through at least v4.2.1 in July 2023) handles topic mapping, partition rebalancing, broker replacements, and replication throttle management. It is available at github.com/DataDog/kafka-kit.

Schema governance: The CDC replication platform uses a multi-tenant Kafka Schema Registry with backward compatibility enforced. Schema changes are validated against the registry before application, preventing pipeline breakage from schema drift.

Challenges and how they solved them

Static cluster bindings preventing rapid failover

Applications were bound to physical Kafka cluster addresses at deploy time. Redirecting traffic to a different cluster required a full redeployment cycle. Datadog built the Streaming Platform abstraction so that producers and consumers reference logical Stream names, not cluster addresses. The Assigner resolves those names to physical clusters and can redirect traffic in seconds. The result is zero-downtime cluster failover and live traffic migration without touching application configuration.

Single-pointer offset model blocking concurrent backlog processing

Kafka’s standard offset model gives each consumer group a single pointer per partition. Reprocessing a backlog means abandoning live traffic on that partition, or running a separate consumer group with separate infrastructure. Datadog built a custom commit log that stores multiple offsets or offset ranges in a partition’s commit metadata simultaneously. Consumers can process live traffic and catch up on a backlog at the same time, on the same partition.

zstd compression bombs crashing consumers

The rdkafka client – which Datadog initially used as the foundation of libstreaming – limits batch sizes by compressed size. A message batch that is small when compressed can expand to gigabytes on decompression if the payload is highly repetitive. At Datadog’s ingestion rate, compressed batches large enough to crash consumer processes were not a theoretical problem. Datadog replaced rdkafka with a custom Rust Kafka client that enforces decompressed size limits. Consumer crashes from oversized decompressed batches were eliminated.

Offset-based lag insufficient for operational decisions

Standard Kafka consumer lag is expressed in message count: how many messages behind the latest offset the consumer group sits. That figure provides no information about how old the unprocessed data is. A consumer group that is 10 million messages behind on a high-throughput topic may be seconds behind in wall-clock time; on a low-throughput topic, those same 10 million messages might represent days of data. Datadog enriched commit metadata with ingestion timestamps and built wall-clock time lag computation by consuming __consumer_offsets topics across all clusters. Automated rebalancing now triggers on actual data age rather than message count.

Unclean leader elections causing data loss

In multi-region deployments running pre-0.11.0 Kafka (which enabled unclean leader elections by default), an out-of-sync replica could be elected leader when the in-sync leader became unavailable, discarding the unreplicated messages. Datadog’s response was duplicate writes: data written to both a primary and a secondary Kafka cluster. A data loss event on one cluster left the data intact on the other. The combination of Kafka 0.11.0’s changes to leader election defaults and duplicate-write redundancy eliminated permanent data loss from this failure mode.

Log segment retention exceeding topic-level retention on low-throughput topics

Kafka’s default segment.ms value is 7 days. On topics with a 36-hour retention setting and low message volume, segments stayed open for the full 7 days before rolling, meaning data was retained for longer than the configured policy. Reducing segment.ms to 12 hours and segment.bytes to approximately 100 MB brought segment lifecycle in line with topic-level retention.

Head-of-line blocking from poison pills and traffic spikes

A malformed message in a partition blocks all consumers on that partition until it is handled, and a traffic spike from one high-volume workload can starve other workloads sharing the same partitions. Stream Lanes within the Streaming Platform address both problems. Real-time, batch, and dead-letter workloads run on separate lanes within the same logical Stream. When the system detects a poison pill, it copies the message to the DLQ lane, allowing the main lane to advance. Different workload types can no longer block each other at the partition level.

Live process pipeline consuming 500,000 messages/second for inactive data

The intake service was receiving all process metrics for all hosts, even those that no user was actively viewing. The Kafka-consuming service maintained in-memory state for the entire host fleet. Kai Zong Khor and William Yu redesigned the pipeline so that the live data server only activates 2-second collection intervals for hosts currently being viewed, and signals this state to the intake service via Kafka at 1-second intervals. Peak throughput on the host subscriptions topic dropped from 500,000 to 5,000 messages per second, and the supporting infrastructure was scaled down by 98%.

Full tech stack

Category	Tools	Notes
Message broker	Apache Kafka	Self-managed; specific version not disclosed publicly
Kafka client	libstreaming (custom, Rust)	Replaces rdkafka; bindings for Java, Go, Python
Kafka client (legacy/external)	rdkafka / librdkafka	Still used in some services and Observability Pipelines
Kafka client (Go)	DataDog/confluent-kafka-go	Datadog fork of the Confluent Go client, wrapping librdkafka
Stream processing control plane	The Assigner (custom)	Replaces Kafka’s native group coordinator for the Streaming Platform
Schema registry	Kafka Schema Registry (multi-tenant)	Configured for backward compatibility; used in CDC pipelines
Serialisation format	Apache Avro	Used in CDC event pipelines
Change data capture	Debezium	Source connector for Postgres and Cassandra
Data movement	Kafka Connect	With single message transforms for CDC pipelines
Metadata store	FoundationDB	Transactional metadata for exactly-once ingestion in Husky
Blob storage	AWS S3	Target for Husky event store Writers
Analytics target	Apache Iceberg	Target for Postgres-to-analytics CDC pipelines
Compression	Zstandard (zstd)	Compression algorithm for Kafka messages
Kafka operations tooling	kafka-kit (open source, Go)	Covers topicmappr, autothrottle, registry, metricsfetcher
Orchestration	Kubernetes (StatefulSets)	Self-managed, multi-cloud, dozens of clusters
Monitoring	Datadog	Kafka clusters monitored with Datadog’s own platform

Key contributors

Guillaume Bort (Staff Software Engineer, Datadog): Led the Streaming Platform, libstreaming, and the custom Rust Kafka client. February 2025 engineering post
Jamie Alquiza (Staff Software Engineer, Datadog): Creator of kafka-kit; co-lead of early Kafka infrastructure scaling. kafka-kit introduction post, GitHub repo
Balthazar Rouberol (Data Reliability Engineer Team Lead, Datadog): Co-lead of Kafka and Cassandra infrastructure; presented “Running Production Kafka Clusters in Kubernetes” at Kafka Summit London 2019
Emily Chang: Authored the 2019 “Lessons learned from running Kafka at Datadog” post. Engineering blog
Richard Artoul and Cecilia Watt: Introduced the Husky event store and Kafka’s role as its ingestion layer. May 2022 post
Daniel Intskirveli and Cecilia Watt: Documented Husky’s exactly-once ingestion and shard routing. February 2023 post
Sanketh Balakrishna and Andrew Zhang: Designed and published the CDC replication platform. November 2025 post
Kai Zong Khor and William Yu: Redesigned the live process metrics pipeline, achieving a 100x throughput reduction. August 2025 post
Gabriel Reid: Documented configuration distribution via Kafka. June 2025 post
Martin Dickson (Senior Software Engineer, Kafka team): Covered self-managed Kafka on Kubernetes in a 2024 summit talk. Episode

Key takeaways for your own Kafka implementation

Decouple clients from physical topology early. Binding producers and consumers directly to cluster addresses makes failover expensive. Datadog built an abstraction layer so applications reference logical stream names. If you are running multiple clusters, even a lightweight routing layer pays off as the fleet grows.
Message-count lag is an incomplete signal. Whether 10 million lagging messages represents a problem depends entirely on how old they are. Enriching commit metadata with ingestion timestamps and computing wall-clock lag gives operations teams a signal they can actually act on.
Client-side compressed batch size limits are by compressed size, not decompressed size. At high throughput with zstd compression, this can cause decompressed batches to overwhelm consumer memory. If you are running high-compression pipelines at scale, verify how your client handles this and whether you need a custom or patched client.
Segment configuration affects retention on low-throughput topics. The default segment.ms of 7 days can cause data to be retained well beyond the topic-level retention setting on topics that receive infrequent writes. Tuning segment.ms and segment.bytes for low-throughput topics brings actual retention in line with policy.
QoS tiers within a topic reduce blast radius. Rather than isolating workloads entirely into separate topics or clusters, Datadog’s Stream Lanes provide lightweight isolation within a single logical stream. Separating real-time from batch traffic and routing poison pills to a dead-letter lane reduces the operational surface area without requiring full topic duplication.

Sources and further reading

Primary sources

Guillaume Bort, “Achieving relentless Kafka reliability at scale with the Streaming Platform” (February 2025)
Emily Chang, “Lessons learned from running Kafka at Datadog” (June 2019)
Jamie Alquiza, “Introducing Kafka-Kit: Tools for scaling Kafka” (August 2018)
Richard Artoul and Cecilia Watt, “Introducing Husky, Datadog’s third-generation event store” (May 2022)
Daniel Intskirveli and Cecilia Watt, “Husky: Exactly-once ingestion and multi-tenancy at scale” (February 2023)
Sanketh Balakrishna and Andrew Zhang, “Replication redefined: How we built a low-latency, multi-tenant data replication platform” (November 2025)
Kai Zong Khor and William Yu, “Scaling down to speed up: How we improved efficiency of live process metrics by 100x” (August 2025)
Gabriel Reid, “How we scaled fast, reliable configuration distribution to thousands of workload containers” (June 2025)
Balthazar Rouberol, “Running Production Kafka Clusters in Kubernetes” (Kafka Summit London 2019)
Jamie Alquiza and Balthazar Rouberol, “Datadog on Kafka” podcast (May 2020)
Martin Dickson, “Datadog on Stateful Workloads on Kubernetes” (March 2024)
kafka-kit GitHub repository

Try Kpow with your Kafka cluster

If you are monitoring a Kafka cluster at any scale, you can try Kpow free for 30 days. It connects to any Kafka cluster in minutes and deploys via Docker, Helm, or JAR.