
How Cloudflare uses Apache Kafka in production
Cloudflare has been running Apache Kafka in production since 2014. By July 2022, the company had processed over one trillion messages on its primary inter-service message bus, and it was operating 14 distinct Kafka clusters on approximately 330 nodes. At its 2018 peak, the HTTP analytics pipeline alone was ingesting 100 Gbps and 7.5 million messages per second through Kafka. Those numbers reflect more than raw traffic volume: they show how deeply Kafka is woven into Cloudflare's control plane, analytics stack, logging infrastructure, and DNS pipeline.
Company overview
Cloudflare operates one of the largest networks in the world, providing content delivery, DDoS protection, DNS resolution, zero-trust security, and developer platform services to millions of customers. Its network spans hundreds of data centers globally and handles a significant share of internet traffic.
Cloudflare adopted Kafka in 2014, initially to power analytics, DDoS mitigation, logging, and metrics pipelines. As the company scaled and its microservice footprint grew, Kafka took on a broader role as the backbone for inter-service communication across the control plane.
Cloudflare's Kafka use cases
Kafka at Cloudflare serves several distinct workloads, each with its own cluster or pipeline configuration.
Inter-service messaging via the Messagebus cluster
The largest and most visible use is inter-service communication across the control plane. Cloudflare's Application Services team built a general-purpose cluster called "Messagebus" to decouple microservices by propagating resource lifecycle events: the creation, modification, or deletion of any resource is published as a Protobuf-encoded message and consumed by any interested downstream service.
The alert notification system (ANS) is one example built on this pattern. Teams produce to an alert topic with a YAML configuration and a single client import; the system handles delivery to Slack, Google Chat, webhooks, PagerDuty, and email automatically. Communication preferences management across multiple systems also runs on the same bus.
HTTP and DNS analytics
From 2014 onwards, Kafka has been the ingestion layer for HTTP request logs and DNS query logs from Cloudflare's edge. In 2018, the HTTP analytics pipeline was processing an average of 6 million requests per second (peaking at 8 million), with the Kafka topic receiving up to 100 Gbps ingress and 7.5 million messages per second at peak.
Messages in this pipeline were encoded in Cap'n Proto format. Downstream consumers initially wrote aggregated results to PostgreSQL; the pipeline was later migrated to ClickHouse, where 106 Go consumers extract more than 100 fields per message for ingestion.
DDoS mitigation
Traffic telemetry from the edge is distributed to DDoS detection consumers via Kafka, giving the mitigation system a real-time feed of traffic patterns without tight coupling to the log collection infrastructure.
Internal logging pipeline
Service logs across Cloudflare's infrastructure flow through a multi-stage pipeline into Kafka before reaching consumers. As of 2024, this pipeline handles approximately 1 million log lines per second. Kafka buffers logs at two core data centers ("log-a" and "log-b"), which decouples log producers from consumers and lets the pipeline tolerate up to eight hours of consumer outage before there is any risk of data loss.
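To get a feel for what an eight-hour buffer implies, the sketch below turns the published rate into a retention estimate. The average line size (500 bytes) and the replication factor (3) are illustrative assumptions, not Cloudflare figures, and compression is ignored.

```python
def buffer_bytes(lines_per_sec: int, outage_hours: int, avg_line_bytes: int,
                 replication_factor: int = 1) -> int:
    """Bytes of Kafka storage needed to retain `outage_hours` of log lines."""
    seconds = outage_hours * 3600
    return lines_per_sec * seconds * avg_line_bytes * replication_factor

# 1 million lines/s and 8 hours come from the article.
lines_retained = 1_000_000 * 8 * 3600

# avg_line_bytes and replication_factor are assumptions for illustration only.
size = buffer_bytes(1_000_000, 8, avg_line_bytes=500, replication_factor=3)

print(f"{lines_retained:,} lines, ~{size / 1e12:.1f} TB on disk")
```

Under those assumptions the buffer has to hold 28.8 billion lines. The storage figure scales linearly with whatever line size and replication factor actually apply.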
DNS zone builds
DNS record changes submitted via the API trigger PostgreSQL events that are converted into Kafka messages consumed by the Zone Builder. The pipeline supports two consumer types: a full zone build scheduler and a per-record build scheduler. As of 2022, it handles an average of 250 DNS record changes per second, representing 25x growth from initial deployment.
Scale and throughput
Cloudflare's Kafka architecture
Cluster topology
Cloudflare runs 14 Kafka clusters across multiple data centers, deployed on bare metal alongside Kubernetes and databases in its control plane. The clusters are purpose-built: the general-purpose Messagebus cluster handles inter-service communication, while separate clusters serve analytics, logging, DNS, and other high-throughput workloads. Infrastructure is managed with Salt under a GitOps model.
Message format and topic design
The Messagebus cluster standardised on Protocol Buffers (Protobuf) as the message serialization format, chosen over JSON and Apache Avro for its strict typing, forward and backward compatibility guarantees, and multi-language code generation. The team enforces a single Protobuf message type per topic to prevent format incompatibility across consumers.
Schema governance runs through an internal "Messagebus Schema" service that maintains a central repository of all Protobuf schemas, maps team ownership to topics, and uses prototool to detect and flag breaking schema changes before they reach production. Teams are notified via chat integration when schema changes are proposed.
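The kind of check that sits behind a breaking-change gate can be sketched in a few lines. This is a deliberately simplified rule set written for illustration, not prototool's actual logic: a schema is modelled as a map from Protobuf field number to name and type, and any removed, retyped, or renamed field is flagged. The schemas and field names are hypothetical.

```python
from typing import Dict, List, Tuple

# field number -> (field name, field type)
Schema = Dict[int, Tuple[str, str]]

def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """Return human-readable problems; an empty list means the change looks safe."""
    problems = []
    for number, (name, ftype) in sorted(old.items()):
        if number not in new:
            problems.append(f"field {number} ({name}) was removed")
            continue
        new_name, new_type = new[number]
        if new_type != ftype:
            problems.append(f"field {number} ({name}) changed type {ftype} -> {new_type}")
        if new_name != name:
            problems.append(f"field {number} renamed {name} -> {new_name}")
    return problems  # fields added under new numbers are not flagged

# Hypothetical schemas for illustration.
old = {1: ("zone_id", "string"), 2: ("action", "string")}
compatible = {1: ("zone_id", "string"), 2: ("action", "string"), 3: ("actor", "string")}
incompatible = {1: ("zone_id", "int64")}

print(breaking_changes(old, compatible))
print(breaking_changes(old, incompatible))
```

Adding a field under a new number passes, while retyping or dropping an existing field is reported before it can reach consumers.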
Messagebus-Client library
The Application Services team built an internal Go library called Messagebus-Client that wraps the Shopify Sarama Kafka client. It provides opinionated configuration defaults, automatic mTLS certificate rotation, and built-in Prometheus metrics exposure for both producers and consumers. Default producer metrics include message delivery success and failure rates; default consumer metrics include consumption success and error rates. This library is the standard entry point for Cloudflare's internal teams producing and consuming on the Messagebus cluster.
Connector Framework
For workloads that require moving data from a source system into Kafka, or from Kafka into a sink, Cloudflare built a Connector Framework on top of Kafka connectors. New connector services are generated from Cookiecutter templates and configured entirely via environment variables: readers, writers, and transformations are declared without writing custom integration code. Each generated connector includes built-in metrics and alerting. The framework supports Kafka and Cloudflare's internal Quicksilver key-value store as both source and sink targets.
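A connector configured purely through environment variables might look roughly like the sketch below. The variable names (`CONNECTOR_READER`, `CONNECTOR_WRITER`, `CONNECTOR_TRANSFORMS`) and the transformation names are hypothetical, chosen only to show the pattern; the real framework's configuration keys are not described in the source.

```python
import os
from typing import Callable, Dict, Mapping

# Hypothetical transformation registry.
TRANSFORMS: Dict[str, Callable[[dict], dict]] = {
    "lowercase_keys": lambda msg: {k.lower(): v for k, v in msg.items()},
    "drop_nulls": lambda msg: {k: v for k, v in msg.items() if v is not None},
}

def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Build the connector config from environment variables (names are assumed)."""
    raw = env.get("CONNECTOR_TRANSFORMS", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    unknown = [n for n in names if n not in TRANSFORMS]
    if unknown:
        raise ValueError(f"unknown transformations: {unknown}")
    return {
        "reader": env["CONNECTOR_READER"],   # raises KeyError if unset
        "writer": env["CONNECTOR_WRITER"],
        "transforms": names,
    }

def apply_transforms(config: dict, message: dict) -> dict:
    """Run the declared transformations in order."""
    for name in config["transforms"]:
        message = TRANSFORMS[name](message)
    return message

config = load_config({
    "CONNECTOR_READER": "kafka",
    "CONNECTOR_WRITER": "quicksilver",
    "CONNECTOR_TRANSFORMS": "lowercase_keys,drop_nulls",
})
result = apply_transforms(config, {"Zone": "example.com", "TTL": None})
print(result)
```

Declaring the pipeline as data keeps each connector service identical in code and different only in configuration, which is what makes template generation practical.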
Logging pipeline architecture
The internal logging pipeline follows a defined path: services emit logs via stdout/stderr, which are captured by systemd-journald and forwarded through syslog-ng. syslog-ng adds metadata (hostname, data center) and forwards logs to two core data centers. At those core data centers, logs are buffered in Kafka before downstream consumers pull from them.
Kafka partitioning in this pipeline uses a composite key of host name and service name. This guarantees message ordering within a partition, but creates uneven partition sizes because different machines generate different log volumes. As of 2024, the team was working on improving partition balancing at scale.
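The trade-off is easy to reproduce. In the sketch below, a composite host/service key is hashed to pick a partition, so every line from one source stays in order on one partition, but a single noisy host dominates whichever partition it lands on. CRC32 is a stand-in for the client's real partitioner hash, and the hosts, services, and volumes are invented for illustration.

```python
import zlib
from collections import Counter

def partition_for(host: str, service: str, num_partitions: int) -> int:
    """Map a composite host/service key to a partition (CRC32 as a stand-in hash)."""
    key = f"{host}/{service}".encode()
    return zlib.crc32(key) % num_partitions

# Hypothetical log volumes in lines per second.
volumes = {
    ("edge-1", "nginx"): 9000,
    ("edge-2", "nginx"): 300,
    ("core-1", "api"): 50,
    ("core-1", "cron"): 5,
}

load = Counter()
for (host, service), lines in volumes.items():
    load[partition_for(host, service, 4)] += lines

for partition in range(4):
    print(f"partition {partition}: {load[partition]} lines/s")
```

Because the key never changes for a given source, no amount of extra partitions spreads that source's traffic. Balancing requires changing the key or accepting weaker ordering.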
DNS pipeline architecture
The DNS zone build pipeline connects PostgreSQL to the Zone Builder via Kafka. When a DNS record is changed via the API, a PostgreSQL trigger emits an event that is converted to a Kafka message. The Zone Builder runs two consumer types: a full zone build scheduler for complete zone rebuilds, and a per-record build scheduler for incremental changes. The per-record approach reduced build times on large zones (1 million records) from approximately 34 seconds to 6-8 milliseconds, a speedup of at least 4,250x (34 s / 8 ms = 4,250).
Special techniques and engineering innovations
Offset-based consumer health checks
One of the more notable techniques Cloudflare has published is its approach to detecting unhealthy Kafka consumers. A standard connectivity check confirms that a consumer is connected to the broker and can read from the topic, but it does not confirm that the consumer is actually processing messages. Consumers can deadlock after a rebalancing event and remain connected while making no progress.
Cloudflare solved this by implementing a Kubernetes liveness probe that tracks committed offset advancement rather than connection state. The probe fails if a consumer's committed offset has not changed since the previous check interval, indicating a stalled consumer. An in-memory map tracks offsets per partition. When a rebalance occurs, the system rebuilds the map using Sarama's rebalance signal to ensure each replica only monitors its currently assigned partitions, preventing false failures during reassignment.
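The logic of such a probe can be sketched independently of any Kafka client. The class below is a minimal illustration, not Cloudflare's implementation: `OffsetLivenessProbe`, `on_rebalance`, and `check` are hypothetical names, and the caller is assumed to supply committed and latest offsets per partition. One extra guard is an assumption added here: a partition with nothing new to read counts as healthy, so that idle consumers are not restarted.

```python
from typing import Dict, Iterable, Set

class OffsetLivenessProbe:
    """Fails when a committed offset stops advancing while messages are waiting."""

    def __init__(self) -> None:
        self._assigned: Set[int] = set()
        self._previous: Dict[int, int] = {}  # partition -> committed offset at last check

    def on_rebalance(self, assigned: Iterable[int]) -> None:
        # Rebuild the map so this replica only tracks its current partitions.
        self._assigned = set(assigned)
        self._previous = {}

    def check(self, committed: Dict[int, int], latest: Dict[int, int]) -> bool:
        healthy = True
        for partition in self._assigned:
            current = committed[partition]
            previous = self._previous.get(partition)
            caught_up = current >= latest[partition]  # assumption: idle is healthy
            if previous is not None and current == previous and not caught_up:
                healthy = False  # no progress since the last interval
            self._previous[partition] = current
        return healthy

probe = OffsetLivenessProbe()
probe.on_rebalance([0, 1])
print(probe.check({0: 10, 1: 5}, {0: 20, 1: 5}))  # first reading after assignment
print(probe.check({0: 10, 1: 5}, {0: 30, 1: 5}))  # partition 0 stalled with lag
```

A liveness endpoint would return a failure status whenever `check` returns `False`, leaving the restart to Kubernetes. Clearing the map on rebalance means the first check after reassignment always passes, which is what prevents false failures during partition movement.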
Compression with Zstandard
Cloudflare's first compression choice was Snappy, which achieved a 2.25x reduction on the HTTP log topic and 2.6x on DNS logs. After benchmarking multiple algorithms, the team switched to Zstandard (zstd), which reached a 4.5x compression ratio on the HTTP requests topic. The switch saved hundreds of gigabits of network bandwidth and reduced storage footprint substantially.
During the compression work, the team discovered that Kafka was recompressing already-compressed message batches due to offset validation logic in the Sarama Go client library. They diagnosed the issue and contributed a fix upstream to the Sarama project.
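Benchmarking codecs against representative payloads, as the team did before choosing zstd, is straightforward to set up. The sketch below uses the codecs in Python's standard library (zlib, bz2, lzma) as stand-ins, since Snappy and Zstandard require third-party packages. The sample records are synthetic and far more regular than real traffic, so the ratios it prints say nothing about Cloudflare's data.

```python
import bz2
import json
import lzma
import zlib
from typing import Dict

def compression_ratios(payload: bytes) -> Dict[str, float]:
    """Return original_size / compressed_size for each codec."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    return {name: len(payload) / len(fn(payload)) for name, fn in codecs.items()}

# Synthetic stand-in for a batch of HTTP log records (hypothetical fields).
sample = "\n".join(
    json.dumps({
        "host": f"site{i % 50}.example.com",
        "status": 200,
        "path": "/index.html",
        "bytes": 1024 + i,
    })
    for i in range(2000)
).encode()

ratios = compression_ratios(sample)
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ratio:.2f}x")
```

A real comparison would also time compression and decompression, because a higher ratio that costs too much CPU on producers or consumers may not be a net win.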
Batch consumption for throughput spikes
For high-throughput consumers, particularly those that breached their SLOs during traffic spikes, Cloudflare implemented batch consumption: a consumer reads a batch of Kafka messages, transforms them, and dispatches them together, rather than handling one message at a time. This approach absorbs production spikes without accumulating consumer lag.
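The core of batch consumption is grouping messages before the expensive downstream call. The following sketch shows a size-based batcher with hypothetical `batches` and `consume` helpers. A production consumer would typically also flush on a timer and commit offsets only after a batch is dispatched successfully, neither of which is modelled here.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def batches(messages: Iterable[T], max_size: int) -> Iterator[List[T]]:
    """Group a message stream into lists of at most `max_size` items."""
    batch: List[T] = []
    for message in messages:
        batch.append(message)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the partial final batch

def consume(messages: Iterable[T], max_size: int,
            transform: Callable[[T], U],
            dispatch: Callable[[List[U]], None]) -> int:
    """Transform and dispatch per batch; returns the number of dispatch calls."""
    calls = 0
    for batch in batches(messages, max_size):
        dispatch([transform(m) for m in batch])  # one downstream write per batch
        calls += 1
    return calls

written: List[List[int]] = []
calls = consume(range(10), max_size=4, transform=lambda m: m * 2, dispatch=written.append)
print(calls, written)
```

Ten messages with a batch size of four produce three downstream writes instead of ten, which is where the throughput gain comes from when the per-write overhead dominates.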
OpenTelemetry integration in the SDK
After pandemic-era traffic growth caused a critical consumer to breach its SLO, the team added OpenTelemetry tracing directly into the Messagebus-Client SDK. This gave them visibility across the full processing chain and allowed them to pinpoint the specific bottlenecks contributing to the breach: bucket writes and Kafka reads.
Operating Kafka at scale
Cloudflare runs its Kafka infrastructure on self-managed bare metal. Key operational practices across its 14 clusters include:
- Monitoring: Every producer and consumer gets automatically generated Prometheus dashboards covering production rate, consumption rate, and partition skew. The primary alert for all consumers is high consumer lag, generated without manual configuration.
- Observability: OpenTelemetry tracing is embedded in the Messagebus-Client SDK, enabling distributed traces that span from message production through broker delivery to consumer processing.
- Fault tolerance targets: The logging pipeline Kafka buffer is sized to tolerate up to eight hours of total consumer outage before any data loss risk arises.
- Developer experience: Internal wikis document adoption patterns. A ChatOps integration surfaces schema change notifications in team channels. The Connector Framework's Cookiecutter templates generate production-ready connector services with minimal input. A newer tool called Gaia enables push-button provisioning of new Kafka-integrated services aligned with Cloudflare's internal best practices.
- Schema governance: The Messagebus Schema service acts as the schema registry for the Messagebus cluster. prototool checks for breaking changes before any schema reaches production.
- Infrastructure provisioning: Salt with a GitOps model manages broker configuration and cluster provisioning.
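The consumer lag behind the primary alert in the monitoring practice above is a simple quantity: for each partition, the log end offset minus the consumer group's committed offset. A minimal sketch of the calculation follows. The offsets and the alert threshold are hypothetical values, and the function names are illustrative.

```python
from typing import Dict

def consumer_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: messages written but not yet committed by the group."""
    return {p: max(log_end[p] - committed.get(p, 0), 0) for p in log_end}

def should_alert(lag: Dict[int, int], threshold: int) -> bool:
    """Fire when total lag across partitions exceeds the threshold."""
    return sum(lag.values()) > threshold

# Hypothetical offsets for a three-partition topic.
log_end = {0: 1000, 1: 1500, 2: 900}
committed = {0: 990, 1: 700, 2: 900}

lag = consumer_lag(log_end, committed)
print(lag, should_alert(lag, threshold=500))
```

Here partition 1 accounts for almost all of the lag, which is also how the same numbers reveal partition skew on a dashboard.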
Challenges and how they solved them
Schema coupling despite using a message bus
Early Kafka adoption at Cloudflare used JSON. Despite using Kafka as a decoupling layer, teams remained tightly coupled at the schema level because there was no enforcement of message contracts. The fix was a migration to Protobuf with a strict one-type-per-topic policy, combined with a client-side validation library that validates messages before they are published. This shifted schema errors from runtime consumer failures to producer-side build failures.
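A one-type-per-topic rule enforced on the producer side can be illustrated as follows. Dataclasses stand in for generated Protobuf classes, and the topic names, message types, and `ValidatingProducer` wrapper are all hypothetical. Note that this Python sketch rejects a bad message at publish time. In a statically typed client, such as Go with generated Protobuf types, the same mismatch can surface at build time instead.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Type

@dataclass
class ZoneCreated:  # stand-in for a generated Protobuf message class
    zone_id: str
    name: str

@dataclass
class AlertFired:  # stand-in for a second message type
    alert_id: str

# Hypothetical registry: exactly one message type per topic.
TOPIC_TYPES: Dict[str, Type] = {"zone.created": ZoneCreated, "alerts": AlertFired}

class ValidatingProducer:
    def __init__(self) -> None:
        self.sent: List[Tuple[str, object]] = []  # stands in for the Kafka send call

    def publish(self, topic: str, message: object) -> None:
        expected = TOPIC_TYPES.get(topic)
        if expected is None:
            raise KeyError(f"topic {topic!r} has no registered schema")
        if type(message) is not expected:
            raise TypeError(
                f"topic {topic!r} only accepts {expected.__name__}, "
                f"got {type(message).__name__}"
            )
        self.sent.append((topic, message))

producer = ValidatingProducer()
producer.publish("zone.created", ZoneCreated("z1", "example.com"))
print(len(producer.sent))
```

Rejecting the message before it is sent means a bad payload never reaches the topic, so consumers only ever see the type they were built to decode.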
Silent consumer failures after rebalancing
Consumers that deadlocked after partition reassignment appeared healthy under basic connectivity checks while making no progress on their partitions. The team addressed this with the offset-comparison liveness probe described above, using Kubernetes to restart stalled consumers automatically. The approach reduced false positives and caught real deadlock conditions that had previously gone undetected.
Disk I/O contention from multiple lagging consumers
Multiple lagging consumers generating random read patterns on spinning disks caused significant I/O contention. The resolution was to migrate those clusters to SSDs. This was an infrastructure cost decision driven by consumer access patterns. The compression work came first and had already reduced the storage and bandwidth footprint, which kept the SSD migration cheaper than it would otherwise have been.
SLO breach during pandemic-era traffic spike
A critical consumer fell behind its SLO when traffic surged during the COVID-19 pandemic. The team added OpenTelemetry tracing to the SDK, used it to identify the specific bottlenecks (bucket writes and Kafka reads), and resolved the lag by implementing batch consumption. Keeping the tracing in the SDK guards against similar blind spots in future.
Client library over-configuration
Early versions of Messagebus-Client exposed too many configuration knobs. Teams that modified defaults occasionally introduced unintended side effects, including configuration changes that affected other parts of the pipeline. The lesson was to make opinionated defaults the path of least resistance and require explicit justification for deviations. Matt Boyle and Andrea Medda presented this as one of the core tradeoffs at QCon London 2023: the right balance between a highly configurable SDK and a standardised one depends on how much you are willing to invest in documentation and support.
Full tech stack
Key contributors
Key takeaways for your own Kafka implementation
- Invest in schema governance before you scale. Cloudflare ran into tight coupling between consumers despite using Kafka as a decoupling layer, because JSON offered no enforcement. Migrating to Protobuf with a strict one-type-per-topic rule and a central schema registry resolved it, but required a migration effort that would have been cheaper to do earlier.
- Connection health is not the same as processing health. A Kubernetes liveness probe that checks TCP connectivity to the broker tells you very little about whether a consumer is making progress. Cloudflare's offset-comparison approach, which checks whether committed offsets advance between intervals, catches deadlocked consumers that a simple connection check would miss entirely.
- Benchmark compression algorithms on your actual messages. Snappy was a reasonable starting point, but benchmarking on Cloudflare's actual HTTP log messages showed that zstd delivered twice the compression ratio (4.5x versus 2.25x). The specific encoding format (Cap'n Proto, Protobuf, JSON) and message content distribution matter more than general benchmarks.
- Opinionated client libraries reduce operational surface area, but documentation is the cost. The Messagebus-Client library made Kafka accessible to teams without deep Kafka expertise, and the default metrics meant every consumer was observable from day one. The tradeoff was that over-configurability in early versions caused unintended side effects. Cloudflare's recommendation at QCon 2023 was to provide opinionated defaults and invest heavily in documentation so teams don't deviate without understanding the implications.
- Separate clusters by access pattern and criticality. Cloudflare's 14 clusters are not all running the same workload, and the separation matters operationally. The analytics cluster at 106 brokers (2018) was optimised for throughput; the Messagebus cluster was optimised for reliability and schema governance. Running all workloads on a single cluster would have created both operational and performance coupling.
Sources and further reading