How Cloudflare uses Apache Kafka in production

Factor House
May 10th, 2026

Cloudflare has been running Apache Kafka in production since 2014. By July 2022, the company had pushed more than one trillion messages through its primary inter-service message bus alone, and was operating 14 distinct Kafka clusters totalling approximately 330 nodes. At peak, the HTTP analytics pipeline was ingesting 100 Gbps and 7.5 million messages per second through Kafka. Those numbers reflect more than raw traffic volume: they reflect how deeply Kafka is woven into Cloudflare's control plane, analytics stack, logging infrastructure, and DNS pipeline.

Company overview

Cloudflare operates one of the largest networks in the world, providing content delivery, DDoS protection, DNS resolution, zero-trust security, and developer platform services to millions of customers. Its network spans hundreds of data centers globally and handles a significant share of internet traffic.

Kafka was first adopted in 2014, initially to power analytics, DDoS mitigation, logging, and metrics pipelines. As the company scaled and its microservice footprint grew, Kafka took on a broader role as the backbone for inter-service communication across the control plane.

| Date | Event |
| --- | --- |
| 2014 | Kafka first adopted for analytics, DDoS mitigation, logging, and metrics |
| 2018-03 | Zstandard (zstd) compression adopted on the HTTP requests topic; HTTP analytics pipeline migrated from PostgreSQL to ClickHouse; analytics cluster at 106 brokers |
| 2022-05 | DNS per-record build pipeline launched; handling 250 DNS record changes per second, a 25x increase from initial deployment |
| 2022-07 | 1 trillion inter-service messages milestone published; 14 clusters, approximately 330 nodes |
| 2023-01 | Intelligent offset-based consumer restart mechanism published |
| 2023-03 | Matt Boyle and Andrea Medda present "Tales of Kafka @Cloudflare" at QCon London |
| 2024-01 | Logging pipeline overview published; approximately 1 million log lines per second; OpenTelemetry Logs migration planned |

Cloudflare's Kafka use cases

Kafka at Cloudflare serves several distinct workloads, each with its own cluster or pipeline configuration.

Inter-service messaging via the Messagebus cluster

The largest and most visible use is inter-service communication across the control plane. Cloudflare's Application Services team built a general-purpose cluster called "Messagebus" to decouple microservices by propagating resource lifecycle events: the creation, modification, or deletion of any resource is published as a Protobuf-encoded message and consumed by any interested downstream service.

The alert notification system (ANS) is one example built on this pattern. Teams produce to an alert topic with a YAML configuration and a single client import; the system handles delivery to Slack, Google Chat, webhooks, PagerDuty, and email automatically. Management of communication preferences across multiple systems runs on the same bus.

HTTP and DNS analytics

From 2014 onwards, Kafka has been the ingestion layer for HTTP request logs and DNS query logs from Cloudflare's edge. In 2018, the HTTP analytics pipeline was processing an average of 6 million requests per second (peaking at 8 million), with the Kafka topic receiving up to 100 Gbps ingress and 7.5 million messages per second at peak.

Messages in this pipeline were encoded in Cap'n Proto format. Downstream consumers initially wrote aggregated results to PostgreSQL; the pipeline was later migrated to ClickHouse, where 106 Go consumers extract more than 100 fields per message for ingestion.

DDoS mitigation

Traffic telemetry from the edge is distributed to DDoS detection consumers via Kafka, giving the mitigation system a real-time feed of traffic patterns without tight coupling to the log collection infrastructure.

Internal logging pipeline

Service logs across Cloudflare's infrastructure flow through a multi-stage pipeline into Kafka before reaching consumers. As of 2024, this pipeline handles approximately 1 million log lines per second. Kafka buffers logs at two core data centers ("log-a" and "log-b"), providing decoupling, fault tolerance, and up to eight hours of consumer outage tolerance before data loss risk arises.

DNS zone builds

DNS record changes submitted via the API trigger PostgreSQL events that are converted into Kafka messages consumed by the Zone Builder. The pipeline supports two consumer types: a full zone build scheduler and a per-record build scheduler. As of 2022, it handles an average of 250 DNS record changes per second, representing 25x growth from initial deployment.

Scale and throughput

| Metric | Figure | Pipeline / date |
| --- | --- | --- |
| Cumulative inter-service messages | Over 1 trillion | Messagebus cluster, ~8 years to July 2022 |
| Total clusters | 14 | Across multiple data centers, July 2022 |
| Total broker nodes | ~330 | All clusters, July 2022 |
| HTTP requests topic peak ingress | 100 Gbps / 7.5M messages/sec | Analytics cluster, 2018 |
| HTTP analytics cluster brokers | 106 (replication factor: 3) | Analytics cluster, 2018 |
| Log lines per second (logging pipeline) | ~1 million | Logging cluster, 2024 |
| DNS record changes per second | 250 (25x growth from launch) | DNS pipeline, 2022 |
| Minimum replication factor | 3 | All clusters |

Cloudflare's Kafka architecture

Cluster topology

Cloudflare runs 14 Kafka clusters across multiple data centers, deployed on bare metal alongside Kubernetes and databases in its control plane. The clusters are purpose-built: the general-purpose Messagebus cluster handles inter-service communication, while separate clusters serve analytics, logging, DNS, and other high-throughput workloads. Infrastructure is managed with Salt under a GitOps model.

Message format and topic design

The Messagebus cluster standardised on Protocol Buffers (Protobuf) as the message serialization format, chosen over JSON and Apache Avro for its strict typing, forward and backward compatibility guarantees, and multi-language code generation. The team enforces a single Protobuf message type per topic to prevent format incompatibility across consumers.
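
As a rough sketch of how a one-type-per-topic rule can be enforced on the producer side, the check below uses protobuf reflection against a topic registry. The registry, topic, and type names are invented for illustration; Cloudflare's actual validation library is internal.

```go
package schemaguard

import (
	"fmt"

	"google.golang.org/protobuf/proto"
)

// topicTypes maps each topic to its single permitted Protobuf message type.
// Entries here are hypothetical examples.
var topicTypes = map[string]string{
	"resource-lifecycle": "messagebus.ResourceEvent",
}

// validate rejects a publish whose payload type does not match the topic's
// registered type, turning schema mistakes into producer-side errors
// instead of runtime consumer failures.
func validate(topic string, msg proto.Message) error {
	want, ok := topicTypes[topic]
	if !ok {
		return fmt.Errorf("topic %q has no registered message type", topic)
	}
	got := string(msg.ProtoReflect().Descriptor().FullName())
	if got != want {
		return fmt.Errorf("topic %q accepts %s, got %s", topic, want, got)
	}
	return nil
}
```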

Schema governance runs through an internal "Messagebus Schema" service that maintains a central repository of all Protobuf schemas, maps team ownership to topics, and uses prototool to detect and flag breaking schema changes before they reach production. Teams are notified via chat integration when schema changes are proposed.

Messagebus-Client library

The Application Services team built an internal Go library called Messagebus-Client that wraps the Shopify Sarama Kafka client. It provides opinionated configuration defaults, automatic mTLS certificate rotation, and built-in Prometheus metrics exposure for both producers and consumers. Default producer metrics include message delivery success and failure rates; default consumer metrics include consumption success and error rates. This library is the standard entry point for Cloudflare's internal teams producing and consuming on the Messagebus cluster.
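
The library is internal, but the pattern is straightforward to sketch: a constructor that hides Sarama configuration behind opinionated defaults and registers delivery metrics automatically. Everything below (names, defaults, metric labels) is illustrative rather than Cloudflare's actual code.

```go
package messagebus

import (
	"github.com/Shopify/sarama"
	"github.com/prometheus/client_golang/prometheus"
)

// deliveries counts produced messages by topic and outcome, so every
// producer is observable without per-team metric wiring.
var deliveries = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "messagebus_producer_messages_total",
		Help: "Messages produced, labelled by topic and outcome.",
	},
	[]string{"topic", "outcome"},
)

func init() { prometheus.MustRegister(deliveries) }

// Producer hides Sarama configuration behind opinionated defaults.
type Producer struct {
	topic string
	sp    sarama.SyncProducer
}

func NewProducer(brokers []string, topic string) (*Producer, error) {
	cfg := sarama.NewConfig()
	// Opinionated defaults: acks from all in-sync replicas, bounded
	// retries, successes returned so SendMessage reports delivery.
	cfg.Producer.RequiredAcks = sarama.WaitForAll
	cfg.Producer.Retry.Max = 5
	cfg.Producer.Return.Successes = true
	// (The real library also wires mTLS certificates and rotation here.)
	sp, err := sarama.NewSyncProducer(brokers, cfg)
	if err != nil {
		return nil, err
	}
	return &Producer{topic: topic, sp: sp}, nil
}

// Send publishes one message and records success/failure automatically.
func (p *Producer) Send(value []byte) error {
	_, _, err := p.sp.SendMessage(&sarama.ProducerMessage{
		Topic: p.topic,
		Value: sarama.ByteEncoder(value),
	})
	if err != nil {
		deliveries.WithLabelValues(p.topic, "failure").Inc()
		return err
	}
	deliveries.WithLabelValues(p.topic, "success").Inc()
	return nil
}
```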

Connector Framework

For workloads that require moving data from a source system into Kafka, or from Kafka into a sink, Cloudflare built a Connector Framework on top of Kafka connectors. New connector services are generated from Cookiecutter templates and configured entirely via environment variables: readers, writers, and transformations are declared without writing custom integration code. Each generated connector includes built-in metrics and alerting. The framework supports Kafka and Cloudflare's internal Quicksilver key-value store as both source and sink targets.
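
A loose sketch of the environment-driven configuration idea follows; the variable names and URI-style addresses are invented for the example and are not Cloudflare's actual contract.

```go
package connector

import (
	"errors"
	"os"
	"strings"
)

// Config is assembled entirely from the environment; no integration code
// is written for a new connector service.
type Config struct {
	Reader          string   // e.g. "kafka://topic-a" (hypothetical scheme)
	Writer          string   // e.g. "quicksilver://namespace"
	Transformations []string // ordered, declarative transformation names
}

// FromEnv reads the connector's wiring from environment variables.
func FromEnv() (Config, error) {
	cfg := Config{
		Reader: os.Getenv("READER"),
		Writer: os.Getenv("WRITER"),
	}
	if t := os.Getenv("TRANSFORMATIONS"); t != "" {
		cfg.Transformations = strings.Split(t, ",")
	}
	if cfg.Reader == "" || cfg.Writer == "" {
		return Config{}, errors.New("READER and WRITER must be set")
	}
	return cfg, nil
}
```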

Logging pipeline architecture

The internal logging pipeline follows a defined path: services emit logs via stdout/stderr, which are captured by systemd-journald and forwarded through syslog-ng. syslog-ng adds metadata (hostname, data center) and forwards logs to two core data centers. At those core data centers, logs are buffered in Kafka before downstream consumers pull from them.

Kafka partitioning in this pipeline uses a composite key of host name and service name. This guarantees message ordering within a partition, but creates uneven partition sizes because different machines generate different log volumes. As of 2024, the team was working on improving partition balancing at scale.
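
A minimal sketch of that keying scheme with Sarama, whose default hash partitioner routes equal keys to the same partition (the topic name and key format are placeholders):

```go
package logpipe

import "github.com/Shopify/sarama"

// logMessage keys each log line by host and service. Sarama's default
// hash partitioner sends equal keys to the same partition, preserving
// per-host/per-service ordering -- at the cost of uneven partition sizes
// when some hosts log far more than others.
func logMessage(hostname, service string, line []byte) *sarama.ProducerMessage {
	return &sarama.ProducerMessage{
		Topic: "service-logs", // placeholder topic name
		Key:   sarama.StringEncoder(hostname + "/" + service),
		Value: sarama.ByteEncoder(line),
	}
}
```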

DNS pipeline architecture

The DNS zone build pipeline connects PostgreSQL to the Zone Builder via Kafka. When a DNS record is changed via the API, a PostgreSQL trigger emits an event that is converted to a Kafka message. The Zone Builder runs two consumer types: a full zone build scheduler for complete zone rebuilds, and a per-record build scheduler for incremental changes. The per-record approach reduced build times on large zones (1 million records) from approximately 34 seconds to 6-8 milliseconds, a 4,250x improvement.

Special techniques and engineering innovations

Offset-based consumer health checks

One of the more notable techniques Cloudflare published is their approach to detecting unhealthy Kafka consumers. A standard connectivity check confirms that a consumer is connected to the broker and can read from the topic, but does not confirm that the consumer is actively processing messages. Consumers can deadlock after a rebalancing event and remain connected while making no progress.

Cloudflare solved this by implementing a Kubernetes liveness probe that tracks committed offset advancement rather than connection state. The probe fails if a consumer's committed offset has not changed since the previous check interval, indicating a stalled consumer. An in-memory map tracks offsets per partition. When a rebalance occurs, the system rebuilds the map using Sarama's rebalance signal to ensure each replica only monitors its currently assigned partitions, preventing false failures during reassignment.
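
A simplified sketch of the idea, not Cloudflare's actual implementation: track the latest marked offset per assigned partition, reset the map on rebalance, and expose the interval-to-interval comparison as an HTTP liveness endpoint. A production version would also need to treat legitimately idle partitions (no new messages to consume) as healthy, for example by comparing against broker high-water marks.

```go
package health

import (
	"net/http"
	"sync"
)

// offsetTracker compares committed-offset progress between probe intervals.
// Reset must be called from the consumer group's rebalance hook (Sarama's
// Setup callback) so only currently assigned partitions are monitored.
type offsetTracker struct {
	mu       sync.Mutex
	current  map[int32]int64 // latest offset marked per assigned partition
	previous map[int32]int64 // snapshot taken at the previous probe
}

func (t *offsetTracker) Reset(assigned []int32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.current = make(map[int32]int64, len(assigned))
	for _, p := range assigned {
		t.current[p] = -1
	}
	t.previous = nil // skip comparison on the first probe after a rebalance
}

// Record is called after each message is processed and its offset committed.
func (t *offsetTracker) Record(partition int32, offset int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.current[partition] = offset
}

// healthy reports whether any assigned partition advanced since the last
// probe, then snapshots the current offsets for the next comparison.
func (t *offsetTracker) healthy() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	advanced := t.previous == nil // first probe after a rebalance passes
	for p, off := range t.current {
		if !advanced && off > t.previous[p] {
			advanced = true
		}
	}
	t.previous = make(map[int32]int64, len(t.current))
	for p, off := range t.current {
		t.previous[p] = off
	}
	return advanced
}

// Handler is wired to the Kubernetes livenessProbe's httpGet path.
func (t *offsetTracker) Handler(w http.ResponseWriter, _ *http.Request) {
	if t.healthy() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable) // Kubernetes restarts the pod
}
```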

Compression with Zstandard

Cloudflare's first compression choice was Snappy, which achieved a 2.25x reduction on the HTTP log topic and 2.6x on DNS logs. After benchmarking multiple algorithms, the team switched to Zstandard (zstd), which reached a 4.5x compression ratio on the HTTP requests topic. The switch saved hundreds of gigabits of network bandwidth and reduced storage footprint substantially.
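
Kafka has supported zstd natively since version 2.1. With a modern broker, enabling it from a Sarama producer is a small configuration change; the broker address and topic below are placeholders.

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0                     // zstd requires Kafka 2.1+
	cfg.Producer.Compression = sarama.CompressionZSTD // batch-level zstd
	cfg.Producer.Return.Successes = true              // required by SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	if _, _, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "http_requests", // placeholder topic name
		Value: sarama.ByteEncoder("example payload"),
	}); err != nil {
		log.Fatal(err)
	}
}
```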

During the compression work, the team discovered that Kafka was recompressing already-compressed message batches due to offset validation logic in the Sarama Go client library. They diagnosed the issue and contributed a fix upstream to the Sarama project.

Batch consumption for throughput spikes

For high-throughput consumers, particularly those that hit SLO breaches during traffic spikes, Cloudflare implemented batch consumption: consumers process multiple Kafka messages simultaneously before transformation and dispatch, rather than one message at a time. This approach absorbs production spikes without accumulating consumer lag.
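
A sketch of the pattern using a Sarama consumer-group handler: messages are drained into a slice and handed to a single batch-processing function, with offsets marked only after the whole batch succeeds. Batch size, flush interval, and the process callback are illustrative choices, not Cloudflare's actual parameters.

```go
package batcher

import (
	"time"

	"github.com/Shopify/sarama"
)

const (
	batchSize     = 500
	flushInterval = time.Second
)

// handler implements sarama.ConsumerGroupHandler with batched processing.
type handler struct {
	process func([]*sarama.ConsumerMessage) error // batch transform + dispatch
}

func (h handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	batch := make([]*sarama.ConsumerMessage, 0, batchSize)
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		if err := h.process(batch); err != nil {
			return err
		}
		// Mark only after the whole batch succeeds, so a crash replays it.
		sess.MarkMessage(batch[len(batch)-1], "")
		batch = batch[:0]
		return nil
	}

	for {
		select {
		case msg, ok := <-claim.Messages():
			if !ok {
				return flush()
			}
			batch = append(batch, msg)
			if len(batch) >= batchSize {
				if err := flush(); err != nil {
					return err
				}
			}
		case <-ticker.C:
			if err := flush(); err != nil {
				return err
			}
		}
	}
}
```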

OpenTelemetry integration in the SDK

After pandemic-era traffic growth caused a critical consumer to breach its SLO, the team added OpenTelemetry tracing directly into the Messagebus-Client SDK. This gave them visibility across the full processing chain and allowed them to pinpoint the specific bottlenecks contributing to the breach: bucket writes and Kafka reads.
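
A minimal sketch of what SDK-level instrumentation of a produce call can look like with the OpenTelemetry Go API; the tracer name and attribute are illustrative, and a full implementation would also propagate trace context through message headers so consumer-side spans join the same trace.

```go
package traced

import (
	"context"

	"github.com/Shopify/sarama"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// produceTraced wraps a Kafka produce in a span so slow sends show up
// in the end-to-end trace rather than as anonymous consumer lag.
func produceTraced(ctx context.Context, p sarama.SyncProducer, msg *sarama.ProducerMessage) error {
	ctx, span := otel.Tracer("messagebus-client").Start(ctx, "kafka.produce")
	defer span.End()
	span.SetAttributes(attribute.String("messaging.destination", msg.Topic))

	_, _, err := p.SendMessage(msg)
	if err != nil {
		span.RecordError(err)
	}
	return err
}
```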

Operating Kafka at scale

Cloudflare runs its Kafka infrastructure on self-managed bare metal. Key operational practices across its 14 clusters include:

Monitoring: Every producer and consumer gets automatically generated Prometheus dashboards covering production rate, consumption rate, and partition skew. The primary alert for all consumers is high consumer lag, generated without manual configuration.

Observability: OpenTelemetry tracing is embedded in the Messagebus-Client SDK, enabling distributed traces that span from message production through broker delivery to consumer processing.

Fault tolerance targets: The logging pipeline Kafka buffer is sized to tolerate up to eight hours of total consumer outage before any data loss risk arises.

Developer experience: Internal wikis document adoption patterns. A ChatOps integration surfaces schema change notifications in team channels. The Connector Framework's Cookiecutter templates generate production-ready connector services with minimal input. A newer tool called Gaia enables push-button provisioning of new Kafka-integrated services aligned with Cloudflare's internal best practices.

Schema governance: The Messagebus Schema service acts as the schema registry for the Messagebus cluster. prototool checks for breaking changes before any schema reaches production.

Infrastructure provisioning: Salt with a GitOps model manages broker configuration and cluster provisioning.

Challenges and how they solved them

Schema coupling despite using a message bus

Early Kafka adoption at Cloudflare used JSON. Despite using Kafka as a decoupling layer, teams remained tightly coupled at the schema level because there was no enforcement of message contracts. The fix was a migration to Protobuf with a strict one-type-per-topic policy, combined with a client-side validation library that validates messages before they are published. This shifted schema errors from runtime consumer failures to producer-side build failures.

Silent consumer failures after rebalancing

Consumers that deadlocked after partition reassignment appeared healthy under basic connectivity checks while making no progress on their partitions. The team addressed this with the offset-comparison liveness probe described above, using Kubernetes to restart stalled consumers automatically. The approach reduced false positives and caught real deadlock conditions that had previously gone undetected.

Disk I/O contention from multiple lagging consumers

Multiple lagging consumers generating random read patterns on spinning disks caused significant I/O contention. The resolution was to migrate those clusters to SSDs. This was an infrastructure cost decision driven by consumer access patterns, and it was preceded by the compression work, which reduced the storage and bandwidth footprint that would otherwise have made the SSD migration more expensive.

SLO breach during pandemic-era traffic spike

A critical consumer fell behind its SLO when traffic surged during the COVID-19 pandemic. The team used OpenTelemetry tracing to identify the specific bottlenecks (bucket writes and Kafka reads), resolved the lag with batch consumption, and baked the observability into the SDK to prevent similar blind spots.

Client library over-configuration

Early versions of Messagebus-Client exposed too many configuration knobs. Teams that modified defaults occasionally introduced unintended side effects, including configuration changes that affected other parts of the pipeline. The lesson was to make opinionated defaults the path of least resistance and require explicit justification for deviations. Matt Boyle and Andrea Medda presented this as one of the core tradeoffs at QCon London 2023: the right balance between a highly configurable SDK and a standardised one depends on how much you are willing to invest in documentation and support.

Full tech stack

| Category | Technology | Role |
| --- | --- | --- |
| Message broker | Apache Kafka | Central message broker across 14 clusters: inter-service bus, analytics, logging, DNS, DDoS telemetry |
| Message serialization | Protocol Buffers (Protobuf) | Standard message format on the Messagebus cluster; strict typing and cross-language code generation |
| Message serialization (analytics) | Cap'n Proto | HTTP log encoding in the analytics Kafka pipeline (2018) |
| Kafka client library | Shopify Sarama | Go Kafka client wrapped by Messagebus-Client |
| Application language | Go | Primary language for all Kafka services, connectors, and the Messagebus-Client SDK |
| Application language | Rust | Protobuf schema code generation |
| Container orchestration | Kubernetes | Runs Kafka consumers; liveness probes power the offset-based consumer health check |
| Metrics | Prometheus | Auto-generated metrics for every Kafka producer and consumer via Messagebus-Client |
| Dashboards | Grafana | Visualisation of production rate, consumption rate, and partition skew per topic |
| Distributed tracing | OpenTelemetry / OpenTracing | Embedded in Messagebus-Client SDK for end-to-end trace visibility across producers and consumers |
| Analytics sink | ClickHouse | Receives HTTP log data extracted by Kafka consumers; replaced PostgreSQL for analytics aggregation |
| Compression | Zstandard (zstd) | Message compression on high-throughput topics; 4.5x ratio on the HTTP requests topic |
| Schema governance | prototool | Breaking change detection in the Messagebus Schema registry |
| Infrastructure provisioning | Salt | Broker configuration management under a GitOps model |
| Service scaffolding | Cookiecutter | Template engine for generating new Connector Framework service scaffolding |
| Internal service provisioning | Gaia | Push-button creation of new Kafka-integrated services according to Cloudflare's best practices |
| Log forwarding | syslog-ng | Log collection stage feeding the Kafka logging pipeline; adds hostname and data center metadata |
| Log capture | systemd-journald | Captures service stdout/stderr and feeds syslog-ng |
| DNS source | PostgreSQL | Emits change triggers that are converted into Kafka messages for the DNS Zone Builder |
| Internal key-value store (connector sink) | Quicksilver | Cloudflare's internal distributed key-value store; used as a Connector Framework sink target |

Key contributors

| Name | Role | Contribution |
| --- | --- | --- |
| Matt Boyle | Engineering Manager, Application Services | Authored the 1 trillion messages post; co-presenter at QCon London 2023 |
| Andrea Medda | Senior Systems Engineer, Cloudflare | Co-authored the intelligent consumer restart post; co-presenter at QCon London 2023 |
| Chris Shepherd | Engineer, Cloudflare | Co-authored the intelligent consumer restart post |
| Ivan Babrou | Engineer, Cloudflare | Authored the Kafka compression post; led the zstd evaluation and Sarama upstream patch |
| Alex Bocharov | Engineer, Cloudflare | Authored the HTTP analytics pipeline migration post |
| Alex Fattouche | Engineer, Cloudflare | Authored the DNS build speed improvement post describing the Kafka-backed zone build pipeline |
| Colin Douch | Engineer, Cloudflare | Authored the logging pipeline overview post |

Key takeaways for your own Kafka implementation

  • Invest in schema governance before you scale. Cloudflare ran into tight coupling between consumers despite using Kafka as a decoupling layer, because JSON offered no enforcement. Migrating to Protobuf with a strict one-type-per-topic rule and a central schema registry resolved it, but required a migration effort that would have been cheaper to do earlier.
  • Connection health is not the same as processing health. A Kubernetes liveness probe that checks TCP connectivity to the broker tells you very little about whether a consumer is making progress. Cloudflare's offset-comparison approach, which checks whether committed offsets advance between intervals, catches deadlocked consumers that a simple connection check would miss entirely.
  • Benchmark compression algorithms against your actual message shape. Snappy was a reasonable starting point, but benchmarking on Cloudflare's actual HTTP log messages showed that zstd delivered twice the compression ratio. The specific encoding format (Cap'n Proto, Protobuf, JSON) and message content distribution matter more than general benchmarks.
  • Opinionated client libraries reduce operational surface area, but documentation is the cost. The Messagebus-Client library made Kafka accessible to teams without deep Kafka expertise, and the default metrics meant every consumer was observable from day one. The tradeoff was that over-configurability in early versions caused unintended side effects. Cloudflare's recommendation at QCon 2023 was to provide opinionated defaults and invest heavily in documentation so teams don't deviate without understanding the implications.
  • Separate clusters by access pattern and criticality. Cloudflare's 14 clusters are not all running the same workload, and the separation matters operationally. The analytics cluster at 106 brokers (2018) was optimised for throughput; the Messagebus cluster was optimised for reliability and schema governance. Running all workloads on a single cluster would have created both operational and performance coupling.

Sources and further reading

| # | Source | Author | Date |
| --- | --- | --- | --- |
| 1 | Using Apache Kafka to process 1 trillion inter-service messages | Matt Boyle, Cloudflare Blog | 2022-07 |
| 2 | Squeezing the firehose: getting the most from Kafka compression | Ivan Babrou, Cloudflare Blog | 2018-03 |
| 3 | HTTP analytics for 6M requests per second using ClickHouse | Alex Bocharov, Cloudflare Blog | 2018-03 |
| 4 | Tales of Kafka at Cloudflare: lessons learnt on the way to 1 trillion messages | Matt Boyle, Andrea Medda, InfoQ | 2023 |
| 5 | Intelligent, automatic restarts for unhealthy Kafka consumers | Chris Shepherd, Andrea Medda, Cloudflare Blog | 2023-01 |
| 6 | An overview of Cloudflare's logging pipeline | Colin Douch, Cloudflare Blog | 2024-01 |
| 7 | How we improved DNS record build speed by more than 4,000x | Alex Fattouche, Cloudflare Blog | 2022-05 |
| 8 | Tales of Kafka at Cloudflare: Andrea Medda and Matt Boyle at QCon London 2023 | InfoQ editorial, QCon London 2023 | 2023-04 |

If you are running Kafka in production and want better visibility into your cluster, consumer lag, and topic health, Kpow gives you a real-time control plane for Apache Kafka. You can connect it to any Kafka cluster in minutes and try it free for 30 days.