
How Reddit uses Apache Kafka in production
Reddit runs one of the largest self-managed Kafka fleets in the industry: more than 500 brokers, over a petabyte of live data, and tens of millions of messages per second. In early 2026, the platform engineering team migrated that entire fleet from Amazon EC2 to Kubernetes with zero client-side connection-string changes, while the site remained live throughout.
Apache Kafka sits at the centre of Reddit's event infrastructure, carrying clickstream data, real-time safety signals, vote integrity checks, and ads pipeline events across a platform with hundreds of millions of monthly active users.
Company overview
Reddit is a social news aggregation and discussion platform where users submit content, vote posts and comments up or down, and organise communities around shared interests. At the time of its 2019 infrastructure overhaul, Reddit reported 330 million monthly active users, 12 million posts per month, and 2 billion votes per month.
Reddit adopted Kafka as part of a broader infrastructure rationalisation that began around 2017, when the engineering team started migrating services from manually managed Puppet configurations to Terraform. Kafka and ZooKeeper were among the first services converted. The driver was operational complexity: broker replacement required a 14-step manual runbook, ZooKeeper and Kafka were provisioned with ad hoc scripts, and the team had no reliable way to reproduce or version configuration changes.
Reddit's Kafka use cases
Clickstream and event ingestion
Kafka's longest-standing role at Reddit is as the backbone for clickstream and tracking-event ingestion. Events from web and mobile clients flow through load balancers to stateless application servers and into Kafka. From there, a stream-processing layer routes events to downstream consumers including Hive/S3 archival, BigQuery, and real-time safety systems. This pipeline was in place by 2019 and has remained the foundation of Reddit's event infrastructure since.
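The shape of an event on this path can be sketched in a few lines. Reddit has not published its event schema, so the field names and topic name below are hypothetical; the sketch only illustrates the kind of payload an application server would hand to a Kafka producer.

```python
import json
import time

def make_click_event(user_id: str, device_id: str, action: str) -> bytes:
    """Build a hypothetical clickstream event as an application server
    might before producing it to Kafka. Field names are illustrative;
    Reddit has not published its actual event schema."""
    event = {
        "user_id": user_id,
        "device_id": device_id,
        "action": action,
        "ts_ms": int(time.time() * 1000),
    }
    return json.dumps(event).encode("utf-8")

# An app server would produce this to a topic such as "clickstream-events"
# (hypothetical name), from where the stream-processing layer fans events
# out to Hive/S3 archival, BigQuery, and real-time safety consumers.
payload = make_click_event("u123", "d456", "post_view")
```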
Real-time event QA
Reddit's data engineering team, led by Hannah Hagen and Paul Kiernan, built an internal QA tool, backed by ksqlDB and Kafka Streams, that streams events directly from the production Kafka pipeline. Engineers deploying a new app version can filter events by user ID, device ID, or interaction type and receive feedback within seconds on whether events are firing correctly. Before this tool, developers had to wait approximately two hours for events to appear in the data warehouse.
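The core of such a tool is a filtered subscription over the live stream. Reddit has not published the queries it uses, so the ksqlDB query in the comment and the stream/field names below are hypothetical; the Python function is a minimal stand-in for the same filter applied to an event iterable.

```python
# In ksqlDB, the tool's filter would correspond to a push query along
# these (hypothetical) lines:
#   SELECT * FROM events_stream
#   WHERE user_id = 'u123' AND action = 'click' EMIT CHANGES;
# A pure-Python equivalent over an in-memory event stream:

def filter_events(events, user_id=None, device_id=None, action=None):
    """Yield only the events matching the engineer's filter criteria."""
    for ev in events:
        if user_id and ev.get("user_id") != user_id:
            continue
        if device_id and ev.get("device_id") != device_id:
            continue
        if action and ev.get("action") != action:
            continue
        yield ev

sample = [
    {"user_id": "u123", "device_id": "d1", "action": "click"},
    {"user_id": "u999", "device_id": "d2", "action": "view"},
]
matched = list(filter_events(sample, user_id="u123"))
```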
Vote-manipulation detection
Reddit's Anti-Evil Engineering team replaced hourly Airflow batch jobs with a ksqlDB streaming pipeline for detecting vote manipulation and derogatory content. Derek Hsieh presented the approach at Kafka Summit Americas 2021. Detection latency dropped from hours to minutes, with Kafka providing the continuous event stream that ksqlDB queries against.
Real-time safety actioning
Reddit's real-time safety applications team uses Kafka as both the ingestion and egress layer for detecting and actioning policy-violating content. The first version of this system (REV1), presented at Flink Forward Global 2021 by Frédérique Mittelstaedt, Bhavani Balasubramanyam, and Vignesh Raja, used Kafka topics to carry post and comment events into Flink Stateful Functions, which dispatched messages to remote Python service endpoints executing Lua-based rules. Action messages then flowed to Kafka egress topics, where Safety Actioning Workers consumed them and applied the corresponding platform action.
The successor system, REV2, published by Reddit Safety Engineering in October 2023, extended this architecture with per-action-type Kafka egress topics, Protobuf-formatted action messages, GitHub-backed rule versioning with S3 distribution, and a time-travel feature implemented via consumer group offset resets.
Ads data pipeline
Reddit operates a dedicated Kafka engineering team for ads data infrastructure. This team builds and maintains Kafka consumers using Flink and Spark as the downstream processing layer, serving all of Reddit's Ads engineering teams.
Scale and throughput
At the time of Reddit's 2025-2026 EC2-to-Kubernetes migration, the Kafka fleet comprised:
- more than 500 brokers
- over a petabyte of live data
- tens of millions of messages per second
- more than 250 client services connecting to the fleet
The 2019 platform context provides additional background: at the time Krishnan Chandra presented Reddit's Terraform-based Kafka management, Reddit had 330 million monthly active users generating 12 million posts and 2 billion votes per month. No public breakdown of topic count, partition count, or consumer group count has been published by Reddit's engineering team.
Reddit's Kafka architecture
Deployment model
Reddit's Kafka deployment has changed substantially since 2019. The original model was self-managed Kafka on Amazon EC2, provisioned through Terraform modules and Puppet, with operators applying changes directly from their workstations. By 2025, the team had migrated the entire fleet to Kubernetes, managed via the Strimzi operator.
Control plane
Reddit originally used ZooKeeper for Kafka metadata management. After completing the data-plane migration to Kubernetes, the team migrated the control plane from ZooKeeper to KRaft as a separate, sequenced phase.
Safety pipeline (REV2)
The REV2 architecture follows this sequence:
- Kafka ingress topics receive post and comment events from upstream producers.
- Flink Stateful Functions processes messages and dispatches them to remote Python service endpoints.
- The Python endpoints execute Lua-based safety rules.
- Rule outputs are written as Protobuf-formatted messages to per-action-type Kafka egress topics.
- Safety Actioning Workers consume egress topics and apply the corresponding platform action.
- Rule configurations are stored in S3, pushed from GitHub CI. A Kubernetes sidecar in each worker pod polls S3 for updates, enabling rule changes without a full service redeploy.
- The time-travel feature resets consumer group offsets to replay historical content through newly published rules.
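The consume-and-apply step at the end of this sequence can be sketched as a dispatch loop. The topic names, action names, and handler functions below are hypothetical; in the real system the messages are Protobuf and each action type arrives on its own egress topic.

```python
# Minimal sketch of a Safety Actioning Worker's dispatch loop. Because
# each action type has a dedicated egress topic, the topic alone
# determines which platform action to apply.

applied = []  # records applied actions, standing in for real side effects

def remove_post(msg):
    applied.append(("remove_post", msg["target_id"]))

def ban_user(msg):
    applied.append(("ban_user", msg["target_id"]))

# Hypothetical topic-to-handler mapping; one worker per egress topic.
HANDLERS = {
    "safety-actions-remove-post": remove_post,
    "safety-actions-ban-user": ban_user,
}

def consume(topic, messages):
    """Stand-in for a Kafka consumer poll loop on a single egress topic."""
    handler = HANDLERS[topic]
    for msg in messages:
        handler(msg)

consume("safety-actions-remove-post", [{"target_id": "t3_abc"}])
```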
Schema management
Reddit uses Protobuf as the serialisation format for action messages on REV2 Kafka egress topics. Each action type has its own dedicated topic with a Protobuf schema enforced per topic.
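What per-topic schema enforcement buys can be illustrated with a small validation sketch. REV2 enforces this through Protobuf serialisation; here a required-field registry stands in for the compiled schema, and the topic names and fields are hypothetical.

```python
# Per-topic schema enforcement, sketched with a required-field registry
# standing in for Protobuf message types. Topic and field names are
# illustrative, not Reddit's.

TOPIC_SCHEMAS = {
    "safety-actions-remove-post": {"target_id", "rule_id"},
    "safety-actions-ban-user": {"target_id", "rule_id", "duration_days"},
}

def validate_for_topic(topic: str, message: dict) -> None:
    """Reject a message whose fields don't match the topic's schema,
    mirroring what Protobuf serialisation enforces per topic."""
    required = TOPIC_SCHEMAS[topic]
    missing = required - message.keys()
    if missing:
        raise ValueError(f"{topic}: missing fields {sorted(missing)}")

# A well-formed message for its topic passes; a malformed one raises.
validate_for_topic("safety-actions-remove-post",
                   {"target_id": "t3_abc", "rule_id": "r42"})
```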
Clickstream pipeline
As of 2019, the event ingestion path ran: mobile and web clients to load balancers, to stateless application servers, into Kafka, then to a stream-processing layer, with archival to Hive/S3 and analytics writes to BigQuery.
Stream processing
Reddit uses ksqlDB and Kafka Streams for the event QA tool and vote-manipulation detection, and Apache Flink Stateful Functions for the REV2 safety system.
Special techniques and engineering innovations
DNS facade for zero-downtime broker migration
Before moving any broker from EC2 to Kubernetes, Reddit introduced an intermediate DNS layer: client applications connected to infrastructure-controlled DNS records rather than directly to broker hostnames. This decoupled client connection strings from physical broker addresses, meaning no client application needed to update its configuration during the migration. The change was effectively transparent to more than 250 client services.
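The mechanism can be simulated in a few lines: clients hold stable facade names, while a table the platform team controls maps those names to physical addresses. All hostnames below are hypothetical, and real Kafka clients also learn broker addresses via advertised listeners, which the facade must cover as well.

```python
# Simulation of the DNS facade. Clients never change CLIENT_BOOTSTRAP;
# only the infrastructure-controlled dns_table changes as brokers move.
# All names and addresses are hypothetical.

CLIENT_BOOTSTRAP = ["kafka-1.infra.example:9092", "kafka-2.infra.example:9092"]

dns_table = {  # controlled by the platform team, not by clients
    "kafka-1.infra.example": "ec2-10-0-1-5.internal",
    "kafka-2.infra.example": "ec2-10-0-1-6.internal",
}

def resolve(bootstrap):
    """Resolve facade names to the current physical broker addresses."""
    resolved = []
    for entry in bootstrap:
        host, port = entry.split(":")
        resolved.append(f"{dns_table[host]}:{port}")
    return resolved

before = resolve(CLIENT_BOOTSTRAP)
# Migrate broker 1 to Kubernetes: only the DNS table changes.
dns_table["kafka-1.infra.example"] = "kafka-1.kafka.svc.cluster.local"
after = resolve(CLIENT_BOOTSTRAP)
```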
Forking Strimzi for a mixed-cluster transition
Strimzi does not natively support hybrid clusters spanning EC2 and Kubernetes. Reddit forked the operator and introduced targeted modifications:
- Plaintext inter-broker listeners accessible from both environments
- Shared ZooKeeper metadata management during the transition period
- Consistent Cruise Control configuration across EC2 and Kubernetes brokers
After all EC2 brokers were decommissioned, Reddit removed the fork and switched to the standard Strimzi operator.
Broker ID space reorganisation
Strimzi requires low broker IDs for the brokers it manages. Because the existing EC2 brokers occupied the low-ID space, Reddit resolved this by temporarily doubling the cluster: new high-numbered EC2 brokers were added, Cruise Control drained partitions from the original low-numbered brokers, those brokers were decommissioned, and Kubernetes brokers were assigned the freed low IDs.
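The reorganisation steps above can be walked through with a toy model of the ID space. The broker counts are illustrative, not Reddit's actual numbers, and the Cruise Control drain is noted but not modelled.

```python
# Toy model of the broker ID reorganisation. Counts are illustrative.
cluster = {i: "ec2" for i in range(3)}          # original brokers: IDs 0-2

# 1. Temporarily double the cluster with high-numbered EC2 brokers.
cluster.update({i: "ec2" for i in range(100, 103)})

# 2. Cruise Control drains partitions off IDs 0-2 (not modelled here),
#    after which the original low-ID brokers are decommissioned.
for i in range(3):
    del cluster[i]

# 3. Kubernetes brokers take over the freed low IDs, which Strimzi needs.
cluster.update({i: "k8s" for i in range(3)})
```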
Reversibility as a design constraint
Each phase of the migration was required to be fully reversible before the team proceeded to the next. Reddit treated non-reversibility as a hard blocker. This constrained the migration timeline but ensured that any unexpected failure could be rolled back without data loss.
KRaft migration sequenced separately
Rather than migrating ZooKeeper to KRaft concurrently with the data-plane move to Kubernetes, Reddit treated them as separate phases: the data-plane migration was completed and stabilised first, and the ZooKeeper-to-KRaft migration followed as a distinct step.
Per-action-type Protobuf topics
In REV2, each safety action type has its own dedicated Kafka egress topic with a Protobuf schema enforced per topic. This gives the team granular per-action monitoring and tighter schema contracts, compared to a single multiplexed output topic.
Consumer group offset resets for retroactive rule application
REV2 implements a time-travel capability by resetting Kafka consumer group offsets. When a new safety rule is published, operators can replay historical content through it by rewinding the consumer group to an earlier offset, without needing a separate historical data store.
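The replay can be simulated with an in-memory topic log: the retained log is the history store, and rewinding the group offset re-runs every retained event through the new rule. The rule and event contents below are hypothetical; operationally this maps to a consumer-group offset reset, for example via kafka-consumer-groups.sh --reset-offsets.

```python
# In-memory simulation of time travel via offset reset. The retained
# topic log doubles as the historical store; no separate archive needed.
topic_log = ["comment-1", "comment-2", "comment-3"]  # retained events
offset = 3  # the consumer group has already processed everything

def new_rule(event):
    """Hypothetical newly published rule: flag events containing '2'."""
    return "2" in event

def replay_from(start_offset):
    """Re-consume the log from an earlier offset through the new rule."""
    flagged = []
    for event in topic_log[start_offset:]:
        if new_rule(event):
            flagged.append(event)
    return flagged

# Rewind to offset 0 so the new rule also sees historical content.
flagged = replay_from(0)
```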
Operating Kafka at scale
Terraform-managed lifecycle (pre-Kubernetes)
By 2019, Reddit's Terraform module had reduced broker replacement from a 14-step manual runbook to three operations: terraform taint the broker node, run plan, run apply. The new broker automatically registered with ZooKeeper and updated DNS records. A parallel module managed ZooKeeper ensemble provisioning using the same pattern: it accepted cluster size and instance type as inputs and handled node discovery, ZooKeeper registration, and security group export automatically.
Kubernetes operator management (post-migration)
After the EC2-to-Kubernetes migration, Kafka clusters are managed declaratively via the Strimzi Kubernetes operator. Configuration changes and upgrades are applied through manifest updates rather than ad hoc commands.
Cruise Control for partition rebalancing
Reddit uses Cruise Control to automate partition reassignment during scaling events and broker replacements. During the 2025-2026 migration, Cruise Control was used to incrementally drain partitions from EC2 brokers before decommissioning them.
Challenges and how they solved them
Strimzi's lack of support for mixed EC2/Kubernetes clusters
Problem: The team needed brokers in both EC2 and Kubernetes to coexist in a single cluster during migration so that partitions could be drained gradually rather than cut over all at once.
Root cause: Strimzi is designed to manage Kafka clusters entirely within Kubernetes and has no built-in support for inter-broker communication across EC2 and Kubernetes networking boundaries.
Solution: Reddit forked Strimzi and added support for plaintext inter-broker listeners accessible from both environments, shared ZooKeeper metadata management, and unified Cruise Control configuration. The fork ran for several weeks during migration, then was retired.
Outcome: Zero-downtime migration. No client application changed its connection string.
Client services tightly coupled to broker hostnames
Problem: 250+ client services held connection strings pointing directly to EC2 broker hostnames. Changing all of them simultaneously was not feasible.
Root cause: No abstraction layer existed between client configuration and physical broker addresses.
Solution: Reddit introduced a DNS facade layer before any broker movement. Clients were redirected to stable DNS names under Reddit's control, while physical broker addresses changed independently.
Outcome: All 250+ client services continued operating without modification throughout the migration.
Broker ID space exhaustion
Problem: Strimzi requires low broker IDs for the brokers it manages. All low IDs were already in use by existing EC2 brokers.
Root cause: Kafka broker IDs are fixed at creation time and cannot be reassigned to a running broker.
Solution: Reddit doubled the cluster by provisioning new high-numbered EC2 brokers, used Cruise Control to drain all partitions from the original low-numbered brokers, decommissioned those brokers, then assigned the freed IDs to Kubernetes brokers.
Outcome: No disruption to producers or consumers during the ID space reorganisation.
Slow and error-prone safety rule deployment (REV1)
Problem: REV1 ran each safety rule as an independent process on Python 2.7 and raw EC2. There was no version control for rules, no staging environment, and no rule history. Deployment required engineers to SSH directly into hosts.
Root cause: REV1 was not built for operational scale. Rules were written and deployed outside of standard engineering workflows.
Solution: REV2 moved rules to GitHub, pushed configurations to S3 via CI, and introduced a Kubernetes sidecar that polls for updates, reducing rule deployment time by approximately 90% and eliminating direct host access.
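The sidecar's behaviour reduces to a small poll loop: compare a version marker on the rule bundle (for example an S3 ETag) against the last one seen, and reload only on change. The fetch functions below are hypothetical stand-ins for S3 calls such as boto3's head_object and get_object.

```python
# Sketch of the rule-reloading sidecar's poll loop. The in-memory store
# stands in for the S3 bucket; fetch_version/fetch_rules are hypothetical
# stand-ins for real S3 head/get requests.

store = {"etag": "v1", "rules": ["rule-a"]}

def fetch_version():
    return store["etag"]

def fetch_rules():
    return list(store["rules"])

class RuleSidecar:
    def __init__(self):
        self.etag = None
        self.rules = []

    def poll_once(self):
        """One iteration of the loop; run on a timer inside the pod."""
        remote = fetch_version()
        if remote != self.etag:
            self.rules = fetch_rules()   # hot-reload, no redeploy
            self.etag = remote

sidecar = RuleSidecar()
sidecar.poll_once()                      # initial load of v1
store["etag"], store["rules"] = "v2", ["rule-a", "rule-b"]
sidecar.poll_once()                      # picks up the pushed update
```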
Outcome: Rules can be updated and rolled back without a full service redeploy, with a complete audit trail in version control.
Two-hour instrumentation feedback lag
Problem: Engineers deploying new app versions had to wait approximately two hours before events appeared in the data warehouse, making it difficult to verify that instrumentation was correct immediately after a release.
Root cause: The data warehouse pipeline introduced significant processing latency between event production and queryable state.
Solution: Hannah Hagen and Paul Kiernan built a ksqlDB-backed web application that filters events directly from the live Kafka pipeline. Feedback is available in seconds rather than hours.
Outcome: Engineers can verify event instrumentation in near real-time during or immediately after a deployment.
Full tech stack
- Apache Kafka (self-managed, 500+ brokers)
- Kubernetes with the Strimzi operator (current deployment model)
- Amazon EC2, Terraform, and Puppet (pre-migration)
- ZooKeeper, later KRaft (control plane)
- Cruise Control (partition rebalancing)
- Apache Flink Stateful Functions (REV2 safety pipeline)
- ksqlDB and Kafka Streams (event QA, vote-manipulation detection)
- Protobuf (action-message serialisation)
- S3, Hive, and BigQuery (archival and analytics)
- GitHub CI (safety-rule versioning and distribution)
Key contributors
- Krishnan Chandra — Terraform-based Kafka management (2019)
- Hannah Hagen and Paul Kiernan — real-time event QA tool
- Derek Hsieh — vote-manipulation detection pipeline
- Frédérique Mittelstaedt, Bhavani Balasubramanyam, and Vignesh Raja — REV1 safety system
Key takeaways for your own Kafka implementation
- Introduce a DNS abstraction before migrating brokers. Reddit's experience shows that decoupling client connection strings from physical broker addresses is what makes large-scale broker migrations feasible without client-side changes. If you are planning a hosting migration, build the DNS facade first.
- Sequence control-plane and data-plane migrations separately. Reddit migrated brokers from EC2 to Kubernetes first, stabilised the deployment, and only then migrated from ZooKeeper to KRaft. Attempting both simultaneously increases the blast radius if something goes wrong.
- Treat reversibility as a hard constraint, not a nice-to-have. Reddit required each migration phase to be fully reversible before proceeding. This slows the timeline but makes it possible to halt and recover if an unexpected issue arises mid-migration.
- Use per-type topics with enforced schemas for action pipelines. The REV2 architecture uses a distinct Kafka egress topic per action type with a Protobuf schema. If you are building a Kafka-backed action or command pipeline, this pattern makes monitoring and schema evolution substantially more manageable than a multiplexed topic.
- Kafka consumer group offsets are a reprocessing primitive. Reddit's time-travel feature in REV2 demonstrates that resetting consumer group offsets to replay historical data is a practical operational technique, not only a disaster-recovery option. If your pipeline needs to apply new logic to historical events, this approach avoids the need for a separate historical data store.
Sources and further reading
- Hannah Hagen and Paul Kiernan (Reddit) — Live Event Debugging With ksqlDB at Reddit, Kafka Summit Americas 2021
- Hannah Hagen and Paul Kiernan (Reddit) — Kafka Summit Americas 2021 slide deck
- Derek Hsieh (Reddit) — Catching Vote Manipulation at Reddit, Kafka Summit Americas 2021
- Frédérique Mittelstaedt, Bhavani Balasubramanyam, Vignesh Raja (Reddit) — Keeping Redditors Safe With Stateful Functions, Flink Forward Global 2021
- InfoQ — Reddit's REV2: Replacing a Batch Safety System with a Real-Time One Using Flink and Kafka, October 2023
- Krishnan Chandra (Reddit) — Reddit's Large Scale Migration to Terraform, HashiCorp, January 2019
- Reddit Engineering — Swapping the Engine Mid-Flight: How We Moved a Petabyte of Kafka Data from EC2 to Kubernetes, r/RedditEng, January 2026
- ByteByteGo — summary of Reddit's Kafka migration
- Red Hat Developer — Kafka Monthly Digest, February 2026
If you are monitoring a Kafka deployment at the scale described in this article, Kpow provides real-time visibility across brokers, topics, consumer groups, and schema registries. You can connect it to any Kafka cluster and try it free for 30 days.