How Reddit uses Apache Kafka in production

Factor House
May 11th, 2026

Reddit runs one of the largest self-managed Kafka fleets in the industry: more than 500 brokers, over a petabyte of live data, and tens of millions of messages per second. In early 2026, the platform engineering team migrated that entire fleet from Amazon EC2 to Kubernetes with zero client-side connection-string changes, while the site remained live throughout.

Apache Kafka sits at the centre of Reddit's event infrastructure, carrying clickstream data, real-time safety signals, vote integrity checks, and ads pipeline events across a platform with hundreds of millions of monthly active users.

Company overview

Reddit is a social news aggregation and discussion platform where users submit content, vote posts and comments up or down, and organise communities around shared interests. At the time of its 2019 infrastructure overhaul, Reddit reported 330 million monthly active users, 12 million posts per month, and 2 billion votes per month.

Reddit adopted Kafka as part of a broader infrastructure rationalisation that began around 2017, when the engineering team started migrating services from manually managed Puppet configurations to Terraform. Kafka and ZooKeeper were among the first services converted. The driver was operational complexity: broker replacement required a 14-step manual runbook, ZooKeeper and Kafka were provisioned with ad hoc scripts, and the team had no reliable way to reproduce or version configuration changes.

  • Pre-2017: Kafka on EC2 with manual Puppet configs; broker replacement required a 14-step runbook
  • 2017: Large-scale migration to Terraform begins; Kafka and ZooKeeper are among the first services converted
  • January 2019: Krishnan Chandra presents Reddit's Terraform-based Kafka management at HashiCorp; broker replacement reduced to three Terraform steps
  • 2021: ksqlDB-backed event QA tool and vote-manipulation detection pipeline presented at Kafka Summit Americas
  • 2021: REV1 real-time safety system (Kafka and Flink Stateful Functions) presented at Flink Forward Global
  • October 2023: REV2 (Rule Execution V2) published: per-action-type Protobuf Kafka topics, GitHub/S3 rule versioning, time-travel via offset resets
  • 2025–2026: Entire Kafka fleet (500+ brokers, over a petabyte of live data) migrated from EC2 to Kubernetes using Strimzi; ZooKeeper-to-KRaft migration follows as a separate phase

Reddit's Kafka use cases

Clickstream and event ingestion

Kafka's longest-standing role at Reddit is as the backbone for clickstream and tracking-event ingestion. Events from web and mobile clients flow through load balancers to stateless application servers and into Kafka. From there, a stream-processing layer routes events to downstream consumers including Hive/S3 archival, BigQuery, and real-time safety systems. This pipeline was in place by 2019 and has remained the foundation of Reddit's event infrastructure since.

Real-time event QA

Reddit's data engineering team, led by Hannah Hagen and Paul Kiernan, built an internal QA tool that streams events directly from the production Kafka pipeline using ksqlDB. Engineers deploying a new app version can filter events by user ID, device ID, or interaction type and receive feedback on whether events are firing correctly within seconds. Before this tool, developers had to wait approximately two hours for events to appear in the data warehouse. The tool is backed by both ksqlDB and Kafka Streams.
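
The filtering step at the heart of the QA tool can be sketched in a few lines. This is an illustration of the idea, not Reddit's actual implementation (which runs ksqlDB push queries against the live pipeline); the field names and sample events are invented.

```python
# Illustrative sketch (not Reddit's code): filter a live event stream by
# user ID, device ID, or interaction type, as the QA tool's ksqlDB push
# query would. `events` stands in for messages consumed from the topic.

def qa_filter(events, user_id=None, device_id=None, interaction=None):
    """Yield only the events matching the engineer's QA filters."""
    for event in events:
        if user_id and event.get("user_id") != user_id:
            continue
        if device_id and event.get("device_id") != device_id:
            continue
        if interaction and event.get("interaction") != interaction:
            continue
        yield event

events = [
    {"user_id": "t2_abc", "device_id": "ios-1", "interaction": "click"},
    {"user_id": "t2_xyz", "device_id": "web-9", "interaction": "view"},
]
matched = list(qa_filter(events, user_id="t2_abc"))
```

Because the filter runs against the stream rather than the warehouse, feedback arrives as events are produced instead of after the two-hour batch lag.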

Vote-manipulation detection

Reddit's Anti-Evil Engineering team replaced hourly Airflow batch jobs with a ksqlDB streaming pipeline for detecting vote manipulation and derogatory content. Derek Hsieh presented the approach at Kafka Summit Americas 2021. Detection latency dropped from hours to minutes, with Kafka providing the continuous event stream that ksqlDB queries against.

Real-time safety actioning

Reddit's real-time safety applications team uses Kafka as both the ingestion and egress layer for detecting and actioning policy-violating content. The first version of this system (REV1), presented at Flink Forward Global 2021 by Frédérique Mittelstaedt, Bhavani Balasubramanyam, and Vignesh Raja, used Kafka topics to carry post and comment events into Flink Stateful Functions, which dispatched messages to remote Python service endpoints executing Lua-based rules. Action messages then flowed to Kafka egress topics, where Safety Actioning Workers consumed them and applied the corresponding platform action.

The successor system, REV2, published by Reddit Safety Engineering in October 2023, extended this architecture with per-action-type Kafka egress topics, Protobuf-formatted action messages, GitHub-backed rule versioning with S3 distribution, and a time-travel feature implemented via consumer group offset resets.

Ads data pipeline

Reddit operates a dedicated Kafka engineering team for ads data infrastructure. This team builds and maintains Kafka consumers using Flink and Spark as the downstream processing layer, serving all of Reddit's Ads engineering teams.

Scale and throughput

At the time of Reddit's 2025–2026 EC2-to-Kubernetes migration, the Kafka fleet comprised:

  • Brokers: 500+
  • Live data held in Kafka: more than 1 petabyte
  • Throughput: tens of millions of messages per second
  • Client services requiring reconfiguration: 250+

The 2019 platform context provides additional background: at the time Krishnan Chandra presented Reddit's Terraform-based Kafka management, Reddit had 330 million monthly active users generating 12 million posts and 2 billion votes per month. No public breakdown of topic count, partition count, or consumer group count has been published by Reddit's engineering team.

Reddit's Kafka architecture

Deployment model

Reddit's Kafka deployment has changed substantially since 2019. The original model was self-managed Kafka on Amazon EC2, provisioned through Terraform modules and Puppet, with operators applying changes directly from their workstations. By 2025, the team had migrated the entire fleet to Kubernetes, managed via the Strimzi operator.

Control plane

Reddit originally used ZooKeeper for Kafka metadata management. After completing the data-plane migration to Kubernetes, the team migrated the control plane from ZooKeeper to KRaft as a separate, sequenced phase.

Safety pipeline (REV2)

The REV2 architecture follows this sequence:

  1. Kafka ingress topics receive post and comment events from upstream producers.
  2. Flink Stateful Functions processes messages and dispatches them to remote Python service endpoints.
  3. The Python endpoints execute Lua-based safety rules.
  4. Rule outputs are written as Protobuf-formatted messages to per-action-type Kafka egress topics.
  5. Safety Actioning Workers consume egress topics and apply the corresponding platform action.
  6. Rule configurations are stored in S3, pushed from GitHub CI. A Kubernetes sidecar in each worker pod polls S3 for updates, enabling rule changes without a full service redeploy.
  7. The time-travel feature resets consumer group offsets to replay historical content through newly published rules.
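
Step 6, the sidecar's polling loop, can be sketched as follows. This is a hedged sketch of the pattern only: FakeStore stands in for S3, and a real sidecar would compare object ETags or versions via an S3 client rather than this in-memory stub.

```python
# Sketch of the rule-update sidecar pattern (step 6): poll an object store
# for a new rule version and reload only when the version changes.
# FakeStore is a stand-in for S3; names here are invented for illustration.

class FakeStore:
    def __init__(self):
        self.version = "v1"
        self.body = "rule: block spam"

    def head(self):
        return self.version

    def get(self):
        return self.body

def poll_once(store, state):
    """One polling cycle: reload rules only if the stored version changed."""
    version = store.head()
    if version != state.get("version"):
        state["version"] = version
        state["rules"] = store.get()
        return True   # reloaded
    return False      # no change

store, state = FakeStore(), {}
first = poll_once(store, state)     # initial load
second = poll_once(store, state)    # nothing changed
store.version, store.body = "v2", "rule: block spam and abuse"
third = poll_once(store, state)     # new version picked up
```

The payoff is that publishing a new rule version to the store is sufficient to update every worker pod, with no redeploy.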

Schema management

Reddit uses Protobuf as the serialisation format for action messages on REV2 Kafka egress topics. Each action type has its own dedicated topic with a Protobuf schema enforced per topic.

Clickstream pipeline

As of 2019, the event ingestion path ran: mobile and web clients to load balancers, to stateless application servers, into Kafka, then to a stream-processing layer, with archival to Hive/S3 and analytics writes to BigQuery.
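
The fan-out at the stream-processing layer can be sketched as below. The sink names are invented for illustration and stand in for the Hive/S3 archival, BigQuery analytics, and safety consumers described above.

```python
# Minimal sketch (assumed names) of the stream-processing fan-out: each
# event consumed from Kafka is delivered to every registered downstream
# sink. Real consumers would each read the topic independently.

def fan_out(events, sinks):
    """Deliver each event to every downstream sink."""
    for event in events:
        for sink in sinks.values():
            sink.append(event)
    return sinks

sinks = {"archive": [], "analytics": [], "safety": []}
fan_out([{"type": "click"}, {"type": "vote"}], sinks)
```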

Stream processing

Reddit uses ksqlDB and Kafka Streams for the event QA tool and vote-manipulation detection, and Apache Flink Stateful Functions for the REV2 safety system.

Special techniques and engineering innovations

DNS facade for zero-downtime broker migration

Before moving any broker from EC2 to Kubernetes, Reddit introduced an intermediate DNS layer: client applications connected to infrastructure-controlled DNS records rather than directly to broker hostnames. This decoupled client connection strings from physical broker addresses, so no client application needed to update its configuration during the migration. For the 250+ client services involved, the change was effectively invisible.
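
The indirection can be illustrated with a toy resolver. All hostnames below are invented for the example; the point is that clients hold stable names while the records behind them are repointed.

```python
# Illustration of the DNS facade: clients keep a stable bootstrap string
# while an infrastructure-controlled mapping (analogous to DNS records)
# decides which physical brokers back it. Hostnames are invented.

# Stable names the client services are configured with:
BOOTSTRAP = ["kafka-1.infra.example:9092", "kafka-2.infra.example:9092"]

# Infrastructure-controlled records:
dns = {
    "kafka-1.infra.example": "ec2-broker-17.example",
    "kafka-2.infra.example": "ec2-broker-42.example",
}

def resolve(bootstrap, records):
    """Map each stable host:port to its current physical address."""
    resolved = []
    for entry in bootstrap:
        host, port = entry.split(":")
        resolved.append(records[host] + ":" + port)
    return resolved

before = resolve(BOOTSTRAP, dns)

# Migration: repoint the records at Kubernetes pods; clients are untouched.
dns["kafka-1.infra.example"] = "kafka-broker-0.kafka.svc.cluster.local"
dns["kafka-2.infra.example"] = "kafka-broker-1.kafka.svc.cluster.local"
after = resolve(BOOTSTRAP, dns)
```

Note that `BOOTSTRAP` never changes: the migration happens entirely on the right-hand side of the mapping.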

Forking Strimzi for a mixed-cluster transition

Strimzi does not natively support hybrid clusters spanning EC2 and Kubernetes. Reddit forked the operator and introduced targeted modifications:

  • Plaintext inter-broker listeners accessible from both environments
  • Shared ZooKeeper metadata management during the transition period
  • Consistent Cruise Control configuration across EC2 and Kubernetes brokers

After all EC2 brokers were decommissioned, Reddit removed the fork and switched to the standard Strimzi operator.

Broker ID space reorganisation

Strimzi requires low broker IDs for the brokers it manages. Because the existing EC2 brokers occupied the low-ID space, Reddit resolved this by temporarily doubling the cluster: new high-numbered EC2 brokers were added, Cruise Control drained partitions from the original low-numbered brokers, those brokers were decommissioned, and Kubernetes brokers were assigned the freed low IDs.
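
The reshuffle can be modelled as a short simulation. The broker counts and IDs below are illustrative only, and the Cruise Control drain is represented by a comment rather than modelled.

```python
# Simulation of the broker ID reshuffle: add high-ID EC2 brokers, drain
# and decommission the low-ID originals, then reuse the freed low IDs for
# Strimzi-managed Kubernetes brokers. IDs and counts are illustrative.

cluster = {i: "ec2-original" for i in range(0, 4)}        # low IDs in use

# 1. Temporarily double the cluster with high-numbered EC2 brokers.
cluster.update({i: "ec2-temporary" for i in range(100, 104)})

# 2. Cruise Control drains partitions off the originals (not modelled),
#    after which the low-ID brokers can be decommissioned safely.
for i in range(0, 4):
    del cluster[i]

# 3. The freed low IDs are assigned to Kubernetes brokers.
cluster.update({i: "kubernetes" for i in range(0, 4)})
```

At the end of step 3, the low-ID space Strimzi requires is occupied by Kubernetes brokers, and the temporary high-ID brokers can be drained and retired in turn.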

Reversibility as a design constraint

Each phase of the migration was required to be fully reversible before the team proceeded to the next. Reddit treated non-reversibility as a hard blocker. This constrained the migration timeline but ensured that any unexpected failure could be rolled back without data loss.

KRaft migration sequenced separately

Rather than migrating ZooKeeper to KRaft concurrently with the data-plane move to Kubernetes, Reddit treated them as separate phases: the data-plane migration was completed and stabilised first, and the ZooKeeper-to-KRaft migration followed as a distinct step.

Per-action-type Protobuf topics

In REV2, each safety action type has its own dedicated Kafka egress topic with a Protobuf schema enforced per topic. This gives the team granular per-action monitoring and tighter schema contracts, compared to a single multiplexed output topic.
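
The routing side of this pattern can be sketched as below. Topic names and required fields are hypothetical, and the dict-based check stands in for the Protobuf schema enforcement the real system uses.

```python
# Sketch of per-action-type egress topics: one topic per action type, with
# a schema check on produce. Names and fields are invented; REV2 enforces
# Protobuf schemas rather than this set-based check.

TOPICS = {
    "remove_post": ("safety.actions.remove_post", {"post_id", "rule_id"}),
    "ban_user":    ("safety.actions.ban_user",    {"user_id", "rule_id"}),
}

def produce(action_type, message, log):
    """Route an action message to its dedicated topic, validating first."""
    topic, required = TOPICS[action_type]
    missing = required - message.keys()
    if missing:
        raise ValueError(f"schema violation on {topic}: missing {missing}")
    log.setdefault(topic, []).append(message)

log = {}
produce("remove_post", {"post_id": "t3_1", "rule_id": "r7"}, log)
```

Because each action type has its own topic, lag, throughput, and error rates can be monitored per action rather than untangled from a multiplexed stream.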

Consumer group offset resets for retroactive rule application

REV2 implements a time-travel capability by resetting Kafka consumer group offsets. When a new safety rule is published, operators can replay historical content through it by rewinding the consumer group to an earlier offset, without needing a separate historical data store.
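
A toy model makes the mechanism concrete. The list below stands in for a Kafka topic's retained records, and the rule is an arbitrary predicate; in production this is a consumer-group offset reset against the real topic.

```python
# Toy model of REV2's time-travel: rewind a caught-up consumer group so
# already-processed records flow through a newly published rule.

topic = [f"comment-{i}" for i in range(10)]   # records retained in Kafka
group_offset = len(topic)                     # group is fully caught up

def replay_from(offset, rule):
    """Re-consume the topic from `offset`, applying `rule` to each record."""
    return [record for record in topic[offset:] if rule(record)]

# A new rule is published; rewinding to offset 0 applies it retroactively.
flagged = replay_from(0, lambda r: r.endswith(("3", "7")))
```

The only precondition is that the records are still within the topic's retention window; no separate historical store is needed.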

Operating Kafka at scale

Terraform-managed lifecycle (pre-Kubernetes)

By 2019, Reddit's Terraform module had reduced broker replacement from a 14-step manual runbook to three operations: run terraform taint on the broker node, then terraform plan, then terraform apply. The new broker automatically registered with ZooKeeper and updated DNS records. A parallel module managed ZooKeeper ensemble provisioning using the same pattern. The module accepted cluster size and instance type as inputs and handled node discovery, ZooKeeper registration, and security group export automatically.

Kubernetes operator management (post-migration)

After the EC2-to-Kubernetes migration, Kafka clusters are managed declaratively via the Strimzi Kubernetes operator. Configuration changes and upgrades are applied through manifest updates rather than ad hoc commands.

Cruise Control for partition rebalancing

Reddit uses Cruise Control to automate partition reassignment during scaling events and broker replacements. During the 2025–2026 migration, Cruise Control was used to incrementally drain partitions from EC2 brokers before decommissioning them.

Challenges and how they solved them

Strimzi's lack of support for mixed EC2/Kubernetes clusters

Problem: The team needed brokers in both EC2 and Kubernetes to coexist in a single cluster during migration so that partitions could be drained gradually rather than cut over all at once.

Root cause: Strimzi is designed to manage Kafka clusters entirely within Kubernetes and has no built-in support for inter-broker communication across EC2 and Kubernetes networking boundaries.

Solution: Reddit forked Strimzi and added support for plaintext inter-broker listeners accessible from both environments, shared ZooKeeper metadata management, and unified Cruise Control configuration. The fork ran for several weeks during migration, then was retired.

Outcome: Zero-downtime migration. No client application changed its connection string.

Client services tightly coupled to broker hostnames

Problem: 250+ client services held connection strings pointing directly to EC2 broker hostnames. Changing all of them simultaneously was not feasible.

Root cause: No abstraction layer existed between client configuration and physical broker addresses.

Solution: Reddit introduced a DNS facade layer before any broker movement. Clients were redirected to stable DNS names under Reddit's control, while physical broker addresses changed independently.

Outcome: All 250+ client services continued operating without modification throughout the migration.

Broker ID space exhaustion

Problem: Strimzi requires low broker IDs for the brokers it manages. All low IDs were already in use by existing EC2 brokers.

Root cause: Kafka broker IDs are fixed at creation time and cannot be reassigned to a running broker.

Solution: Reddit doubled the cluster by provisioning new high-numbered EC2 brokers, used Cruise Control to drain all partitions from the original low-numbered brokers, decommissioned those brokers, then assigned the freed IDs to Kubernetes brokers.

Outcome: No disruption to producers or consumers during the ID space reorganisation.

Slow and error-prone safety rule deployment (REV1)

Problem: REV1 ran each safety rule as an independent process on Python 2.7 and raw EC2. There was no version control for rules, no staging environment, and no rule history. Deployment required engineers to SSH directly into hosts.

Root cause: REV1 was not built for operational scale. Rules were written and deployed outside of standard engineering workflows.

Solution: REV2 moved rules to GitHub, pushed configurations to S3 via CI, and introduced a Kubernetes sidecar that polls for updates, reducing rule deployment time by approximately 90% and eliminating direct host access.

Outcome: Rules can be updated and rolled back without a full service redeploy, with a complete audit trail in version control.

Two-hour instrumentation feedback lag

Problem: Engineers deploying new app versions had to wait approximately two hours before events appeared in the data warehouse, making it difficult to verify that instrumentation was correct immediately after a release.

Root cause: The data warehouse pipeline introduced significant processing latency between event production and queryable state.

Solution: Hannah Hagen and Paul Kiernan built a ksqlDB-backed web application that filters events directly from the live Kafka pipeline. Feedback is available in seconds rather than hours.

Outcome: Engineers can verify event instrumentation in near real-time during or immediately after a deployment.

Full tech stack

  • Message broker: Apache Kafka. Self-managed; migrated from EC2 to Kubernetes in 2025–2026.
  • Cluster management: Strimzi. Kubernetes operator for declarative Kafka management (post-migration).
  • Control plane: KRaft (post-migration), ZooKeeper (pre-migration). ZooKeeper-to-KRaft migration completed as a separate phase after the data-plane migration.
  • Partition rebalancing: Cruise Control. Used during broker replacement and migration partition draining.
  • Stream processing: ksqlDB, Kafka Streams, Apache Flink (Stateful Functions). ksqlDB and Kafka Streams for event QA and vote-manipulation detection; Flink Stateful Functions for the REV2 safety system.
  • Serialisation: Protobuf. Used for REV2 action messages on per-action-type Kafka egress topics.
  • Infrastructure as code: Terraform. Kafka and ZooKeeper provisioning and broker lifecycle management from 2017 onwards (pre-Kubernetes).
  • Configuration management: Puppet. Pre-Terraform; Terraform modules were partially derived from existing Puppet configs.
  • Container orchestration: Kubernetes. Hosts Flink, safety workers, Kafka brokers (post-migration), and rule-update sidecars.
  • Cloud: Amazon Web Services (EC2, S3). EC2 was the original broker host; S3 stores REV2 rule configurations and event archives.
  • Rule language: Lua. Safety rules in REV2, executed by the remote Python endpoints dispatched from Flink Stateful Functions.
  • Remote function language: Python. REV2 remote function endpoints; Python 2.7 used in REV1 (deprecated).
  • CI and version control: GitHub. Version control for REV2 safety rules; CI pipeline pushes rule configs to S3.
  • Batch processing (legacy): Apache Airflow. Replaced by Kafka/ksqlDB streaming for safety detection.
  • Data warehouse: BigQuery. Downstream analytics store fed from the Kafka event pipeline.
  • Archival: Apache Hive, Amazon S3. Archival store for event data processed from Kafka.
  • Downstream processing: Apache Spark. Used alongside Flink in the ads data pipeline.

Key contributors

  • Hannah Hagen, Senior Software Engineer, Data Engineering: co-built the ksqlDB event QA tool; presented at Kafka Summit Americas 2021
  • Paul Kiernan, Staff Software Engineer: co-built the ksqlDB event QA tool; presented at Kafka Summit Americas 2021
  • Derek Hsieh, Software Engineer III, Data Engineering: led the ksqlDB vote-manipulation detection pipeline; presented at Kafka Summit Americas 2021
  • Frédérique Mittelstaedt, Engineering Manager, Real-Time Safety Applications: led design of Reddit's centralised streaming platform on Kafka and Flink; presented at Flink Forward Global 2021
  • Bhavani Balasubramanyam, Software Engineer, Anti-Evil Engineering: developed the real-time pipeline for violent content detection; co-presented at Flink Forward Global 2021
  • Vignesh Raja, Software Engineer: worked on Flink Stateful Functions and Kafka for user safety; co-presented at Flink Forward Global 2021
  • Krishnan Chandra, Senior Software Engineer: led Terraform-based Kafka infrastructure management; presented at HashiCorp in January 2019
  • Neven Miculinić, Senior Software Engineer, Messaging Infrastructure: associated with the EC2-to-Kubernetes Kafka migration project ("Swapping the Engine Mid-Flight", 2026)

Key takeaways for your own Kafka implementation

  • Introduce a DNS abstraction before migrating brokers. Reddit's experience shows that decoupling client connection strings from physical broker addresses is what makes large-scale broker migrations feasible without client-side changes. If you are planning a hosting migration, build the DNS facade first.
  • Sequence control-plane and data-plane migrations separately. Reddit migrated brokers from EC2 to Kubernetes first, stabilised the deployment, and only then migrated from ZooKeeper to KRaft. Attempting both simultaneously increases the blast radius if something goes wrong.
  • Treat reversibility as a hard constraint, not a nice-to-have. Reddit required each migration phase to be fully reversible before proceeding. This slows the timeline but makes it possible to halt and recover if an unexpected issue arises mid-migration.
  • Use per-type topics with enforced schemas for action pipelines. The REV2 architecture uses a distinct Kafka egress topic per action type with a Protobuf schema. If you are building a Kafka-backed action or command pipeline, this pattern makes monitoring and schema evolution substantially more manageable than a multiplexed topic.
  • Kafka consumer group offsets are a reprocessing primitive. Reddit's time-travel feature in REV2 demonstrates that resetting consumer group offsets to replay historical data is a practical operational technique, not only a disaster-recovery option. If your pipeline needs to apply new logic to historical events, this approach avoids the need for a separate historical data store.

Sources and further reading

  1. Hannah Hagen and Paul Kiernan (Reddit) — Live Event Debugging With ksqlDB at Reddit, Kafka Summit Americas 2021
  2. Hannah Hagen and Paul Kiernan (Reddit) — Kafka Summit Americas 2021 slide deck
  3. Derek Hsieh (Reddit) — Catching Vote Manipulation at Reddit, Kafka Summit Americas 2021
  4. Frédérique Mittelstaedt, Bhavani Balasubramanyam, Vignesh Raja (Reddit) — Keeping Redditors Safe With Stateful Functions, Flink Forward Global 2021
  5. InfoQ — Reddit's REV2: Replacing a Batch Safety System with a Real-Time One Using Flink and Kafka, October 2023
  6. Krishnan Chandra (Reddit) — Reddit's Large-Scale Migration to Terraform, HashiCorp, January 2019
  7. Reddit Engineering — Swapping the Engine Mid-Flight: How We Moved a Petabyte of Kafka Data from EC2 to Kubernetes, r/RedditEng, January 2026
  8. ByteByteGo — summary of Reddit's Kafka migration
  9. Red Hat Developer — Kafka Monthly Digest, February 2026

If you are monitoring a Kafka deployment at the scale described in this article, Kpow provides real-time visibility across brokers, topics, consumer groups, and schema registries. You can connect it to any Kafka cluster and try it free for 30 days.