
How DoorDash uses Apache Kafka in production
DoorDash operates one of the largest food delivery platforms in the United States, coordinating millions of orders, restaurant partnerships, and Dasher assignments every day. Underpinning that coordination is an Apache Kafka deployment that spans five clusters, more than 2,500 topics, and around six billion messages per day on average — with peaks reaching twice that rate.
The company reached that scale through two distinct adoption waves: an initial migration from RabbitMQ in mid-2019 to stop task-processing outages, then a ground-up rebuild of its real-time data infrastructure under a platform called Iguazu that replaced Amazon Kinesis and SQS and brought data warehouse latency down from one day to a few minutes.
Company overview
DoorDash connects consumers, restaurants, and delivery drivers (Dashers) across the United States, Canada, Australia, and Japan. The platform processes a high volume of time-sensitive transactions — order placement, restaurant confirmation, Dasher assignment, and delivery completion — all of which depend on low-latency event propagation between microservices.
DoorDash first adopted Kafka in mid-2019 after repeated RabbitMQ failures began causing order checkout and Dasher dispatch outages under peak load. That initial migration, led by engineer Ashwin Kachhara, addressed the immediate reliability problem. A separate initiative, beginning around 2020 and led by Allen Wang (who had previously built Netflix's Keystone Kafka pipeline), took a broader view: replacing multiple fragmented cloud-native queuing systems with a unified, Kafka-centred streaming platform.
DoorDash's Kafka use cases
Asynchronous task processing
The original use case. Before mid-2019, DoorDash routed order checkout and Dasher assignment work through RabbitMQ and Celery. Repeated broker failures under peak load caused customer-facing outages with direct revenue impact; RabbitMQ offered limited observability and required slow, manual recovery steps.
Ashwin Kachhara and his team replaced RabbitMQ with Kafka for asynchronous task processing. Kafka's durable log, at-least-once delivery guarantees, and horizontal scalability eliminated the outage pattern. The migration was completed without downtime using a gradual traffic cutover approach.
Real-time event pipeline to Snowflake (Iguazu)
The more architecturally significant initiative. Before Iguazu, DoorDash maintained separate pipelines using Amazon SQS and Amazon Kinesis to feed its data warehouse, ML platform, and time-series metrics backend. Each pipeline used different technology and communication patterns, producing data warehouse latency of up to one day and significant operational overhead.
Iguazu replaced those fragmented pipelines with a single Kafka-centred platform. All microservices and mobile clients publish events via an HTTP-based Kafka proxy; Apache Flink jobs consume those topics and fan out to multiple destinations (Snowflake, Redis, Chronosphere) with format transformations applied per sink. End-to-end latency to Snowflake dropped from one day to a few minutes.
Use cases documented within Iguazu include:
- Dasher assignment monitoring: near-real-time warehouse data lets the Dasher assignment team detect algorithm bugs within minutes rather than the next day
- Mobile application health monitoring: checkout page load errors are routed to Chronosphere for operational alerting
- Real-time user sessionization and behaviour analysis
Real-time ML feature generation (Riviera and Fabricator)
Delivery events are consumed from Kafka topics by Flink jobs to produce real-time ML features — for example, recent average restaurant wait times. The Riviera framework (and its 2025 successor, Fabricator) expresses these pipelines as a SQL query plus a YAML configuration file, routing Kafka topic data through Flink into Redis for low-latency reads by inference services. By 2025, the Fabricator platform ran more than 100 pipelines generating 500 unique features at a rate of over 100 billion feature values per day.
Search index replication
The Search Platform team built a continuous indexing pipeline where database changes propagate through Kafka to Flink applications, which then update Elasticsearch. Before this pipeline, a full reindex of the search catalogue took up to one week; the Kafka and Flink approach made it continuous and near-real-time.
Advertising budget pacing
The ads platform publishes a record to a Kafka topic when a campaign hits its daily spend limit. A Flink streaming job consumes that topic and writes an expiration flag to CockroachDB, embedding the budget-capped state into DoorDash's in-house search index at the filtering stage. This change produced a 43% drop in search processor latency and a 45% reduction in discarded candidates compared to the previous approach.
Production testing via Kafka multi-tenancy
In 2024, DoorDash introduced a multi-tenant Kafka architecture to allow test traffic to flow through the same production topics and consumer instances as live traffic. OpenTelemetry context propagation carries a "doortest" tenant identifier on each Kafka record; consumers branch on that metadata without creating separate topics. This approach enables production-fidelity testing without topic proliferation.
DoorDash's Kafka architecture
High-level overview
DoorDash runs its Kafka infrastructure on AWS. The five clusters are operated by the Real-Time Streaming Platform (RTSP) team. Public sources do not describe the specific role of each cluster (whether they are segmented by team, environment, or criticality), though all five are provisioned and governed through the same internal tooling.
The Iguazu architecture has three main layers:
- Ingestion: An HTTP-based Kafka proxy (built on Confluent's open-source Kafka REST Proxy) runs on Kubernetes and accepts events from microservices and mobile clients. The proxy provides a REST interface so internal producers do not need native Kafka client configuration.
- Processing: Apache Flink jobs run as standalone Kubernetes services deployed via Helm charts, consuming Kafka topics and applying per-destination fan-out and format transformations.
- Delivery: Events are written to S3 in Parquet format, then ingested into Snowflake via Snowpipe (triggered by Amazon SQS notifications). ML features go to Redis. Operational metrics go to Chronosphere.
Event ingestion layer: the Iguazu Kafka proxy
DoorDash extended the open-source Confluent Kafka REST Proxy with several additions for the Iguazu use case:
- Multi-cluster routing (a single proxy endpoint routes to the appropriate cluster)
- Asynchronous producing (decoupling publisher latency from Kafka ACK wait)
- Metadata pre-fetching
- HTTP header passthrough for OpenTelemetry context propagation
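As a flavour of the REST-style interface, a producer-side call to such a proxy might look like the following sketch. The endpoint path, payload shape, and header keys here are assumptions about the general pattern (modelled on Confluent's REST Proxy conventions), not DoorDash's actual API:

```python
import json

def build_proxy_request(topic, event, tenant_id=""):
    """Build an HTTP request body for a hypothetical Kafka REST proxy.

    The proxy, not the client, owns native Kafka producer configuration;
    the client only needs to speak HTTP/JSON.
    """
    headers = {"content-type": "application/json"}
    if tenant_id:
        # OpenTelemetry-style context propagated as an HTTP header,
        # which the proxy copies onto the Kafka record headers.
        headers["baggage"] = f"tenant-id={tenant_id}"
    return {
        "method": "POST",
        "path": f"/topics/{topic}",  # hypothetical route
        "headers": headers,
        "body": json.dumps({"records": [{"value": event}]}),
    }

req = build_proxy_request("order_events", {"order_id": "o-123", "status": "placed"})
print(req["path"])  # /topics/order_events
```

Asynchronous producing means the proxy can acknowledge this HTTP call before the Kafka broker ACK arrives, decoupling publisher latency from broker round-trips.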
Schema management
All events are defined in Protocol Buffers. Schemas are stored in a central shared Protobuf Git repository and validated at build time in CI/CD pipelines — invalid schemas are caught before deployment, not at runtime. Confluent Schema Registry handles runtime schema lookup.
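A build-time compatibility gate of this kind can be illustrated with a toy checker. Real Protobuf validation operates on compiled descriptors, but the core rule it enforces, never change the type of an existing field number, reduces to something like this (the schema representation is simplified for illustration):

```python
def backward_compatible(old_fields, new_fields):
    """Return violations: an existing field number must keep its type.

    Fields are {field_number: type_name}. Removing a field is tolerated by
    Protobuf readers (unknown fields are skipped); retyping one is not.
    """
    violations = []
    for number, old_type in old_fields.items():
        new_type = new_fields.get(number)
        if new_type is not None and new_type != old_type:
            violations.append(f"field {number} changed type {old_type} -> {new_type}")
    return violations

old = {1: "string", 2: "int64"}
good = {1: "string", 2: "int64", 3: "bool"}  # added a field: compatible
bad = {1: "string", 2: "string"}             # retyped field 2: incompatible

assert backward_compatible(old, good) == []
assert backward_compatible(old, bad) == ["field 2 changed type int64 -> string"]
```

Running a check like this in CI is what moves schema failures from runtime to build time.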
Where downstream sinks require Avro (for example, certain legacy connectors), DoorDash's serialization library converts Protobuf to Avro transparently using Avro's protobuf library. Producers remain Protobuf-native and are unaware of Avro consumers.
Every event flowing through Iguazu uses a standard envelope containing: creation time, source service, encoding method reference, and a schema registry pointer.
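A minimal sketch of such an envelope follows; the field names are assumed for illustration (the real envelope is a Protobuf message, not a dict):

```python
import time
import uuid

def wrap_event(payload, source_service, schema_id):
    """Wrap a serialized event in a standard envelope.

    Field names are illustrative, mirroring the envelope contents the
    article describes: creation time, source service, encoding method,
    and a schema registry pointer.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "created_at_ms": int(time.time() * 1000),  # creation time
        "source": source_service,                  # producing service
        "encoding": "protobuf",                    # encoding method reference
        "schema_id": schema_id,                    # schema registry pointer
        "payload": payload,
    }

env = wrap_event(b"\x08\x01", "order-service", schema_id=42)
```

Consumers can then resolve `schema_id` against the registry before deserializing `payload`.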
Producer architecture
For high-volume event topics without natural message keys (random partition assignment), DoorDash's default producer configuration produced small, frequent batches and excessive broker CPU load. The RTSP team addressed this with two configuration changes:
- Switching to Kafka's built-in sticky partitioner (batching null-keyed records to the same partition until a batch is ready)
- Increasing linger.ms to 50-100 ms to accumulate larger batches
The result was a 30-40% reduction in Kafka broker CPU utilisation on those topics.
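Why stickiness matters for null-keyed traffic can be shown with a small simulation. This is a deliberately simplified timing model (one linger window = a fixed run of records, all open batches flush at the window boundary), not librdkafka's actual batching logic:

```python
def batches_flushed(partition_for, n_records, n_partitions, linger_window):
    """Count batches when the producer flushes all open batches once per
    linger window (simplified model: a window is `linger_window` records)."""
    total = 0
    for start in range(0, n_records, linger_window):
        window = range(start, min(start + linger_window, n_records))
        # One batch per partition touched within the window.
        total += len({partition_for(i, n_partitions) for i in window})
    return total

# Round-robin spreads each window's records across every partition,
# so every flush emits many tiny batches...
round_robin = lambda i, n: i % n
# ...while a sticky partitioner sends a whole window to one partition.
sticky = lambda i, n: (i // 100) % n  # switch partitions every 100 records

rr = batches_flushed(round_robin, 10_000, 32, linger_window=100)
st = batches_flushed(sticky, 10_000, 32, linger_window=100)
print(rr, st)  # 3200 100
```

Thirty-two times fewer batches means fewer produce requests for brokers to process, which is where the CPU saving comes from.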
For topics requiring delivery guarantees, DoorDash uses replication factor 2 with min.insync.replicas set to 1 — a deliberate choice to favour throughput and cost over maximum durability on non-critical event streams.
Consumer architecture
Consumer groups are managed centrally by the RTSP team. Topic readiness after provisioning is confirmed by polling Prometheus metrics via Chronosphere; the Minions orchestration service waits until topic metrics appear in Chronosphere before marking an onboarding workflow complete.
Stream processing
Apache Flink is the primary stream processing engine. DoorDash exposes two interfaces to internal users:
- Flink DataStream API for custom stateful processing logic
- Flink SQL with YAML configuration (via the Riviera and Fabricator frameworks) for declarative feature pipelines, where engineers write a SQL query and a YAML file rather than Flink code
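To give a flavour of the declarative interface, here is a hypothetical feature pipeline spec in the Riviera/Fabricator style. The keys and SQL below are invented for illustration and do not come from DoorDash's actual framework:

```yaml
# hypothetical-feature.yaml -- illustrative only
source:
  kafka_topic: store_delivery_events
sink:
  type: redis
  feature_name: store_avg_wait_5m
sql: |
  SELECT store_id,
         AVG(wait_seconds) AS avg_wait
  FROM store_delivery_events
  GROUP BY store_id,
           HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)
```

The framework turns a file like this into a running Flink job, so the author never touches the DataStream API.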
Data delivery paths
S3 acts as a durable buffer independent of Kafka retention. If Snowflake suffers downtime, failed Snowpipe ingestions can be backfilled from S3 without depending on Kafka replayability.
Per-event isolation is a design principle: each event type has its own dedicated Flink application and its own Snowpipe instance. Failures are contained to a single event type rather than cascading across a shared pipeline.
Special techniques and engineering decisions
Declarative ML feature pipelines
The Riviera framework (2021) and its successor Fabricator (2025) allow data scientists to express Flink stream processing jobs as a SQL query plus a YAML configuration file. The framework generates the Flink job; engineers who would otherwise need to write DataStream API code can instead write a single YAML file. Feature development time was reduced from weeks to hours, and the feature engineering codebase shrank by 70%.
S3 as a durable overflow buffer
Finite Kafka retention creates a risk: if Snowflake is unavailable long enough for Kafka logs to roll off, data is lost. DoorDash addressed this by routing all Iguazu Flink jobs through S3 (Parquet) before Snowpipe ingestion. S3 is not subject to Kafka retention windows, so backfills remain possible regardless of Kafka log state.
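The retention argument can be made concrete with a toy model: once records age out of the Kafka log, only the S3 copy can serve a backfill. This is purely illustrative, not DoorDash code:

```python
RETENTION = 1000  # keep only the most recent 1000 offsets in Kafka

kafka_log = {}  # offset -> record, subject to retention
s3_bucket = {}  # offset -> record, never expires

for offset in range(5000):
    record = f"event-{offset}"
    kafka_log[offset] = record
    s3_bucket[offset] = record      # Flink writes Parquet to S3 as it goes
    expired = offset - RETENTION
    if expired >= 0:
        kafka_log.pop(expired)      # broker enforces finite retention

# Suppose Snowflake was down while offsets 0..2999 were produced.
missing = range(0, 3000)
from_kafka = [o for o in missing if o in kafka_log]
from_s3 = [o for o in missing if o in s3_bucket]
print(len(from_kafka), len(from_s3))  # 0 3000
```

Every missing offset is gone from the Kafka log but fully recoverable from S3, which is exactly the decoupling the design buys.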
OpenTelemetry-based Kafka multi-tenancy
Rather than creating separate topics per test scenario, DoorDash propagates OpenTelemetry context (tenant ID and route information) on every Kafka record header. Production and test traffic share the same topics and consumer group instances; consumers inspect the OTEL header to branch handling. This avoids topic proliferation while providing production-fidelity test conditions.
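The consumer-side branching can be sketched as follows. The header key, the tenant value format (W3C baggage, which OpenTelemetry uses for context propagation), and the handler split are assumptions about the general pattern, not DoorDash's exact implementation:

```python
def extract_tenant(headers):
    """Pull a tenant identifier out of Kafka record headers.

    Kafka headers arrive as (key, bytes) pairs; 'baggage' carries
    comma-separated key=value entries.
    """
    for key, value in headers or []:
        if key == "baggage":
            for item in value.decode().split(","):
                k, _, v = item.strip().partition("=")
                if k == "tenant-id":
                    return v
    return "production"

def handle(record_value, headers):
    tenant = extract_tenant(headers)
    if tenant.startswith("doortest"):
        return ("test-path", record_value)  # e.g. write to a shadow store
    return ("live-path", record_value)

assert handle("order", [("baggage", b"tenant-id=doortest:checkout-1")])[0] == "test-path"
assert handle("order", None)[0] == "live-path"
```

Because the branch happens per record, test and production traffic share topics, partitions, and consumer group instances.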
Protobuf-to-Avro transparent conversion
DoorDash maintains Protobuf as the canonical internal event format. Where downstream sinks require Avro, its serialization library converts on the fly. This allows the company to remain Protobuf-native without modifying producers when adding Avro-dependent sinks.
Operating Kafka at scale
Deployment
All five Kafka clusters run on AWS, managed by the RTSP team. Flink jobs are deployed as standalone Kubernetes services via Helm charts. Infrastructure is managed with Terraform.
Topic governance
DoorDash went through three generations of topic provisioning tooling:
- Terraform / Atlantis (GitOps): Required infrastructure team review for every topic creation request. As the pace of new topics grew to roughly 100 per week, this created a significant support burden.
- Infra Service + Minions (2023): Replaced GitOps with an HTTP API (Infra Service) and a Cadence-backed orchestration service (Minions) that automates end-to-end Iguazu onboarding — creating Kafka topics, launching Flink jobs, creating Snowflake objects, and opening pull requests, with Slack notifications at each step. Configuration options exposed to users are deliberately restricted to capacity-related settings (retention, partition count) to prevent common misconfigurations. Onboarding time dropped by 95%, from multiple days to under one hour, and typically to within 15 minutes.
- Kafka Self-Serve (2024): Extended Infra Service to cover the full resource lifecycle — topics, user accounts, and ACLs — with auto-approval rules for routine, non-sensitive changes such as standard ACL grants or topic resizes within pre-set bounds.
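An auto-approval gate of this kind can be sketched as a rule function. The request shape, operation names, and thresholds below are invented for illustration; DoorDash's actual bounds are not published:

```python
MAX_PARTITIONS = 256     # assumed upper bound for auto-approval
MAX_RETENTION_H = 7 * 24

def auto_approve(request):
    """Approve routine, non-sensitive changes; everything else goes to review."""
    kind = request["kind"]
    if kind == "acl_grant":
        # Standard read/write grants are routine; admin-level grants are not.
        return request["operation"] in {"READ", "WRITE"}
    if kind == "topic_resize":
        # Resizes within pre-set bounds do not need a human in the loop.
        return (request["partitions"] <= MAX_PARTITIONS
                and request["retention_hours"] <= MAX_RETENTION_H)
    return False  # unknown request kinds always go to review

assert auto_approve({"kind": "acl_grant", "operation": "READ"})
assert not auto_approve({"kind": "acl_grant", "operation": "ALTER"})
assert auto_approve({"kind": "topic_resize", "partitions": 64, "retention_hours": 72})
```

The point is that the safe, high-volume cases are encoded as policy, leaving reviewers only the exceptional requests.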
Authentication and access control
DoorDash uses SASL/SCRAM for per-service Kafka authentication and ACLs for authorisation, controlling which producers and consumer groups can access each topic. The 2024 Kafka Self-Serve platform manages user account creation and ACL assignment through the same API-first workflow as topic creation.
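For reference, a SASL/SCRAM client configuration in librdkafka-style naming (as used by confluent-kafka-python) looks like the sketch below. The broker address and credentials are placeholders, not DoorDash values:

```python
# Per-service credentials; in practice the password comes from a secret store.
kafka_client_config = {
    "bootstrap.servers": "kafka-cluster.example.internal:9092",  # placeholder
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-512",     # SASL/SCRAM authentication
    "sasl.username": "order-service",       # one principal per service
    "sasl.password": "<from-secret-store>", # never hard-coded in practice
}
```

Server-side ACLs then decide what that principal may do, e.g. which topics it can produce to and which consumer groups it can join.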
Monitoring
Chronosphere is DoorDash's operational metrics backend for Kafka. Topic readiness after provisioning is confirmed by polling Prometheus metrics through Chronosphere's API — the Minions orchestration workflow waits on this before marking onboarding complete. Application-level operational events (for example, mobile checkout errors) are also routed through Iguazu Flink jobs to Chronosphere for alerting.
Schema evolution
Protobuf's inherent backward and forward compatibility is the primary schema evolution mechanism. Schemas must pass CI/CD validation before merging; runtime enforcement is handled through Confluent Schema Registry. The Fabricator framework enforces backward- and forward-compatible evolution at the framework level.
Challenges and how DoorDash solved them
RabbitMQ reliability failures at peak load
Problem: RabbitMQ repeatedly went down under heavy order volume. Because order checkout and Dasher assignment ran through RabbitMQ and Celery, broker failures directly caused customer-facing outages. Recovery was slow and manual; observability was poor.
Solution: Replaced RabbitMQ with Kafka for asynchronous task processing using a gradual traffic cutover with no downtime. Kafka's durable log and horizontal scalability eliminated the outage pattern.
Outcome: Task-processing outages from broker failures ceased; the system could scale horizontally to handle peak load.
Source: Ashwin Kachhara, DoorDash Engineering Blog, September 2020.
Fragmented SQS and Kinesis pipelines with one-day warehouse latency
Problem: Separate SQS and Kinesis pipelines for the data warehouse, ML platform, and metrics backend used incompatible technologies and communication paradigms, producing warehouse data latency of up to one day and significant operational overhead from managing disparate systems.
Solution: Consolidated onto a single Kafka + Flink platform (Iguazu). Kafka serves as the unified pub/sub hub; Flink applies per-destination fan-out with data format transformations.
Outcome: Warehouse latency dropped from one day to a few minutes; a single platform team operates the infrastructure.
Source: Allen Wang, DoorDash Engineering Blog, August 2022.
High broker CPU on null-keyed event streams
Problem: For high-volume topics without message keys, the default round-robin partition assignment produced small, frequent batches, generating excessive CPU load on brokers.
Solution: Configured Kafka's sticky partitioner and increased linger.ms to 50-100 ms to accumulate larger batches per partition before flushing.
Outcome: 30-40% reduction in Kafka broker CPU utilisation on affected topics.
Source: Allen Wang, August 2022 (via Shen Zhu engineering blog summary).
Gevent incompatibility with librdkafka in Python
Problem: A Python point-of-sale service used Gevent for async I/O via monkey-patching. Gevent cannot patch librdkafka (a C library), so native Kafka consumer usage blocked Gevent's event loop.
Solution: Ran the Kafka consumer loop inside a dedicated Gevent greenlet. Blocking I/O calls inside the consumer were replaced with Gevent-compatible equivalents.
Outcome: The migrated service outperformed the previous Celery/Gevent worker under heavy I/O.
Source: Jessica Zhao and Boyang Wei, DoorDash Engineering Blog, February 2021.
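The structure of that fix, a blocking poll loop confined to its own execution context, handing messages to the cooperative side through a queue, can be sketched without gevent. Here a thread stands in for the dedicated greenlet so the example stays dependency-free; in DoorDash's actual solution the loop ran in a greenlet with gevent-compatible I/O:

```python
import queue
import threading
import time

def consumer_loop(out_q, stop):
    """Blocking poll loop (stands in for a librdkafka consumer.poll() loop)."""
    offset = 0
    while not stop.is_set():
        time.sleep(0.001)               # stands in for consumer.poll(timeout)
        out_q.put(f"message-{offset}")  # hand off to the cooperative side
        offset += 1

out_q, stop = queue.Queue(), threading.Event()
t = threading.Thread(target=consumer_loop, args=(out_q, stop), daemon=True)
t.start()

received = [out_q.get(timeout=1) for _ in range(3)]
stop.set()
t.join(timeout=1)
print(received)  # ['message-0', 'message-1', 'message-2']
```

The key property is that the blocking calls never run on the event loop itself; only queue hand-offs cross the boundary.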
Snowflake downtime risking data loss within Kafka retention windows
Problem: Kafka's finite log retention creates a data loss risk if Snowflake is unavailable long enough for topic logs to roll off.
Solution: All Iguazu Flink jobs write to S3 in Parquet format before Snowpipe ingestion. S3 serves as a durable buffer independent of Kafka retention; backfills are possible from S3 regardless of Kafka log state.
Source: Allen Wang, InfoQ / QCon Plus, June 2023.
Manual topic provisioning becoming a bottleneck
Problem: GitOps-based topic creation required infrastructure team review for every new topic. At approximately 100 new topics per week, this created a significant support burden and slowed engineering teams waiting for approvals.
Solution: Replaced GitOps with an HTTP API (Infra Service) backed by Minions/Cadence orchestration, with auto-approval for routine requests.
Outcome: Iguazu onboarding time reduced by 95%, from multiple days to under one hour.
Source: DoorDash Engineering Blog, December 2023.
Key takeaways for your own Kafka implementation
- Use S3 as a durable layer between Kafka and your warehouse. Kafka retention is finite; if your downstream sink has an outage, you lose the ability to replay. Routing Flink output through S3 (Parquet) before Snowpipe decouples ingestion from Kafka retention windows and allows backfill regardless of Kafka log state.
- Tune your producer for null-keyed high-volume topics. If you have event streams without natural message keys, round-robin partition assignment produces small, frequent batches and elevated broker CPU. Kafka's sticky partitioner combined with a linger.ms of 50-100 ms is a low-risk configuration change that can meaningfully reduce broker load.
- Restrict what users can configure when self-serving topics. DoorDash's API-first governance deliberately exposes only capacity-related settings (retention, partition count) to internal users, hiding complex broker parameters. This design choice reduces misconfiguration without removing autonomy, and it scales better than a review-based GitOps model as topic volume grows.
- Consider isolating Flink jobs per event type rather than sharing pipelines. Per-event isolation means a failure in one pipeline does not cascade to others. The trade-off is more Kubernetes deployments to manage; DoorDash addressed that with Helm and centralised orchestration through Minions.
- If you run large-scale ML feature pipelines on Kafka, evaluate a declarative DSL layer. Riviera and Fabricator show that abstracting Flink SQL behind a YAML configuration significantly widens the pool of engineers who can create and maintain real-time feature pipelines, at the cost of framework complexity that the platform team must own.
Sources and further reading
Primary sources
- Ashwin Kachhara, "Eliminating Task Processing Outages by Replacing RabbitMQ with Apache Kafka Without Downtime," DoorDash Engineering Blog, September 2020.
- Jessica Zhao and Boyang Wei, "How to Make Kafka Consumer Compatible with Gevent in Python," DoorDash Engineering Blog, February 2021.
- DoorDash Engineering, "Building A Declarative Real-Time Feature Engineering Framework," DoorDash Engineering Blog, March 2021.
- Satish, Danial, and Siddharth, "Building Faster Indexing with Apache Kafka and Elasticsearch," DoorDash Engineering Blog, July 2021.
- Allen Wang, "Building Scalable Real-Time Event Processing with Kafka and Flink," DoorDash Engineering Blog, August 2022.
- Allen Wang, "From Zero to a Hundred Billion: Building Scalable Real-Time Event Processing at DoorDash," QCon San Francisco, October 2022.
- Allen Wang, "Building Scalable Real Time Event Processing with Kafka and Flink," InfoQ / QCon Plus (recorded), June 2023.
- DoorDash Engineering (RTSP team), "API-First Approach to Kafka Topic Creation," DoorDash Engineering Blog, December 2023.
- DoorDash Engineering, "Introducing DoorDash's In-House Search Engine," DoorDash Engineering Blog, February 2024.
- Carlos Herrera, Amit Gud, and Yunji Zhong, "Setting Up Kafka Multi-Tenancy," DoorDash Engineering Blog, March 2024.
- Seed Zeng, Kane Du, and Donovan Bai, "DoorDash Empowers Engineers with Kafka Self-Serve," DoorDash Engineering Blog, August 2024.
- DoorDash Engineering, "Introducing Fabricator: A Declarative Feature Engineering Framework," DoorDash Engineering Blog, April 2025.
If you are running Kafka in production and want visibility into consumer lag, topic health, and cluster performance, Kpow gives you a UI and monitoring layer that works with any Kafka deployment. You can try it with a free 30-day trial and connect it to any cluster in minutes via Docker, Helm, or JAR.