
How DoorDash uses Apache Kafka in production


Factor House
May 16th, 2026

DoorDash operates one of the largest food delivery platforms in the United States, coordinating millions of orders, restaurant partnerships, and Dasher assignments every day. Underpinning that coordination is an Apache Kafka deployment that spans five clusters, more than 2,500 topics, and around six billion messages per day on average — with peaks reaching twice that rate.

The company reached that scale through two distinct adoption waves: an initial migration from RabbitMQ in mid-2019 to stop task-processing outages, then a ground-up rebuild of its real-time data infrastructure under a platform called Iguazu that replaced Amazon Kinesis and SQS and brought data warehouse latency down from one day to a few minutes.

Company overview

DoorDash connects consumers, restaurants, and delivery drivers (Dashers) across the United States, Canada, Australia, and Japan. The platform processes a high volume of time-sensitive transactions — order placement, restaurant confirmation, Dasher assignment, and delivery completion — all of which depend on low-latency event propagation between microservices.

DoorDash first adopted Kafka in mid-2019 after repeated RabbitMQ failures began causing order checkout and Dasher dispatch outages under peak load. That initial migration, led by engineer Ashwin Kachhara, addressed the immediate reliability problem. A separate initiative, beginning around 2020 and led by Allen Wang (who had previously built Netflix's Keystone Kafka pipeline), took a broader view: replacing multiple fragmented cloud-native queuing systems with a unified, Kafka-centred streaming platform.

Kafka milestones at DoorDash

| Date | Event |
| --- | --- |
| Mid-2019 | RabbitMQ and Celery outages drive adoption of Kafka for order checkout and Dasher assignment |
| Late 2019 | Allen Wang joins from Netflix; begins building the Real-Time Streaming Platform (RTSP) team |
| ~2020 | Construction of Iguazu begins; Kafka replaces Amazon SQS and Kinesis as the central ingestion hub |
| 2021 | Riviera declarative ML feature framework goes into production (Kafka + Flink SQL + Redis) |
| 2022 | Iguazu reaches hundreds of billions of events per day; Allen Wang publishes the primary architecture post and presents at QCon San Francisco |
| 2023 | API-first Kafka topic governance replaces Terraform/Atlantis GitOps; 2,500+ topics, five clusters, six billion messages per day documented publicly |
| 2024 | Kafka multi-tenancy via OpenTelemetry context propagation; full self-serve CRUD for topics, users, and ACLs |
| 2025 | Fabricator (successor to Riviera) launched: 100+ pipelines, 500 ML features, 100 billion daily feature values |

DoorDash's Kafka use cases

Asynchronous task processing

The original use case. Before mid-2019, DoorDash routed order checkout and Dasher assignment work through RabbitMQ and Celery. Repeated broker failures under peak load caused customer-facing outages with direct revenue impact; RabbitMQ offered limited observability and required slow, manual recovery steps.

Ashwin Kachhara and his team replaced RabbitMQ with Kafka for asynchronous task processing. Kafka's durable log, at-least-once delivery guarantees, and horizontal scalability eliminated the outage pattern. The migration was completed without downtime using a gradual traffic cutover approach.

Real-time event pipeline to Snowflake (Iguazu)

The more architecturally significant initiative. Before Iguazu, DoorDash maintained separate pipelines using Amazon SQS and Amazon Kinesis to feed its data warehouse, ML platform, and time-series metrics backend. Each pipeline used different technology and communication patterns, producing data warehouse latency of up to one day and significant operational overhead.

Iguazu replaced those fragmented pipelines with a single Kafka-centred platform. All microservices and mobile clients publish events via an HTTP-based Kafka proxy; Apache Flink jobs consume those topics and fan out to multiple destinations (Snowflake, Redis, Chronosphere) with format transformations applied per sink. End-to-end latency to Snowflake dropped from one day to a few minutes.

Use cases documented within Iguazu include:

  • Dasher assignment monitoring: near-real-time warehouse data lets the Dasher assignment team detect algorithm bugs within minutes rather than the next day
  • Mobile application health monitoring: checkout page load errors are routed to Chronosphere for operational alerting
  • Real-time user sessionization and behaviour analysis

Real-time ML feature generation (Riviera and Fabricator)

Delivery events are consumed from Kafka topics by Flink jobs to produce real-time ML features — for example, recent average restaurant wait times. The Riviera framework (and its 2025 successor, Fabricator) expresses these pipelines as a SQL query plus a YAML configuration file, routing Kafka topic data through Flink into Redis for low-latency reads by inference services. By 2025, the Fabricator platform ran more than 100 pipelines generating 500 unique features at a rate of over 100 billion feature values per day.

Search index replication

The Search Platform team built a continuous indexing pipeline where database changes propagate through Kafka to Flink applications, which then update Elasticsearch. Before this pipeline, a full reindex of the search catalogue took up to one week; the Kafka and Flink approach made it continuous and near-real-time.

Advertising budget pacing

The ads platform publishes a record to a Kafka topic when a campaign hits its daily spend limit. A Flink streaming job consumes that topic and writes an expiration flag to CockroachDB, embedding the budget-capped state into DoorDash's in-house search index at the filtering stage. This change produced a 43% drop in search processor latency and a 45% reduction in discarded candidates compared to the previous approach.
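
Conceptually the pacing path is a consumer that turns a budget-cap event into a database flag. The sketch below is a simplified Python stand-in for the Flink streaming job, writing to CockroachDB over its Postgres-compatible wire protocol; the topic, table, and payload fields are hypothetical.

```python
import json

import psycopg2
from confluent_kafka import Consumer

# Simplified stand-in for the Flink job described above: consume
# budget-cap events and flag the campaign in CockroachDB.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "budget-pacing",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ad_budget_exhausted"])  # hypothetical topic

db = psycopg2.connect("postgresql://root@cockroach:26257/ads")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    with db.cursor() as cur:
        # UPSERT is CockroachDB's idempotent insert-or-update statement.
        cur.execute(
            "UPSERT INTO campaign_flags (campaign_id, budget_capped) VALUES (%s, TRUE)",
            (event["campaign_id"],),
        )
    db.commit()
```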

Production testing via Kafka multi-tenancy

In 2024, DoorDash introduced a multi-tenant Kafka architecture to allow test traffic to flow through the same production topics and consumer instances as live traffic. OpenTelemetry context propagation carries a "doortest" tenant identifier on each Kafka record; consumers branch on that metadata without creating separate topics. This approach enables production-fidelity testing without topic proliferation.

Scale and throughput

| Metric | Figure | Source |
| --- | --- | --- |
| Kafka clusters | 5 | DoorDash Engineering Blog, December 2023 |
| Topics | 2,500+ | DoorDash Engineering Blog, December 2023 |
| Daily message volume | ~6 billion messages | DoorDash Engineering Blog, December 2023 |
| Average throughput | ~4 million messages per minute | DoorDash Engineering Blog, December 2023 |
| Peak throughput | ~8 million messages per minute | DoorDash Engineering Blog, December 2023 |
| Iguazu event volume | Hundreds of billions of events per day | Allen Wang, DoorDash Engineering Blog, August 2022 |
| Delivery guarantee | 99.99% (four nines) | Allen Wang, DoorDash Engineering Blog, August 2022 |
| Data loss rate (async mode) | Less than 0.001% | Allen Wang, InfoQ / QCon Plus, June 2023 |
| Flink daily processing volume | 220 TB per day | Junaid Effendi, The Sequence (citing DoorDash engineering sources) |
| New topics provisioned per week | ~100 | DoorDash Engineering Blog, December 2023 |
| Warehouse latency improvement | ~1 day to a few minutes | Allen Wang, InfoQ / QCon Plus, June 2023 |
| Fabricator daily feature values | 100+ billion | DoorDash Engineering Blog, April 2025 |

DoorDash's Kafka architecture

High-level overview

DoorDash runs its Kafka infrastructure on AWS. The five clusters are operated by the Real-Time Streaming Platform (RTSP) team. Public sources do not describe the specific role of each cluster (whether they are segmented by team, environment, or criticality), though all five are provisioned and governed through the same internal tooling.

The Iguazu architecture has three main layers:

  1. Ingestion: An HTTP-based Kafka proxy (built on Confluent's open-source Kafka REST Proxy) runs on Kubernetes and accepts events from microservices and mobile clients. The proxy provides a REST interface so internal producers do not need native Kafka client configuration (a produce sketch follows this list).
  2. Processing: Apache Flink jobs run as standalone Kubernetes services deployed via Helm charts, consuming Kafka topics and applying per-destination fan-out and format transformations.
  3. Delivery: Events are written to S3 in Parquet format, then ingested into Snowflake via Snowpipe (triggered by Amazon SQS notifications). ML features go to Redis. Operational metrics go to Chronosphere.
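
For a sense of what the ingestion layer looks like from a producer's side, here is a hedged sketch against the open-source Confluent REST Proxy v2 produce API that the Iguazu proxy extends. The endpoint host and payload are hypothetical, and DoorDash's real events are Protobuf rather than JSON.

```python
import requests

# Produce one record over HTTP; no native Kafka client configuration
# needed. Host and event fields are hypothetical.
resp = requests.post(
    "https://iguazu-proxy.internal.example/topics/checkout_events",
    json={"records": [{"value": {"order_id": "123", "status": "placed"}}]},
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # per-record partition/offset assignments
```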

Event ingestion layer: the Iguazu Kafka proxy

DoorDash extended the open-source Confluent Kafka REST Proxy with several additions for the Iguazu use case:

  • Multi-cluster routing (a single proxy endpoint routes to the appropriate cluster)
  • Asynchronous producing (decoupling publisher latency from Kafka ACK wait)
  • Metadata pre-fetching
  • HTTP header passthrough for OpenTelemetry context propagation

Schema management

All events are defined in Protocol Buffers. Schemas are stored in a central shared Protobuf Git repository and validated at build time in CI/CD pipelines — invalid schemas are caught before deployment, not at runtime. Confluent Schema Registry handles runtime schema lookup.

Where downstream sinks require Avro (for example, certain legacy connectors), DoorDash's serialization library converts Protobuf to Avro transparently using Avro's protobuf library. Producers remain Protobuf-native and are unaware of Avro consumers.

Every event flowing through Iguazu uses a standard envelope containing: creation time, source service, encoding method reference, and a schema registry pointer.
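
A minimal rendering of that envelope as a data structure, with hypothetical field names; the real envelope is defined as a Protobuf message.

```python
from dataclasses import dataclass

# Hypothetical field names illustrating the envelope fields listed above.
@dataclass
class EventEnvelope:
    created_at_ms: int     # event creation time (epoch millis)
    source_service: str    # producing microservice
    encoding: str          # encoding method reference, e.g. "protobuf"
    schema_id: int         # Confluent Schema Registry pointer
    payload: bytes         # the encoded event itself
```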

Producer architecture

For high-volume event topics without natural message keys (round-robin partition assignment), DoorDash's default producer configuration produced small, frequent batches and excessive broker CPU load. The RTSP team addressed this with two configuration changes:

  • Switching to Kafka's built-in sticky partitioner (batching null-keyed records to the same partition until a batch is ready)
  • Increasing linger.ms to 50-100 ms to accumulate larger batches

The result was a 30-40% reduction in Kafka broker CPU utilisation on those topics.

For most high-volume event topics, DoorDash uses a replication factor of 2 with min.insync.replicas set to 1: a deliberate choice to favour throughput and cost over maximum durability on non-critical event streams.
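
A hedged sketch of this producer configuration using confluent-kafka-python; DoorDash's producers are largely JVM-based, where the Java client's built-in sticky partitioner is the direct equivalent of librdkafka's sticky.partitioning.linger.ms. Broker address and topic are placeholders.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",      # placeholder
    "linger.ms": 75,                         # accumulate batches for 50-100 ms
    "sticky.partitioning.linger.ms": 75,     # keep null-keyed records on one partition per batch
    "acks": "1",                             # matches the throughput-over-durability stance above
})

# Null-keyed, high-volume event: with sticky partitioning these records
# fill one partition's batch at a time instead of spraying round-robin.
producer.produce("order_events", value=b'{"event": "checkout"}')
producer.flush()
```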

Consumer architecture

Consumer groups are managed centrally by the RTSP team. Topic readiness after provisioning is confirmed by polling Prometheus metrics via Chronosphere; the Minions orchestration service waits until topic metrics appear in Chronosphere before marking an onboarding workflow complete.
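
A sketch of that readiness check under stated assumptions: Chronosphere exposes a Prometheus-compatible query API, so an orchestration step can poll until the new topic's metrics appear. The metric name and endpoint are hypothetical.

```python
import time

import requests

def wait_for_topic_metrics(topic: str, timeout_s: int = 600) -> bool:
    # Hypothetical metric; any per-topic broker metric would do.
    query = f'kafka_server_brokertopicmetrics_messagesin_total{{topic="{topic}"}}'
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(
            "https://metrics.internal.example/api/v1/query",  # hypothetical host
            params={"query": query},
            timeout=10,
        )
        if resp.ok and resp.json()["data"]["result"]:
            return True  # metrics visible: mark the onboarding step complete
        time.sleep(15)
    return False
```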

Stream processing

Apache Flink is the primary stream processing engine. DoorDash exposes two interfaces to internal users:

  • Flink DataStream API for custom stateful processing logic
  • Flink SQL with YAML configuration (via the Riviera and Fabricator frameworks) for declarative feature pipelines, where engineers write a SQL query and a YAML file rather than Flink code

Data delivery paths

| Destination | Path |
| --- | --- |
| Snowflake (data warehouse) | Kafka topic → Flink job → S3 (Parquet) → SQS notification → Snowpipe → Snowflake |
| Redis (ML feature store) | Kafka topic → Flink SQL job (Riviera / Fabricator) → Redis |
| Chronosphere (operational metrics) | Kafka topic → Flink job → Chronosphere time-series backend |
| Elasticsearch (search index) | Database change event → Kafka topic → Flink job → Elasticsearch |
| CockroachDB (ad budget state) | Ad spend event → Kafka topic → Flink streaming job → CockroachDB |

S3 acts as a durable buffer independent of Kafka retention. If Snowflake suffers downtime, failed Snowpipe ingestions can be backfilled from S3 without depending on Kafka replayability.

Per-event isolation is a design principle: each event type has its own dedicated Flink application and its own Snowpipe instance. Failures are contained to a single event type rather than cascading across a shared pipeline.

Special techniques and engineering decisions

Declarative ML feature pipelines

The Riviera framework (2021) and its successor Fabricator (2025) allow data scientists to express Flink stream processing jobs as a SQL query plus a YAML configuration file. The framework generates the Flink job; engineers who would otherwise need to write DataStream API code instead supply the query and configuration. Feature development time was reduced from weeks to hours, and the feature engineering codebase shrank by 70%.
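
To make the division of labour concrete, here is a hedged PyFlink sketch of the kind of windowed Flink SQL such a configuration wraps. The table, topic, and feature names are hypothetical, and the real frameworks generate the job (including the Redis sink wiring) from YAML rather than Python.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Requires the flink-sql-connector-kafka jar on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE store_events (
        store_id     BIGINT,
        wait_time_s  DOUBLE,
        event_ts     TIMESTAMP(3),
        WATERMARK FOR event_ts AS event_ts - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'store_events',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")

# Recent average wait time per store over a sliding 30-minute window;
# the framework would route this result to Redis for inference reads.
avg_wait = t_env.sql_query("""
    SELECT store_id,
           window_start,
           AVG(wait_time_s) AS avg_wait_time_30m
    FROM TABLE(HOP(TABLE store_events, DESCRIPTOR(event_ts),
                   INTERVAL '5' MINUTE, INTERVAL '30' MINUTE))
    GROUP BY store_id, window_start, window_end
""")
```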

S3 as a durable overflow buffer

Finite Kafka retention creates a risk: if Snowflake is unavailable long enough for Kafka logs to roll off, data is lost. DoorDash addressed this by routing all Iguazu Flink jobs through S3 (Parquet) before Snowpipe ingestion. S3 is not subject to Kafka retention windows, so backfills remain possible regardless of Kafka log state.

OpenTelemetry-based Kafka multi-tenancy

Rather than creating separate topics per test scenario, DoorDash propagates OpenTelemetry context (tenant ID and route information) on every Kafka record header. Production and test traffic share the same topics and consumer group instances; consumers inspect the OTEL header to branch handling. This avoids topic proliferation while providing production-fidelity test conditions.
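
A minimal sketch of the pattern, assuming a tenant-id record header and hypothetical topic and group names; DoorDash derives the tenant from OpenTelemetry context rather than setting the header by hand.

```python
from confluent_kafka import Consumer, Producer

# Test traffic is tagged, not re-routed: it shares the production topic.
producer = Producer({"bootstrap.servers": "broker:9092"})
producer.produce(
    "order_events",
    value=b'{"order_id": "123"}',
    headers=[("tenant-id", b"doortest")],  # hypothetical header key
)
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "assignment-service",  # same consumer group as production
    "auto.offset.reset": "latest",
})
consumer.subscribe(["order_events"])

msg = consumer.poll(10.0)
if msg is not None and not msg.error():
    headers = dict(msg.headers() or [])
    if headers.get("tenant-id") == b"doortest":
        print("test tenant: stub external side effects")
    else:
        print("production tenant: normal handling")
```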

Protobuf-to-Avro transparent conversion

DoorDash maintains Protobuf as the canonical internal event format. Where downstream sinks require Avro, its serialization library converts on the fly. This allows the company to remain Protobuf-native without modifying producers when adding Avro-dependent sinks.
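
A conceptual Python stand-in for that conversion; DoorDash's serialization library is JVM-based and uses Avro's protobuf support, whereas this sketch goes through a dict with a hand-written Avro schema. Message and field names are hypothetical.

```python
import io

from fastavro import schemaless_writer
from google.protobuf.json_format import MessageToDict

# Hypothetical Avro schema mirroring the Protobuf message's fields.
ORDER_AVRO_SCHEMA = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "status", "type": "string"},
    ],
}

def protobuf_to_avro_bytes(message) -> bytes:
    # Producers stay Protobuf-native; conversion happens at serialization.
    record = MessageToDict(message, preserving_proto_field_name=True)
    buf = io.BytesIO()
    schemaless_writer(buf, ORDER_AVRO_SCHEMA, record)
    return buf.getvalue()
```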

Operating Kafka at scale

Deployment

All five Kafka clusters run on AWS, managed by the RTSP team. Flink jobs are deployed as standalone Kubernetes services via Helm charts. Infrastructure is managed with Terraform.

Topic governance

DoorDash went through three generations of topic provisioning tooling:

  1. Terraform / Atlantis (GitOps): Required infrastructure team review for every topic creation request. As the pace of new topics grew to roughly 100 per week, this created a significant support burden.
  2. Infra Service + Minions (2023): Replaced GitOps with an HTTP API (Infra Service) and a Cadence-backed orchestration service (Minions) that automates end-to-end Iguazu onboarding: creating Kafka topics, launching Flink jobs, creating Snowflake objects, and opening pull requests, with Slack notifications at each step. Configuration options exposed to users are deliberately restricted to capacity-related settings (retention, partition count) to prevent common misconfigurations. Onboarding time dropped by 95%, from multiple days to under one hour and typically to about 15 minutes (a request sketch follows this list).
  3. Kafka Self-Serve (2024): Extended Infra Service to cover the full resource lifecycle — topics, user accounts, and ACLs — with auto-approval rules for routine, non-sensitive changes such as standard ACL grants or topic resizes within pre-set bounds.
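
DoorDash has not published the Infra Service API schema, so the following request is purely illustrative of the API-first shape: only capacity-related settings appear in the payload, and the endpoint, fields, and token are assumptions.

```python
import requests

resp = requests.post(
    "https://infra-service.internal.example/v1/kafka/topics",  # hypothetical
    json={
        "name": "order_events_v2",
        "partitions": 32,            # capacity setting: exposed to users
        "retention_ms": 86_400_000,  # capacity setting: exposed to users
        # broker-level tuning (e.g. segment sizes) deliberately not exposed
    },
    headers={"Authorization": "Bearer <service-token>"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. a Minions workflow ID to track onboarding progress
```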

Authentication and access control

DoorDash uses SASL/SCRAM for per-service Kafka authentication and ACLs for authorisation, controlling which producers and consumer groups can access each topic. The 2024 Kafka Self-Serve platform manages user account creation and ACL assignment through the same API-first workflow as topic creation.
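
A hedged sketch of what that model looks like in client terms, using confluent-kafka-python's AdminClient: a per-service SCRAM identity plus a topic-scoped ACL. Principal, topic, and credentials are hypothetical; at DoorDash these operations flow through the Self-Serve API rather than direct AdminClient calls.

```python
from confluent_kafka.admin import (
    AclBinding, AclOperation, AclPermissionType, AdminClient,
    ResourcePatternType, ResourceType,
)

admin = AdminClient({
    "bootstrap.servers": "broker:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",   # per-service SASL/SCRAM auth
    "sasl.username": "platform-admin",   # hypothetical principal
    "sasl.password": "<secret>",
})

# Allow one service's consumers to read one topic.
allow_read = AclBinding(
    ResourceType.TOPIC, "order_events", ResourcePatternType.LITERAL,
    "User:assignment-service", "*",
    AclOperation.READ, AclPermissionType.ALLOW,
)

for binding, future in admin.create_acls([allow_read]).items():
    future.result()  # raises if the broker rejected the ACL change
```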

Monitoring

Chronosphere is DoorDash's operational metrics backend for Kafka. Topic readiness after provisioning is confirmed by polling Prometheus metrics through Chronosphere's API — the Minions orchestration workflow waits on this before marking onboarding complete. Application-level operational events (for example, mobile checkout errors) are also routed through Iguazu Flink jobs to Chronosphere for alerting.

Schema evolution

Protobuf's inherent backward and forward compatibility is the primary schema evolution mechanism. Schemas must pass CI/CD validation before merging; runtime enforcement is handled through Confluent Schema Registry. The Fabricator framework enforces backward- and forward-compatible evolution at the framework level.

Challenges and how DoorDash solved them

RabbitMQ reliability failures at peak load

Problem: RabbitMQ repeatedly went down under heavy order volume. Because order checkout and Dasher assignment ran through RabbitMQ and Celery, broker failures directly caused customer-facing outages. Recovery was slow and manual; observability was poor.

Solution: Replaced RabbitMQ with Kafka for asynchronous task processing using a gradual traffic cutover with no downtime. Kafka's durable log and horizontal scalability eliminated the outage pattern.

Outcome: Task-processing outages from broker failures ceased; the system could scale horizontally to handle peak load.

Source: Ashwin Kachhara, DoorDash Engineering Blog, September 2020.

Fragmented SQS and Kinesis pipelines with one-day warehouse latency

Problem: Separate SQS and Kinesis pipelines for the data warehouse, ML platform, and metrics backend used incompatible technologies and communication paradigms, producing warehouse data latency of up to one day and significant operational overhead from managing disparate systems.

Solution: Consolidated onto a single Kafka + Flink platform (Iguazu). Kafka serves as the unified pub/sub hub; Flink applies per-destination fan-out with data format transformations.

Outcome: Warehouse latency from one day to a few minutes; a single platform team operates the infrastructure.

Source: Allen Wang, DoorDash Engineering Blog, August 2022.

High broker CPU on null-keyed event streams

Problem: For high-volume topics without message keys, the default round-robin partition assignment produced small, frequent batches, generating excessive CPU load on brokers.

Solution: Configured Kafka's sticky partitioner and increased linger.ms to 50-100 ms to accumulate larger batches per partition before flushing.

Outcome: 30-40% reduction in Kafka broker CPU utilisation on affected topics.

Source: Allen Wang, August 2022 (via Shen Zhu engineering blog summary).

Gevent incompatibility with librdkafka in Python

Problem: A Python point-of-sale service used Gevent for async I/O via monkey-patching. Gevent cannot patch librdkafka (a C library), so native Kafka consumer usage blocked Gevent's event loop.

Solution: Ran the Kafka consumer loop inside a dedicated Gevent greenlet. Blocking I/O calls inside the consumer were replaced with Gevent-compatible equivalents.
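
A minimal sketch of the pattern, assuming confluent-kafka-python: poll librdkafka non-blockingly inside a dedicated greenlet and yield to Gevent between polls, so the C-level consumer never starves the event loop. Topic and group names are hypothetical.

```python
import gevent
from confluent_kafka import Consumer

def handle(msg):
    pass  # real handlers perform Gevent-compatible network I/O

def consume_loop():
    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "pos-service",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["pos_orders"])
    while True:
        msg = consumer.poll(0)   # non-blocking: a blocking poll cannot yield
        if msg is None:
            gevent.sleep(0.05)   # cooperative yield back to the event loop
        elif not msg.error():
            handle(msg)

worker = gevent.spawn(consume_loop)
worker.join()
```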

Outcome: The migrated service outperformed the previous Celery/Gevent worker under heavy I/O.

Source: Jessica Zhao and Boyang Wei, DoorDash Engineering Blog, February 2021.

Snowflake downtime risking data loss within Kafka retention windows

Problem: Kafka's finite log retention creates a data loss risk if Snowflake is unavailable long enough for topic logs to roll off.

Solution: All Iguazu Flink jobs write to S3 in Parquet format before Snowpipe ingestion. S3 serves as a durable buffer independent of Kafka retention; backfills are possible from S3 regardless of Kafka log state.

Source: Allen Wang, InfoQ / QCon Plus, June 2023.

Manual topic provisioning becoming a bottleneck

Problem: GitOps-based topic creation required infrastructure team review for every new topic. At approximately 100 new topics per week, this created a significant support burden and slowed engineering teams waiting for approvals.

Solution: Replaced GitOps with an HTTP API (Infra Service) backed by Minions/Cadence orchestration, with auto-approval for routine requests.

Outcome: Iguazu onboarding time reduced by 95%, from multiple days to under one hour.

Source: DoorDash Engineering Blog, December 2023.

Full tech stack

| Category | Technology | Role |
| --- | --- | --- |
| Message broker | Apache Kafka | Central pub/sub and event streaming backbone; 5 clusters, 2,500+ topics |
| Stream processing | Apache Flink | DataStream API for custom logic; Flink SQL for declarative feature pipelines |
| Event ingestion proxy | Confluent Kafka REST Proxy (self-hosted, open source) | HTTP-based event ingestion layer (Iguazu Kafka Proxy) for microservices and mobile clients |
| Schema registry | Confluent Schema Registry | Runtime schema lookup and enforcement for Protobuf and Avro events |
| Schema format (primary) | Protocol Buffers (Protobuf) | Primary event schema format; defined in a central shared Git repository, validated at CI/CD build time |
| Schema format (secondary) | Avro | Auto-converted from Protobuf by DoorDash's serialization library for sinks that require it |
| Intermediate storage | Amazon S3 | Durable buffer for Parquet-format event data between Flink jobs and Snowflake |
| Warehouse ingestion | Amazon SQS + Snowpipe | SQS notifications trigger Snowpipe to ingest from S3 to Snowflake |
| Data warehouse | Snowflake | Analytical data warehouse for business intelligence and Dasher monitoring |
| Feature store | Redis | Sink for Riviera/Fabricator real-time ML feature pipelines; serves inference services |
| OLAP / internal analytics | Apache Pinot | Real-time OLAP for ad campaign reporting and Risk Platform dashboards; fed from Kafka |
| Monitoring backend | Chronosphere | Operational metrics and alerting; receives real-time events from Iguazu; confirms Kafka topic readiness after provisioning |
| Container orchestration | Kubernetes | Runs Flink jobs (as standalone services) and the Kafka REST Proxy |
| Deployment | Helm | Deploys Flink jobs on Kubernetes |
| Infrastructure-as-code | Terraform | Infrastructure provisioning for Kafka and surrounding systems |
| Workflow orchestration | Cadence | Workflow engine backing the Minions orchestration service for Iguazu onboarding |
| Internal UI | Retool | Internal management UI for Iguazu pipeline configurations |
| Authentication | SASL/SCRAM | Per-service Kafka authentication |
| Authorisation | Kafka ACLs | Controls which producers and consumer groups can access each topic |
| Observability / context | OpenTelemetry (OTEL) | Context propagation for Kafka multi-tenancy; carries tenant and route metadata on each record header |
| Search index | Elasticsearch | Continuously updated via Kafka + Flink indexing pipeline |
| Ad state storage | CockroachDB | Stores budget-capped flags written by the Kafka/Flink advertising pipeline |
| Batch processing | Apache Spark | Large-scale data lake transformations (separate from the Kafka streaming path) |
| Batch orchestration | Apache Airflow | Orchestrates batch pipelines |
| Feature pipeline orchestration | Dagster | DAG construction and orchestration within the Fabricator feature engineering framework |
| SQL query engine | Trino | Unified SQL query layer over the data lake |

Key contributors

| Name | Title / team | Contribution |
| --- | --- | --- |
| Allen Wang | Tech Lead, Data Platform; founding member, Real-Time Streaming Platform | Architect of Iguazu; authored the primary August 2022 engineering blog post; presented at QCon San Francisco 2022 and QCon Plus 2022; previously built Netflix's Keystone Kafka pipeline; contributed rack-aware consumer partition assignment to Apache Kafka |
| Ashwin Kachhara | Software Engineer, Kotlin and Go Platform team | Led the 2019 RabbitMQ-to-Kafka migration; authored the September 2020 engineering blog post; presented at the Bay Area Apache Kafka meetup and a Confluent event |
| Jessica Zhao | Software Engineer | Co-authored the February 2021 post on making the Kafka consumer compatible with Gevent in Python |
| Boyang Wei | Software Engineer | Co-authored the February 2021 post on making the Kafka consumer compatible with Gevent in Python |
| Carlos Herrera | Software Engineer, Developer Platform (Infrastructure Engineering) | Lead author on the March 2024 Kafka multi-tenancy blog post |
| Seed Zeng | Software Engineer, Storage team | Co-authored the August 2024 Kafka Self-Serve blog post |
| Kane Du | Software Engineer, Storage Self-Serve platform | Co-authored the August 2024 Kafka Self-Serve blog post |
| Donovan Bai | Software Engineer, Storage team (stateful systems including Kafka) | Co-authored the August 2024 Kafka Self-Serve blog post |

Key takeaways for your own Kafka implementation

  • Use S3 as a durable layer between Kafka and your warehouse. Kafka retention is finite; if your downstream sink has an outage, you lose the ability to replay. Routing Flink output through S3 (Parquet) before Snowpipe decouples ingestion from Kafka retention windows and allows backfill regardless of Kafka log state.
  • Tune your producer for null-keyed high-volume topics. If you have event streams without natural message keys, round-robin partition assignment produces small, frequent batches and elevated broker CPU. Kafka's sticky partitioner combined with a linger.ms of 50-100 ms is a low-risk configuration change that can meaningfully reduce broker load.
  • Restrict what users can configure when self-serving topics. DoorDash's API-first governance deliberately exposes only capacity-related settings (retention, partition count) to internal users, hiding complex broker parameters. This design choice reduces misconfiguration without removing autonomy, and it scales better than a review-based GitOps model as topic volume grows.
  • Consider isolating Flink jobs per event type rather than sharing pipelines. Per-event isolation means a failure in one pipeline does not cascade to others. The trade-off is more Kubernetes deployments to manage; DoorDash addressed that with Helm and centralised orchestration through Minions.
  • If you run large-scale ML feature pipelines on Kafka, evaluate a declarative DSL layer. Riviera and Fabricator show that abstracting Flink SQL behind a YAML configuration significantly widens the pool of engineers who can create and maintain real-time feature pipelines, at the cost of framework complexity that the platform team must own.

Sources and further reading

Primary sources

  1. Ashwin Kachhara, "Eliminating Task Processing Outages by Replacing RabbitMQ with Apache Kafka Without Downtime," DoorDash Engineering Blog, September 2020.
  2. Jessica Zhao and Boyang Wei, "How to Make Kafka Consumer Compatible with Gevent in Python," DoorDash Engineering Blog, February 2021.
  3. DoorDash Engineering, "Building A Declarative Real-Time Feature Engineering Framework," DoorDash Engineering Blog, March 2021.
  4. Satish, Danial, and Siddharth, "Building Faster Indexing with Apache Kafka and Elasticsearch," DoorDash Engineering Blog, July 2021.
  5. Allen Wang, "Building Scalable Real-Time Event Processing with Kafka and Flink," DoorDash Engineering Blog, August 2022.
  6. Allen Wang, "From Zero to a Hundred Billion: Building Scalable Real-Time Event Processing at DoorDash," QCon San Francisco, October 2022.
  7. Allen Wang, "Building Scalable Real Time Event Processing with Kafka and Flink," InfoQ / QCon Plus (recorded), June 2023.
  8. DoorDash Engineering (RTSP team), "API-First Approach to Kafka Topic Creation," DoorDash Engineering Blog, December 2023.
  9. DoorDash Engineering, "Introducing DoorDash's In-House Search Engine," DoorDash Engineering Blog, February 2024.
  10. Carlos Herrera, Amit Gud, and Yunji Zhong, "Setting Up Kafka Multi-Tenancy," DoorDash Engineering Blog, March 2024.
  11. Seed Zeng, Kane Du, and Donovan Bai, "DoorDash Empowers Engineers with Kafka Self-Serve," DoorDash Engineering Blog, August 2024.
  12. DoorDash Engineering, "Introducing Fabricator: A Declarative Feature Engineering Framework," DoorDash Engineering Blog, April 2025.

If you are running Kafka in production and want visibility into consumer lag, topic health, and cluster performance, Kpow gives you a UI and monitoring layer that works with any Kafka deployment. You can try it with a free 30-day trial and connect it to any cluster in minutes via Docker, Helm, or JAR.