
How Robinhood uses Apache Kafka in production
Robinhood's Kafka platform handles 2.2 million messages per second across 6 clusters, 160 brokers, and 90,000 partitions, underpinning every step of its trading lifecycle from order routing to fraud detection to data lake ingestion. What makes its story particularly instructive is not the scale alone but the engineering choices the team made to reach it: a custom sidecar proxy to solve Python's connection fan-in problem, a Postgres-backed dead letter queue that trades Kafka simplicity for richer replay tooling, and a 2025 decision to replace self-managed Kafka with WarpStream for logging workloads, cutting total costs by 45%.
Across fintech, Kafka is common. At Robinhood the challenge was operating a large, heterogeneous fleet of Python-based Kafka applications at financial-grade reliability without overwhelming a small platform team.
Company overview
Robinhood is a US-based brokerage and financial services platform offering commission-free trading in equities, ETFs, options, and cryptocurrency, along with a debit card and cash management product. As of 2025, the platform serves more than 14 million monthly active users.
Kafka has been part of Robinhood's infrastructure since at least 2015, when engineer Jaren Glover joined and was assigned ownership of stabilising an existing Kafka, ZooKeeper, and Elasticsearch pipeline. Over the following four years, Glover scaled that infrastructure from 100,000 to 10 million users, presenting what he learned at SREcon Americas and Kafka Summit in 2019.
By 2017, the team had begun developing Faust, an open-source Python stream-processing library built on top of Apache Kafka. Faust was open-sourced in July 2018 and reflected how central Kafka had become to Robinhood's product engineering.
Robinhood's Kafka use cases
The Streaming Platform team describes Kafka as involved in "almost every mission-critical step of Robinhood's functionality." There are 14 confirmed production use cases across product, data, and infrastructure teams.
Equities order routing and trade execution. Stock purchase events are written to both Kafka and Postgres. Multiple downstream services consume those events and process them using Kafka Streams with exactly-once semantics. The trade execution confirmation SLA is under one second, with 5-10 Kafka hops per trade depending on product type.
Crypto trading. The crypto backend uses shard-aware Kafka consumers. Each shard only processes messages belonging to users assigned to it, limiting the blast radius of any single consumer failure.
Self-clearing. Robinhood's in-house clearing operations are listed as a distinct Kafka use case by the Streaming Platform team, though detailed architecture has not been published.
Market data distribution. External market data feeds are ingested via Kafka and consumed by internal Faust stream-processing applications.
Push notifications and messaging. Named explicitly as a production use case in the 2021 Kafka Summit Americas talk by Chandra Kuchi and Nick Dellamaggiore.
Fraud detection and risk. Faust processes risk signals in real time. Apache Flink handles stateful stream processing for fraud detection and shareholder position tracking, with more than 100 concurrent Flink deployments running in production.
Order execution quality monitoring and ad tracking. Both are listed as production Faust use cases, alongside newsfeed aggregation.
CDC into the data lake. Debezium captures WAL changes from thousands of Postgres RDS tables, encodes them as Avro records via Confluent Schema Registry, and writes them to Kafka. Apache Hudi Deltastreamer consumes those records with exactly-once semantics and writes into S3. Data freshness for core datasets improved from 24 hours to 5-15 minutes.
Card transaction authorisation (backup path). The backup payment authorisation service subscribes to asynchronous Kafka streams to cache each cardholder's latest account state. The service must produce a decision within a 2-second window.
Investor eligibility (Say Technologies acquisition). Apache Flink determines investor eligibility with approximately 2-minute end-to-end latency.
Log analytics and observability. Application and infrastructure logs flow through Kafka (and since 2025, WarpStream) via the Vector log shipper to Humio for search and storage.
Scale and throughput
The headline figures quoted throughout this article (2.2 million messages per second across 6 clusters, 160 brokers, and 90,000 partitions) come from Chandra Kuchi and Nick Dellamaggiore's Kafka Summit Americas 2021 presentation and reflect the state of the platform at that time. More recent figures come from a 2025 WarpStream case study and a September 2024 Diginomica interview with Kuchi.
Robinhood's Kafka architecture
Cluster topology and hosting
Robinhood originally ran self-managed Confluent Platform on AWS EC2, provisioned with Saltstack. Brokers were spread across three AWS availability zones per cluster.
By 2024, the team had migrated all clusters to Kubernetes. Rather than adopting an existing Kafka Kubernetes operator, Robinhood built a custom in-house operator to manage the migration and ongoing operations. Mun Yong Jang and Nathan Moderwell presented details of this migration at Confluent Current 2024. The three-AZ layout was preserved in the new Kubernetes deployment.
Logging workloads were separated from the main Kafka clusters in 2025 and migrated to WarpStream, a diskless, S3-backed Kafka-compatible broker. WarpStream runs as three independent Kubernetes deployments (one per AZ), each with its own horizontal pod autoscaler, so logging capacity tracks market-hours traffic without manual intervention. Ethan Chen and Renan Rueda from Robinhood presented this architecture at Confluent Current New Orleans 2025.
Cloud provider: AWS throughout. S3 serves as primary data lake storage and as WarpStream's backing store. Postgres runs on AWS RDS.
Schema management
Confluent Schema Registry is used for Avro encoding on the CDC pipeline. Debezium captures Postgres WAL and writes Avro-encoded change records to Kafka via the Schema Registry before downstream consumers read them into the data lake.
Producer architecture
All Python and Golang Kafka clients at Robinhood use kafkahood, an internal wrapper library built on librdkafka. kafkahood codifies standard defaults, client-side metrics collection, dead letter queue support, message serialisation, and feature-flag-based configuration rollouts. This means the Streaming Platform team can push new client defaults across the entire application fleet by flipping a feature flag rather than waiting for individual teams to update their library versions.
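kafkahood itself is not open source, but the shape of such a wrapper is straightforward to sketch. The minimal example below is an assumption-laden illustration: the class name, flag lookup, and default values are all hypothetical, and only the confluent-kafka (librdkafka) API is real. It shows how standard defaults, feature-flag overrides, and a client-side metrics hook can be layered on top of a librdkafka producer.

```python
# Hypothetical sketch of a kafkahood-style producer wrapper. Class names,
# defaults, and the flag lookup are illustrative; only the confluent-kafka
# (librdkafka) API is real.
from confluent_kafka import Producer

# Defaults the platform team wants every service to inherit.
BASE_CONFIG = {
    "acks": "all",
    "enable.idempotence": True,
    "compression.type": "zstd",
    "linger.ms": 5,
}

def feature_flag_overrides(service_name: str) -> dict:
    """Stand-in for a feature-flag lookup; a real system would query the
    flag service and return per-service config experiments."""
    return {}  # e.g. {"linger.ms": 20} for services in a rollout cohort

class PlatformProducer:
    def __init__(self, service_name: str, bootstrap_servers: str):
        config = {"bootstrap.servers": bootstrap_servers, **BASE_CONFIG}
        config.update(feature_flag_overrides(service_name))
        self._producer = Producer(config)

    def send(self, topic: str, key: bytes, value: bytes) -> None:
        # The delivery callback is where client-side metrics would be recorded.
        self._producer.produce(topic, key=key, value=value,
                               on_delivery=self._record_metrics)
        self._producer.poll(0)

    def _record_metrics(self, err, msg):
        if err is not None:
            print(f"delivery failed: {err}")  # real code would emit a metric
```

The point of the pattern is that a new default only has to change in one place (the base config or the flag service), and every service picks it up on its next deploy or flag evaluation.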
Producer connections from Python services are routed through kafkaproxy (described in detail in the Special Techniques section below).
Consumer architecture
Consumer lifecycle management is handled by the Consumer Proxy (kafkaproxy v2), a Java-based Kubernetes sidecar that runs alongside each application container. The proxy handles all Kafka consumer group coordination, rebalancing, commit management, and timeout management. It relays messages to the application container over gRPC. This design means a consumer group rebalance triggered by a scaling event does not affect the application container's business logic, and a failure in the application container leaves the offset uncommitted, so the message is redelivered rather than silently dropped.
kafkahood also provides a Postgres-backed dead letter queue for failed consumer events. Failed messages are inserted into Postgres rather than a secondary Kafka topic, and a CLI allows engineers to inspect, query, and fix records before republishing them to a retry topic.
Consumer lag, throughput, and error rates are surfaced per application as standard defaults through kafkahood's client-side metrics, without each team needing to add their own instrumentation.
Stream processing
Faust (Python). Robinhood open-sourced Faust in 2018 as a Python port of the Kafka Streams API. In production, Faust applications handle risk signal processing, order quality monitoring, market data feed processing, newsfeed aggregation, ad tracking, event logging, and crypto feed processing. Faust uses RocksDB as a local embedded state store and aiokafka (which Robinhood also maintains as a fork) as its async Kafka client.
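For readers unfamiliar with Faust, the following minimal agent shows the programming model. The application, topic, and record names are illustrative, but the API used (faust.App, faust.Record, app.topic, app.Table, @app.agent) is Faust's public interface, and the RocksDB store option mirrors the production setup described above.

```python
# Minimal Faust agent in the style described above; the app id, topic, and
# record names are hypothetical.
import faust

app = faust.App(
    "risk-signals",                 # hypothetical application id
    broker="kafka://localhost:9092",
    store="rocksdb://",             # RocksDB-backed local state, as in production
)

class OrderEvent(faust.Record, serializer="json"):
    user_id: str
    symbol: str
    notional: float

orders = app.topic("order-events", value_type=OrderEvent)

# A Table is changelog-backed local state, analogous to a Kafka Streams KTable.
notional_by_user = app.Table("notional-by-user", default=float)

@app.agent(orders)
async def track_risk(stream):
    # Repartition by user so each worker owns a consistent slice of state.
    async for event in stream.group_by(OrderEvent.user_id):
        notional_by_user[event.user_id] += event.notional

if __name__ == "__main__":
    app.main()
```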
Apache Flink. Stateful stream processing for fraud detection, shareholder position tracking, investor eligibility determination, and data ingestion is handled by Flink. Robinhood runs more than 100 concurrent Flink deployments, managed by the Apache flink-kubernetes-operator. Tony Chen from the Streaming Platform team presented Robinhood's Flink deployment practices, including a custom Checkpoint-Fetcher tool for automating recovery, at Confluent Current 2024.
Kafka Connect ecosystem
Debezium is used as the primary Kafka Connect source connector, capturing CDC events from Postgres RDS. It encodes changes as Avro records using Confluent Schema Registry and writes them to Kafka for consumption by Apache Hudi Deltastreamer on the data lake side. Secor (originally open-sourced by Pinterest) archives Kafka streams to S3 for the batch path of the data lake.
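As an illustration, a Debezium Postgres source connector for this kind of pipeline might be registered against the Kafka Connect REST API roughly as follows. The connector name, hostnames, credentials, and table list are placeholders, and the config keys follow current Debezium naming (older releases used database.server.name rather than topic.prefix); nothing here is taken from Robinhood's actual configuration.

```python
# Sketch of registering a Debezium Postgres source connector via the Kafka
# Connect REST API. Names, hosts, and credentials are placeholders; the
# config keys are standard Debezium / Connect options.
import json
import requests

connector = {
    "name": "accounts-db-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "accounts-db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "changeme",
        "database.dbname": "accounts",
        "topic.prefix": "cdc.accounts",
        "table.include.list": "public.orders,public.positions",
        # Avro encoding through Confluent Schema Registry, as in the pipeline above.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
    timeout=10,
)
resp.raise_for_status()
```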
Special techniques and engineering innovations
Kafkaproxy as a sidecar for connection aggregation. Python's process-based concurrency model creates a problem at Kafka scale: each gunicorn worker process opens its own set of TCP connections to brokers, which at Robinhood's cluster sizes produced more than 100,000 inbound connections on the largest cluster. The first-generation kafkaproxy was a Rust-based Kubernetes sidecar that ran alongside each application pod and aggregated all producer and consumer connections from that pod's worker pool into a single connection set to the brokers. This cut broker-side resource pressure from Python services substantially.
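From the application's point of view the pattern is simple: each gunicorn worker produces to a Kafka-protocol endpoint on localhost instead of dialling the brokers directly, and the sidecar multiplexes those connections. The snippet below illustrates only that idea; the sidecar address is hypothetical, and the real kafkaproxy was a Rust implementation with its own internals.

```python
# Illustration of the connection-aggregation pattern, not Robinhood's actual
# proxy: each worker process produces to a Kafka-protocol sidecar on
# localhost, so only the sidecar holds connections to the real brokers.
from confluent_kafka import Producer

# Without the sidecar, every worker dials the brokers directly, so inbound
# broker connections scale with (pods x workers per pod x brokers):
# direct = Producer({"bootstrap.servers": "broker-1:9092,broker-2:9092"})

# With the sidecar, workers talk only to the proxy in the same pod, and the
# proxy maintains one shared connection set to the cluster.
proxied = Producer({"bootstrap.servers": "localhost:9092"})  # sidecar address is hypothetical

proxied.produce("orders", key=b"user-123", value=b'{"side": "buy"}')
proxied.flush()
```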
Consumer Proxy with gRPC message relay (v2). The second generation of kafkaproxy, presented by Tony Chen and Mun Yong Jang at Confluent Current 2023, replaced the Rust implementation with a Java-based sidecar using the standard Apache Kafka client library. The v2 proxy takes full ownership of the Kafka consumer lifecycle: it handles consumer group rebalancing, commit management, and timeout management, then relays raw messages to the application container over gRPC. This decoupling means the application team no longer needs to reason about consumer group behaviour, and the platform team maintains a single Java consumer implementation rather than N language-specific clients.
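The proxy itself is Java and its source is not public, but the division of responsibility it enforces can be sketched in Python. In the sketch below the "application handler" is a plain callable standing in for the gRPC relay, and the topic and group names are made up; the point is that group membership, polling, and commits live entirely on the proxy side, and an offset is committed only after the handler succeeds.

```python
# Python sketch of the responsibility split the Consumer Proxy enforces.
# The real proxy is a Java sidecar relaying over gRPC; here the application
# handler is a plain callable so the sketch stays self-contained.
from confluent_kafka import Consumer

def handle(message_value: bytes) -> None:
    """Application business logic; knows nothing about consumer groups."""
    print("processing", message_value)

def run_proxy_loop(handler, topic: str = "orders",
                   group_id: str = "orders-consumer") -> None:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,
        "enable.auto.commit": False,   # the proxy owns commit management
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])        # the proxy owns group membership and rebalances
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            try:
                handler(msg.value())   # relay to the application (gRPC in the real system)
                consumer.commit(message=msg, asynchronous=False)  # commit only after success
            except Exception:
                # A handler failure leaves the offset uncommitted, so the
                # message is redelivered instead of being silently lost.
                pass
    finally:
        consumer.close()

if __name__ == "__main__":
    run_proxy_loop(handle)
```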
Shard-aware Kafka message routing. During the 2019-2020 period when brokerage traffic grew 7x, Robinhood introduced application-level sharding over Postgres. Each shard's Kafka consumers filter messages on shared topics to only process events belonging to users assigned to that shard. For high-throughput flows where the per-message filtering overhead became significant, Robinhood created shard-specific Kafka topics to eliminate the filtering cost entirely.
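A hedged sketch of the shard-aware consumer pattern on a shared topic looks something like the following; the shard mapping, topic name, and shard count are assumptions, not Robinhood's implementation.

```python
# Sketch of shard-aware consumption on a shared topic; the shard mapping,
# topic name, and shard count are hypothetical.
import zlib
from confluent_kafka import Consumer

MY_SHARD = 3
NUM_SHARDS = 8

def shard_for_user(user_id: str) -> int:
    """Stand-in for the real user-to-shard mapping (e.g. a lookup service)."""
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

def process(msg) -> None:
    print("shard-local processing of", msg.key())

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": f"brokerage-shard-{MY_SHARD}",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order-events"])   # shared topic; shard-specific topics remove this filter

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.key() is None:
        continue
    if shard_for_user(msg.key().decode()) != MY_SHARD:
        continue                       # drop events that belong to other shards
    process(msg)
```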
CDC with exactly-once semantics into the data lake. The CDC pipeline chains Debezium (Postgres WAL capture), Confluent Schema Registry (Avro encoding), Kafka (transport), and Apache Hudi Deltastreamer (exactly-once writes into S3). Copy-on-write mode is used for raw tables to optimise columnar read performance. This pipeline reduced data freshness from 24 hours to 5-15 minutes across thousands of Postgres tables.
Postgres-backed dead letter queue. Rather than routing failed consumer events to a secondary Kafka topic, Robinhood writes them to a Postgres database. A client application with a CLI allows engineers to inspect and query failed records, apply fixes, and republish messages to a Kafka retry topic for safe reprocessing. Sreeram Ramji and Wenlong Xiong, who presented this pattern at Confluent Current 2022, noted that cross-team schema contracts make silent consumer failures particularly hazardous at Robinhood, which drove the choice of a more inspectable store over a native Kafka DLQ topic.
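The table layout and CLI are internal to Robinhood, but the two halves of the pattern, persisting a failed event and replaying a repaired one, can be sketched as follows. The table name, columns, connection string, and retry topic are assumptions.

```python
# Hedged sketch of a Postgres-backed DLQ in the style described above; the
# table name, columns, DSN, and retry topic are assumptions.
import psycopg2
from confluent_kafka import Producer

conn = psycopg2.connect("dbname=dlq user=app")  # hypothetical connection string

def record_failure(topic, partition, offset, key, value, error):
    """Consumer error path: persist the failed event for later inspection."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """INSERT INTO dead_letter_events
                   (source_topic, source_partition, source_offset, key, value, error)
               VALUES (%s, %s, %s, %s, %s, %s)""",
            (topic, partition, offset, key, value, error),
        )

def republish(event_id, retry_topic="orders-retry"):
    """CLI-style repair path: read a fixed record back out and replay it."""
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT key, value FROM dead_letter_events WHERE id = %s", (event_id,)
        )
        key, value = cur.fetchone()
    producer.produce(retry_topic, key=key, value=value)
    producer.flush()
```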
WarpStream for elastic logging clusters. Log throughput at Robinhood is highly cyclical: it peaks during US market hours and drops significantly in evenings and on weekends. Running fixed-capacity Kafka clusters for this workload was costly and wasteful. By migrating logging to WarpStream, the team eliminated inter-AZ data transfer costs entirely (WarpStream's diskless architecture does not move data between availability zones) and gained per-AZ horizontal pod autoscaling. The result was a 45% reduction in total logging infrastructure cost, with compute spend down 36%, storage down 13%, and inter-AZ networking down 99%. The trade-off was an increase in end-to-end latency from 0.2 to 0.45 seconds, which was acceptable for the logging workload.
Feature-flag-based client library rollouts. kafkahood uses feature flags to control adoption of new client configuration defaults across the application fleet. A new default setting can be tested on a subset of services before rolling out broadly, with fast rollback if it causes latency or throughput regressions.
Operating Kafka at scale
Deployment model. Robinhood has self-managed its Kafka infrastructure throughout its history, initially on EC2 with Saltstack and since 2024 on Kubernetes with a custom in-house operator. The WarpStream deployment for logging is also self-hosted (Kubernetes) rather than using WarpStream Cloud.
Observability. Client-side metrics collection is built into kafkahood as a standard default for all Python and Golang services. This provides per-application consumer lag, throughput, and error rates without requiring each team to instrument independently. The Vector log shipper routes application and infrastructure logs to Kafka (or WarpStream) for forwarding to Humio.
WarpStream Agent Groups for multi-tenant isolation. Within the logging WarpStream cluster, Agent Groups isolate traffic classes: VIP clients, SSL/TLS connections, and plaintext connections each run in separate groups. This prevents a high-throughput producer from saturating broker resources shared with other clients.
Flink checkpoint management. The Apache flink-kubernetes-operator had stability issues when reconciling more than 100 concurrent Flink deployments simultaneously. Robinhood tuned the operator's concurrent reconciliation configuration and built a custom Checkpoint-Fetcher tool that automates locating the latest valid Flink checkpoint and updating the initialSavepointPath field in the Kubernetes manifest. This reduces human error during recovery and rollback operations. Automatic rollback to the last healthy application version via health checks is also implemented.
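Robinhood has not published the Checkpoint-Fetcher code, but a tool in that spirit might look roughly like the sketch below: list checkpoint objects in S3, pick the newest, and patch spec.job.initialSavepointPath in the FlinkDeployment manifest. The bucket name, prefix, and checkpoint layout are assumptions; only the boto3 and PyYAML calls are standard APIs.

```python
# Hypothetical sketch of a Checkpoint-Fetcher-style helper. Bucket names,
# prefixes, and file layout are assumptions; initialSavepointPath is the
# real flink-kubernetes-operator field.
import boto3
import yaml

def latest_checkpoint(bucket: str, prefix: str) -> str:
    """Return the S3 URI of the directory holding the newest checkpoint object."""
    s3 = boto3.client("s3")
    newest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    if newest is None:
        raise RuntimeError(f"no checkpoints under s3://{bucket}/{prefix}")
    return f"s3://{bucket}/{newest['Key'].rsplit('/', 1)[0]}"

def patch_manifest(path: str, checkpoint_uri: str) -> None:
    """Rewrite spec.job.initialSavepointPath so the next deploy restores from it."""
    with open(path) as f:
        manifest = yaml.safe_load(f)
    manifest["spec"]["job"]["initialSavepointPath"] = checkpoint_uri
    with open(path, "w") as f:
        yaml.safe_dump(manifest, f, sort_keys=False)

if __name__ == "__main__":
    uri = latest_checkpoint("flink-state", "checkpoints/fraud-detection/")
    patch_manifest("fraud-detection-flinkdeployment.yaml", uri)
```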
Confluent Platform usage. Chandra Kuchi confirmed in a September 2024 interview that Robinhood uses Confluent Platform. The current status of Confluent's commercial components post-Kubernetes migration has not been detailed in public sources.
Note: Robinhood's SRE team has not published details of Kafka alerting thresholds, SLO definitions, quota management, or CI/CD pipelines for topic provisioning in any source reviewed for this article.
Challenges and how they solved them
Python connection fan-in at broker scale. Robinhood's Python services use the gunicorn process-based worker model. Each worker process opens independent Kafka connections, and at hundreds of workers per cluster the broker inbound connection count exceeded 100,000 on the largest cluster, creating resource exhaustion and connection management overhead. The solution was kafkaproxy: a Kubernetes sidecar that aggregates all connections from a pod's worker pool into a shared set, dramatically reducing broker-side connection counts. A second generation of the proxy (in Java) later extended this to full consumer lifecycle management via gRPC.
Divergent Kafka client libraries across languages. Separate Kafka client wrappers for Python and Golang were drifting apart in defaults, error handling, and observability. kafkahood standardised client behaviour across languages with a single librdkafka-based library. The v2 consumer proxy went further, centralising the consumer group logic in one Java implementation that serves both Golang and Python application containers via gRPC.
Failed consumer events with no safe retry path. Schema mismatches and deserialisation errors cause consumer events to fail permanently unless there is a way to inspect and fix the message before replaying it. A secondary Kafka DLQ topic does not provide sufficient tooling for this. Robinhood built a Postgres-backed DLQ with a CLI for inspection and targeted republishing to a retry topic.
Data lake freshness of 24 hours. Batch ingestion from Postgres into S3 meant analytics consumers worked with day-old data. The CDC pipeline (Debezium, Confluent Schema Registry, Kafka, Apache Hudi Deltastreamer with exactly-once writes) reduced freshness to 5-15 minutes across thousands of Postgres tables.
7x brokerage traffic growth in six months. Requests grew from 100k to 750k per second between December 2019 and June 2020, overwhelming a single-shard brokerage system. Application-level sharding distributed load across Postgres shards, each with its own Kafka consumers. High-throughput flows received shard-specific Kafka topics to eliminate per-message filtering overhead.
Logging infrastructure cost and elasticity. Self-managed Kafka clusters sized for peak market-hours throughput sat largely idle in evenings and weekends, and incurred inter-AZ data transfer fees continuously. Migrating to WarpStream eliminated the inter-AZ networking cost entirely (99% reduction) and allowed per-AZ autoscaling to track actual throughput. Total cost fell 45%.
Managing 100+ concurrent Flink deployments. The Apache flink-kubernetes-operator struggled with concurrent reconciliation at this scale, and manual checkpoint management introduced human error during recovery. Robinhood tuned the operator's concurrency settings, built the Checkpoint-Fetcher tool to automate manifest updates, and implemented health-check-based automatic rollback.
Full tech stack
Messaging and transport: Apache Kafka (self-managed on Kubernetes), WarpStream (logging), Confluent Schema Registry, Avro.
Stream processing: Faust (with aiokafka and RocksDB), Apache Flink (flink-kubernetes-operator), Kafka Streams.
Ingestion and data lake: Debezium, Apache Hudi Deltastreamer, Secor, S3.
Internal tooling: kafkahood (librdkafka-based Python/Golang client library), kafkaproxy and the Consumer Proxy (gRPC sidecar), Postgres-backed dead letter queue, Checkpoint-Fetcher.
Infrastructure and observability: AWS (EC2, RDS Postgres, S3), Kubernetes with a custom in-house Kafka operator, Saltstack (historical), Vector, Humio.
Key contributors
Jaren Glover (early Kafka infrastructure; SREcon Americas and Kafka Summit, 2019), Chandra Kuchi and Nick Dellamaggiore (Kafka Summit Americas 2021), Sreeram Ramji and Wenlong Xiong (Postgres-backed DLQ; Confluent Current 2022), Tony Chen and Mun Yong Jang (Consumer Proxy; Confluent Current 2023), Mun Yong Jang and Nathan Moderwell (Kubernetes migration; Confluent Current 2024), Tony Chen (Flink operations; Confluent Current 2024), and Ethan Chen and Renan Rueda (WarpStream logging; Confluent Current New Orleans 2025).
Key takeaways for your own Kafka implementation
Connection fan-in is a real problem for process-based concurrency models. If you run Python services with gunicorn or uWSGI and connect them directly to Kafka, each worker process opens its own connections. At a few hundred workers per cluster this adds up to tens of thousands of inbound connections on your brokers. A sidecar proxy that aggregates connections from a single pod is one approach; Robinhood also considered async client libraries but chose the proxy model for the isolation benefits it provided.
Decoupling consumer lifecycle from application code simplifies failure handling. Robinhood's Consumer Proxy moves all consumer group logic, rebalancing, commits, and timeouts into a separate sidecar. Your application container receives messages over gRPC only after they have been successfully consumed. This means application failures do not stall consumer groups, and consumer group rebalances triggered by scaling do not introduce latency into your application.
A Postgres-backed dead letter queue gives you more repair tooling than a DLQ topic. A secondary Kafka topic stores failed messages but does not give you a way to query them by failure reason, apply a fix, or selectively replay a subset. If your teams share schema contracts and schema mismatches are a real failure mode, the ability to inspect and fix records before replay is worth the additional infrastructure.
Separating workloads by their latency and cost profiles pays off at scale. Robinhood's logging traffic and its trading traffic have very different latency requirements: 0.45 seconds is acceptable for logs but not for order routing. Migrating logging to WarpStream (with higher latency but lower cost) while keeping trading on self-managed Kafka let the team optimise each cluster for its actual workload, producing a 45% cost reduction on the logging side without touching the trading path.
A standardised internal Kafka client library is worth the investment early. kafkahood encodes sane defaults, observability, DLQ support, and feature-flag-controlled rollouts for all services. Teams get correct client behaviour without needing to understand all the underlying librdkafka configuration options. The payoff is proportional to the number of services you run: at 14+ million users and hundreds of services, a single misconfigured default can cause widespread consumer lag or data loss.
Sources and further reading
Jaren Glover's SREcon Americas and Kafka Summit 2019 talks on scaling Robinhood's early Kafka infrastructure.
Chandra Kuchi and Nick Dellamaggiore's Kafka Summit Americas 2021 presentation on Robinhood's Kafka platform.
Sreeram Ramji and Wenlong Xiong's Confluent Current 2022 talk on the Postgres-backed dead letter queue.
Tony Chen and Mun Yong Jang's Confluent Current 2023 talk on the Consumer Proxy.
Mun Yong Jang and Nathan Moderwell's Confluent Current 2024 talk on the Kubernetes migration.
Tony Chen's Confluent Current 2024 talk on operating Flink deployments.
Ethan Chen and Renan Rueda's Confluent Current New Orleans 2025 talk on WarpStream for logging.
The 2025 WarpStream case study on Robinhood's logging migration.
The September 2024 Diginomica interview with Chandra Kuchi.
Faust, the Python stream-processing library Robinhood open-sourced in 2018.
If you are running Kafka in production and want visibility into consumer lag, throughput, and cluster health across your brokers, give Kpow a try with a free 30-day trial. You can connect it to any Kafka cluster in minutes and deploy it via Docker, Helm, or JAR.