Abstract digital artwork featuring smooth, overlapping curved shapes in shades of green and blue on a black background.

How Robinhood uses Apache Kafka in production

May 16th, 2026

xx min read

Robinhood's Kafka platform handles 2.2 million messages per second across 6 clusters, 160 brokers, and 90,000 partitions, underpinning every step of its trading lifecycle from order routing to fraud detection to data lake ingestion. What makes its story particularly instructive is not the scale alone but the engineering choices the team made to reach it: a custom sidecar proxy to solve Python's connection fan-in problem, a Postgres-backed dead letter queue that trades Kafka simplicity for richer replay tooling, and a 2025 decision to replace self-managed Kafka with WarpStream for logging workloads, cutting total costs by 45%.

Across fintech, Kafka is common. At Robinhood the challenge was operating a large, heterogeneous fleet of Python-based Kafka applications at financial-grade reliability without overwhelming a small platform team.

Company overview

Robinhood is a US-based brokerage and financial services platform offering commission-free trading in equities, ETFs, options, and cryptocurrency, along with a debit card and cash management product. As of 2025, the platform serves more than 14 million monthly active users.

Kafka has been part of Robinhood's infrastructure since at least 2015, when engineer Jaren Glover joined and was assigned ownership of stabilising an existing Kafka, ZooKeeper, and Elasticsearch pipeline. Over the following four years, Glover scaled that infrastructure from 100,000 to 10 million users, presenting what he learned at SREcon Americas and Kafka Summit in 2019.

By 2017, the team had begun developing Faust, an open-source Python stream-processing library built on top of Apache Kafka. Faust was open-sourced in July 2018 and reflected how central Kafka had become to Robinhood's product engineering.

Date	Milestone
2015	Kafka already in production; Jaren Glover takes ownership of stabilising the pipeline
2015-2019	Platform scales from 100,000 to 10 million users
2017-2018	Faust Python stream-processing library developed and open-sourced
2019	Data lake pipeline published: Kafka as primary databus, Secor archiving to S3
2019-2020	Brokerage system grows from 100k to 750k peak requests/second in six months; shard-aware Kafka filtering introduced
2021	Kafka Summit Americas: 6 clusters, 160 brokers, 1,400 topics, 90,000 partitions, 2.2M msg/sec confirmed publicly
2022	CDC pipeline using Debezium and Apache Hudi reduces data lake freshness from 24 hours to 5-15 minutes
2023	Consumer Proxy (kafkaproxy v2, Java + gRPC) presented at Confluent Current
2024	Kafka clusters migrated from EC2 to Kubernetes using a custom in-house operator
2025	Logging workloads migrated from self-managed Kafka to WarpStream; 45% total cost reduction

Robinhood's Kafka use cases

The Streaming Platform team describes Kafka as involved in "almost every mission-critical step of Robinhood's functionality." There are 14 confirmed production use cases across product, data, and infrastructure teams.

Equities order routing and trade execution. Stock purchase events are written to both Kafka and Postgres. Multiple downstream services consume those events and process them using Kafka Streams with exactly-once semantics. The trade execution confirmation SLA is under one second, with 5-10 Kafka hops per trade depending on product type.

Crypto trading. The crypto backend uses shard-aware Kafka consumers. Each shard only processes messages belonging to users assigned to it, limiting the blast radius of any single consumer failure.

Self-clearing. Robinhood's in-house clearing operations are listed as a distinct Kafka use case by the Streaming Platform team, though detailed architecture has not been published.

Market data distribution. External market data feeds are ingested via Kafka and consumed by internal Faust stream-processing applications.

Push notifications and messaging. Named explicitly as a production use case in the 2021 Kafka Summit Americas talk by Chandra Kuchi and Nick Dellamaggiore.

Fraud detection and risk. Faust processes risk signals in real time. Apache Flink handles stateful stream processing for fraud detection and shareholder position tracking, with more than 100 concurrent Flink deployments running in production.

Order execution quality monitoring and ad tracking. Both are listed as production Faust use cases, alongside newsfeed aggregation.

CDC into the data lake. Debezium captures WAL changes from thousands of Postgres RDS tables, encodes them as Avro records via Confluent Schema Registry, and writes them to Kafka. Apache Hudi Deltastreamer consumes those records with exactly-once semantics and writes into S3. Data freshness for core datasets improved from 24 hours to 5-15 minutes.

Card transaction authorisation (backup path). The backup payment authorisation service subscribes to asynchronous Kafka streams to cache each cardholder's latest account state. The service must produce a decision within a 2-second window.

Investor eligibility (Say Technologies acquisition). Apache Flink determines investor eligibility with approximately 2-minute end-to-end latency.

Log analytics and observability. Application and infrastructure logs flow through Kafka (and since 2025, WarpStream) via the Vector log shipper to Humio for search and storage.

Scale and throughput

The figures below come from Chandra Kuchi and Nick Dellamaggiore's Kafka Summit Americas 2021 presentation, reflecting the state of the platform at that time, with more recent additions from a 2025 WarpStream case study and a September 2024 Diginomica interview with Kuchi.

Metric	Value	Source / date
Kafka clusters	6	Kafka Summit Americas 2021
Brokers (total)	160	Kafka Summit Americas 2021
Topics	1,400	Kafka Summit Americas 2021
Partitions	90,000	Kafka Summit Americas 2021
Messages per second (all clusters)	2.2 million	Kafka Summit Americas 2021
API requests per second (largest cluster)	500,000	Kafka Summit Americas 2021
Inbound connections (largest cluster)	>100,000	Kafka Summit Americas 2021
Data processed per day	10+ TB	WarpStream case study, 2025
Monthly active users	14+ million	WarpStream case study, 2025
Concurrent Flink deployments	100+	Tony Chen, Current 2024
Postgres tables in CDC pipeline	Thousands	Robinhood Engineering blog, 2022
Inter-service latency (Kafka-connected)	20-50 ms	Chandra Kuchi, Diginomica 2024
Brokerage peak load growth (Dec 2019 to Jun 2020)	100k to 750k requests/sec (7x)	Edmond Wong and Nathan Ziebart, Robinhood blog, 2021
Faust throughput (2018 estimate)	Billions of events and terabytes of data per day	Ask Solem, Medium, 2018

Robinhood's Kafka architecture

Cluster topology and hosting

Robinhood originally ran self-managed Confluent Platform on AWS EC2, provisioned with Saltstack. Brokers were spread across three AWS availability zones per cluster.

By 2024, the team had migrated all clusters to Kubernetes. Rather than adopting an existing Kafka Kubernetes operator, Robinhood built a custom in-house operator to manage the migration and ongoing operations. Mun Yong Jang and Nathan Moderwell presented details of this migration at Confluent Current 2024. The three-AZ layout was preserved in the new Kubernetes deployment.

Logging workloads were separated from the main Kafka clusters in 2025 and migrated to WarpStream, a diskless, S3-backed Kafka-compatible broker. WarpStream runs as three independent Kubernetes deployments (one per AZ), each with its own horizontal pod autoscaler, so logging capacity tracks market-hours traffic without manual intervention. Ethan Chen and Renan Rueda from Robinhood presented this architecture at Confluent Current New Orleans 2025.

Cloud provider: AWS throughout. S3 serves as primary data lake storage and as WarpStream's backing store. Postgres runs on AWS RDS.

Schema management

Confluent Schema Registry is used for Avro encoding on the CDC pipeline. Debezium captures Postgres WAL and writes Avro-encoded change records to Kafka via the Schema Registry before downstream consumers read them into the data lake.

Producer architecture

All Python and Golang Kafka clients at Robinhood use kafkahood, an internal wrapper library built on librdkafka. kafkahood codifies standard defaults, client-side metrics collection, dead letter queue support, message serialisation, and feature-flag-based configuration rollouts. This means the Streaming Platform team can push new client defaults across the entire application fleet by flipping a feature flag rather than waiting for individual teams to update their library versions.

Producer connections from Python services are routed through kafkaproxy (described in detail in the Special Techniques section below).

Consumer architecture

Consumer lifecycle management is handled by the Consumer Proxy (kafkaproxy v2), a Java-based Kubernetes sidecar that runs alongside each application container. The proxy handles all Kafka consumer group coordination, rebalancing, commit management, and timeout management. It relays messages to the application container over gRPC. This design means a consumer group rebalance triggered by a scaling event does not affect the application container's business logic, and a failure in the application container does not orphan an uncommitted Kafka offset.

kafkahood also provides a Postgres-backed dead letter queue for failed consumer events. Failed messages are inserted into Postgres rather than a secondary Kafka topic, and a CLI allows engineers to inspect, query, and fix records before republishing them to a retry topic.

Consumer lag, throughput, and error rates are surfaced per application as standard defaults through kafkahood's client-side metrics, without each team needing to add their own instrumentation.

Stream processing

Faust (Python). Robinhood open-sourced Faust in 2018 as a Python port of the Kafka Streams API. In production, Faust applications handle risk signal processing, order quality monitoring, market data feed processing, newsfeed aggregation, ad tracking, event logging, and crypto feed processing. Faust uses RocksDB as a local embedded state store and aiokafka (which Robinhood also maintains as a fork) as its async Kafka client.

Apache Flink. Stateful stream processing for fraud detection, shareholder position tracking, investor eligibility determination, and data ingestion is handled by Flink. Robinhood runs more than 100 concurrent Flink deployments, managed by the Apache flink-kubernetes-operator. Tony Chen from the Streaming Platform team presented Robinhood's Flink deployment practices, including a custom Checkpoint-Fetcher tool for automating recovery, at Confluent Current 2024.

Kafka Connect ecosystem

Debezium is used as the primary Kafka Connect source connector, capturing CDC events from Postgres RDS. It encodes changes as Avro records using Confluent Schema Registry and writes them to Kafka for consumption by Apache Hudi Deltastreamer on the data lake side. Secor (originally open-sourced by Pinterest) archives Kafka streams to S3 for the batch path of the data lake.

Special techniques and engineering innovations

Kafkaproxy as a sidecar for connection aggregation. Python's process-based concurrency model creates a problem at Kafka scale: each gunicorn worker process opens its own set of TCP connections to brokers, which at Robinhood's cluster sizes produced more than 100,000 inbound connections on the largest cluster. The first-generation kafkaproxy was a Rust-based Kubernetes sidecar that ran alongside each application pod and aggregated all producer and consumer connections from that pod's worker pool into a single connection set to the brokers. This cut broker-side resource pressure from Python services substantially.

Consumer Proxy with gRPC message relay (v2). The second generation of kafkaproxy, presented by Tony Chen and Mun Yong Jang at Confluent Current 2023, replaced the Rust implementation with a Java-based sidecar using the standard Apache Kafka client library. The v2 proxy takes full ownership of the Kafka consumer lifecycle: it handles consumer group rebalancing, commit management, and timeout management, then relays raw messages to the application container over gRPC. This decoupling means the application team no longer needs to reason about consumer group behaviour, and the platform team maintains a single Java consumer implementation rather than N language-specific clients.

Shard-aware Kafka message routing. During the 2019-2020 period when brokerage traffic grew 7x, Robinhood introduced application-level sharding over Postgres. Each shard's Kafka consumers filter messages on shared topics to only process events belonging to users assigned to that shard. For high-throughput flows where the per-message filtering overhead became significant, Robinhood created shard-specific Kafka topics to eliminate the filtering cost entirely.

CDC with exactly-once semantics into the data lake. The CDC pipeline chains Debezium (Postgres WAL capture), Confluent Schema Registry (Avro encoding), Kafka (transport), and Apache Hudi Deltastreamer (exactly-once writes into S3). Copy-on-write mode is used for raw tables to optimise columnar read performance. This pipeline reduced data freshness from 24 hours to 5-15 minutes across thousands of Postgres tables.

Postgres-backed dead letter queue. Rather than routing failed consumer events to a secondary Kafka topic, Robinhood writes them to a Postgres database. A client application with a CLI allows engineers to inspect and query failed records, apply fixes, and republish messages to a Kafka retry topic for safe reprocessing. Sreeram Ramji and Wenlong Xiong, who presented this pattern at Confluent Current 2022, noted that cross-team schema contracts make silent consumer failures particularly hazardous at Robinhood, which drove the choice of a more inspectable store over a native Kafka DLQ topic.

WarpStream for elastic logging clusters. Log throughput at Robinhood is highly cyclical: it peaks during US market hours and drops significantly in evenings and on weekends. Running fixed-capacity Kafka clusters for this workload was costly and wasteful. By migrating logging to WarpStream, the team eliminated inter-AZ data transfer costs entirely (WarpStream's diskless architecture does not move data between availability zones) and gained per-AZ horizontal pod autoscaling. The result was a 45% reduction in total logging infrastructure cost, broken down as 36% compute savings, 13% storage savings, and 99% inter-AZ networking savings. The trade-off was an increase in end-to-end latency from 0.2 to 0.45 seconds, which was acceptable for the logging workload.

Feature-flag-based client library rollouts. kafkahood uses feature flags to control adoption of new client configuration defaults across the application fleet. A new default setting can be tested on a subset of services before rolling out broadly, with fast rollback if it causes latency or throughput regressions.

Operating Kafka at scale

Deployment model. Robinhood self-manages its Kafka infrastructure throughout its history, initially on EC2 with Saltstack and since 2024 on Kubernetes with a custom in-house operator. The WarpStream deployment for logging is also self-hosted (Kubernetes) rather than using WarpStream Cloud.

Observability. Client-side metrics collection is built into kafkahood as a standard default for all Python and Golang services. This provides per-application consumer lag, throughput, and error rates without requiring each team to instrument independently. The Vector log shipper routes application and infrastructure logs to Kafka (or WarpStream) for forwarding to Humio.

WarpStream Agent Groups for multi-tenant isolation. Within the logging WarpStream cluster, Agent Groups isolate traffic classes: VIP clients, SSL/TLS connections, and plaintext connections each run in separate groups. This prevents a high-throughput producer from saturating broker resources shared with other clients.

Flink checkpoint management. The Apache flink-kubernetes-operator had stability issues when reconciling more than 100 concurrent Flink deployments simultaneously. Robinhood tuned the operator's concurrent reconciliation configuration and built a custom Checkpoint-Fetcher tool that automates locating the latest valid Flink checkpoint and updating the initialSavepointPath field in the Kubernetes manifest. This reduces human error during recovery and rollback operations. Automatic rollback to the last healthy application version via health checks is also implemented.

Confluent Platform usage. Chandra Kuchi confirmed in a September 2024 interview that Robinhood uses Confluent CP. The current status of Confluent's commercial components post-Kubernetes migration has not been detailed in public sources.

Note: Robinhood's SRE team has not published details of Kafka alerting thresholds, SLO definitions, quota management, or CI/CD pipelines for topic provisioning in any source reviewed for this article.

Challenges and how they solved them

Python connection fan-in at broker scale. Robinhood's Python services use the gunicorn process-based worker model. Each worker process opens independent Kafka connections, and at hundreds of workers per cluster the broker inbound connection count exceeded 100,000 on the largest cluster, creating resource exhaustion and connection management overhead. The solution was kafkaproxy: a Kubernetes sidecar that aggregates all connections from a pod's worker pool into a shared set, dramatically reducing broker-side connection counts. A second generation of the proxy (in Java) later extended this to full consumer lifecycle management via gRPC.

Divergent Kafka client libraries across languages. Separate Kafka client wrappers for Python and Golang were drifting apart in defaults, error handling, and observability. kafkahood standardised client behaviour across languages with a single librdkafka-based library. The v2 consumer proxy went further, centralising the consumer group logic in one Java implementation that serves both Golang and Python application containers via gRPC.

Failed consumer events with no safe retry path. Schema mismatches and deserialization errors cause consumer events to fail permanently unless there is a way to inspect and fix the message before replaying it. A secondary Kafka DLQ topic does not provide sufficient tooling for this. Robinhood built a Postgres-backed DLQ with a CLI for inspection and targeted republishing to a retry topic.

Data lake freshness of 24 hours. Batch ingestion from Postgres into S3 meant analytics consumers worked with day-old data. The CDC pipeline (Debezium, Confluent Schema Registry, Kafka, Apache Hudi Deltastreamer with exactly-once writes) reduced freshness to 5-15 minutes across thousands of Postgres tables.

7x brokerage traffic growth in six months. Requests grew from 100k to 750k per second between December 2019 and June 2020, overwhelming a single-shard brokerage system. Application-level sharding distributed load across Postgres shards, each with its own Kafka consumers. High-throughput flows received shard-specific Kafka topics to eliminate per-message filtering overhead.

Logging infrastructure cost and elasticity. Self-managed Kafka clusters sized for peak market-hours throughput sat largely idle in evenings and weekends, and incurred inter-AZ data transfer fees continuously. Migrating to WarpStream eliminated the inter-AZ networking cost entirely (99% reduction) and allowed per-AZ autoscaling to track actual throughput. Total cost fell 45%.

Managing 100+ concurrent Flink deployments. The Apache flink-kubernetes-operator struggled with concurrent reconciliation at this scale, and manual checkpoint management introduced human error during recovery. Robinhood tuned the operator's concurrency settings, built the Checkpoint-Fetcher tool to automate manifest updates, and implemented health-check-based automatic rollback.

Full tech stack

Category	Tool	Role at Robinhood
Message broker	Apache Kafka (Confluent Platform)	Primary event broker across equities trading, crypto, self-clearing, market data, notifications, fraud detection, data lake ingestion
Message broker (logging)	WarpStream	Diskless, S3-backed Kafka-compatible broker for logging workloads; adopted 2025
Schema registry	Confluent Schema Registry	Schema management and Avro encoding for CDC change records from Debezium
Stream processing (Python)	Faust (open-sourced by Robinhood)	Risk/fraud detection, order quality monitoring, newsfeed aggregation, ad tracking, market data feed, crypto feed, event logging
Stream processing (JVM)	Apache Flink	Stateful processing: fraud detection, shareholder position tracking, investor eligibility; 100+ concurrent deployments
Kafka client wrapper	kafkahood (internal)	librdkafka-based library for Python and Golang; standardised defaults, client metrics, DLQ support, feature-flag rollouts
Consumer proxy	kafkaproxy / Consumer Proxy (internal)	Kubernetes sidecar (v1: Rust; v2: Java + gRPC) managing Kafka consumer lifecycle and connection aggregation
CDC connector	Debezium (Kafka Connect)	Captures WAL from Postgres RDS; writes Avro-encoded change records to Kafka
Data lake ingestion	Apache Hudi (Deltastreamer)	Exactly-once incremental writes from Kafka into S3 data lake
Data lake archival	Secor (Pinterest)	Archives Kafka streams to S3 for batch processing path
Batch processing	Apache Spark	Batch workloads on the data lake; also used for Hudi ingestion jobs
Query engine	Trino	SQL queries over the S3 data lake
Workflow orchestration	Apache Airflow	Orchestrates data pipeline jobs
Kubernetes operator	Apache Flink Kubernetes Operator	Manages 100+ concurrent Flink application deployments
Log shipping	Vector (Datadog)	Produces application and infrastructure logs to Kafka / WarpStream
Log storage and search	Humio	Receives logs from the Kafka/WarpStream pipeline
Embedded state store	RocksDB	Local state store used by Faust for stateful stream processing
Async Kafka client	aiokafka (Robinhood fork)	Async Kafka client used by Faust; maintained as a fork by Robinhood
Object storage	AWS S3	Data lake primary storage; backing store for WarpStream
Primary database	AWS RDS (Postgres)	Source for CDC pipeline; also backing store for Postgres DLQ
Container orchestration	Kubernetes	Hosts all services, Kafka clusters (post-2024 migration), and Flink deployments
Metastore	Apache Hive Metastore	Schema management for data lake tables
Cluster manager	Apache Hadoop YARN	Cluster management for Spark jobs

Key contributors

Name	Role	Contribution
Jaren Glover	Ops engineer, Robinhood (c. 2015-2019)	Stabilised and scaled Kafka infrastructure from 100,000 to 10 million users; presented at SREcon Americas 2019 and Kafka Summit 2019
Ask Solem	Software engineer, Robinhood	Created Faust, the Python stream-processing library built on Kafka; open-sourced July 2018
Vineet Goel	Software engineer, Robinhood	Co-authored the Faust introductory blog post and the "Adding Faust to your Existing Architecture" post
Chandra Kuchi	Engineering Manager, Data Infrastructure Lead, Streaming Platform	Presented "Taming a Massive Fleet of Python-based Kafka Apps at Robinhood" at Kafka Summit Americas 2021; cited in Diginomica (2024) on Robinhood's Kafka and Flink architecture
Nick Dellamaggiore	Software engineer, Kafka Infrastructure, Robinhood	Co-presented at Kafka Summit Americas 2021
Tony Chen	Software engineer, Streaming Platform, Robinhood	Presented "Robinhood's Kafkaproxy" at Current 2023 and "Robinhood's Flink Deployment Practices" at Current 2024
Mun Yong Jang	Software engineer, Streaming Platform, Robinhood	Co-presented kafkaproxy at Current 2023; co-presented "Robinhood's Kafka Journey from EC2 to Kubernetes" at Current 2024
Nathan Moderwell	Software engineer, Robinhood	Co-presented Kafka EC2-to-Kubernetes migration at Current 2024
Sreeram Ramji	Senior Staff Software Engineer, Data Infra and Experimentation	Co-presented "Dead Letter Queues for Kafka Consumers in Robinhood" at Current 2022; owns Kafka, Flink, Spark, Trino, Airflow, and Data Lake infrastructure
Wenlong Xiong	Software engineer, Streaming Platform and Batch Compute (formerly)	Co-presented DLQ architecture at Current 2022
Ethan Chen	Software engineer, Robinhood	Co-presented WarpStream logging migration at Current New Orleans 2025
Renan Rueda	Software engineer, Robinhood	Co-presented WarpStream logging migration at Current New Orleans 2025
Stephen Chang	Engineering Manager, Payments, Robinhood	Authored "Building a Resilient Card Transaction System" (2022), covering Kafka's role in backup payment authorisation
Edmond Wong and Nathan Ziebart	Technical Leads, Brokerage Engineering, Robinhood	Co-authored "How we scaled Robinhood's brokerage system for greater reliability" (2021), covering shard-aware Kafka filtering

Key takeaways for your own Kafka implementation

Connection fan-in is a real problem for process-based concurrency models. If you run Python services with gunicorn or uWSGI and connect them directly to Kafka, each worker process opens its own connections. At a few hundred workers per cluster this adds up to tens of thousands of inbound connections on your brokers. A sidecar proxy that aggregates connections from a single pod is one approach; Robinhood also considered async client libraries but chose the proxy model for the isolation benefits it provided.

Decoupling consumer lifecycle from application code simplifies failure handling. Robinhood's Consumer Proxy moves all consumer group logic, rebalancing, commits, and timeouts into a separate sidecar. Your application container receives messages over gRPC only after they have been successfully consumed. This means application failures do not stall consumer groups, and consumer group rebalances triggered by scaling do not introduce latency into your application.

A Postgres-backed dead letter queue gives you more repair tooling than a DLQ topic. A secondary Kafka topic stores failed messages but does not give you a way to query them by failure reason, apply a fix, or selectively replay a subset. If your teams share schema contracts and schema mismatches are a real failure mode, the ability to inspect and fix records before replay is worth the additional infrastructure.

Separating workloads by their latency and cost profiles pays off at scale. Robinhood's logging traffic and its trading traffic have very different latency requirements: 0.45 seconds is acceptable for logs but not for order routing. Migrating logging to WarpStream (with higher latency but lower cost) while keeping trading on self-managed Kafka let the team optimise each cluster for its actual workload, producing a 45% cost reduction on the logging side without touching the trading path.

A standardised internal Kafka client library is worth the investment early. kafkahood encodes sane defaults, observability, DLQ support, and feature-flag-controlled rollouts for all services. Teams get correct client behaviour without needing to understand all the underlying librdkafka configuration options. The payoff is proportional to the number of services you run: at 14+ million users and hundreds of services, a single misconfigured default can cause widespread consumer lag or data loss.

Sources and further reading

Title	Author(s)	Venue	Date
Robinhood swaps Kafka for WarpStream to tame logging workloads and costs	Ethan Chen, Renan Rueda (Robinhood); Jason Lauritzen (WarpStream)	WarpStream blog / Current New Orleans 2025	2025
Taming a Massive Fleet of Python-based Kafka Apps at Robinhood	Chandra Kuchi, Nick Dellamaggiore (Robinhood)	Kafka Summit Americas 2021	2021
Robinhood's Kafkaproxy: Decoupling Kafka Consumer Logic from Application Business Logic	Tony Chen, Mun Yong Jang (Robinhood)	Confluent Current 2023	2023
Dead Letter Queues for Kafka Consumers in Robinhood	Sreeram Ramji, Wenlong Xiong (Robinhood)	Confluent Current 2022	2022
Faust: Stream Processing for Python	Ask Solem, Vineet Goel (Robinhood)	Robinhood Engineering / Medium	2018-07-31
Robinhood's Kafka Journey from EC2 to Kubernetes	Mun Yong Jang, Nathan Moderwell (Robinhood)	Confluent Current 2024	2024
Robinhood's Flink deployment practices	Tony Chen (Robinhood)	Confluent Current 2024	2024
Data lake at Robinhood	Robinhood Engineering	Robinhood Newsroom	2019-09-09
Robinhood invests in a stable and scalable future with Confluent	Derek du Preez (interviewing Chandra Kuchi)	Diginomica	2024-09-18
Scaling Robinhood crypto systems	Chirantan Mahipal, Hefu Chai, Xuan Zhang (Robinhood)	Robinhood Newsroom	2022-09-09
Fresher data lake on AWS S3	Robinhood Engineering	Robinhood Newsroom	2022-02-18
Building a resilient card transaction system	Stephen Chang (Robinhood)	Robinhood Newsroom	2022-10-12
How we scaled Robinhood's brokerage system for greater reliability	Edmond Wong, Nathan Ziebart (Robinhood)	Robinhood Newsroom	2021-06-25
Faust: streaming at Robinhood with Ask Solem	SE Daily (interview with Ask Solem)	Software Engineering Daily	2018-09-05

‍

If you are running Kafka in production and want visibility into consumer lag, throughput, and cluster health across your brokers, give Kpow a try with a free 30-day trial. You can connect it to any Kafka cluster in minutes and deploy it via Docker, Helm, or JAR.

Kafka cluster monitoring

A detailed guide to Kafka producer monitoring

Kafka consumer monitoring and performance tuning

Kafka broker monitoring

Apache Kafka architecture: a complete guide to internals, components, and deployment

Kafka monitoring: a complete guide for platform engineers

Data governance for Kafka: introducing lineage support in Factor Platform

Defense in depth: unifying RBAC and data policies for transparent governance

CMAK: Review, pricing, and best alternatives in 2026

Kadeck: Review, pricing, and best alternatives in 2026

AKHQ: Review, pricing, and best alternatives in 2026

Redpanda Console: Review, pricing, and best alternatives in 2026