How Netflix uses Apache Kafka in production


Factor House
May 11th, 2026

Netflix's Kafka story is one of the more thoroughly documented in the industry. Starting with a single pipeline routing 30% of event traffic in early 2015, Netflix has grown to a system processing over 2 trillion events per day across thousands of brokers — with 20,000-plus Flink jobs connected by Kafka topics at the core of its real-time data infrastructure.

The engineering problem Kafka solves at Netflix is scale and decoupling across a massive, distributed microservice estate. Every interaction a member has — a play, a search, a scroll, a title impression — needs to reach multiple downstream systems within seconds. Kafka sits at the centre of that flow.

Company overview

Netflix is a streaming entertainment service operating in over 190 countries, with around 300 million subscribers as of 2025. At its scale, real-time data is not a feature; it drives personalisation, content decisions, studio operations, and service reliability. The platform generates enormous event volumes at all hours across all regions, which means the data infrastructure has to match both the throughput and the latency requirements of those use cases.

Netflix adopted Apache Kafka incrementally. Before Kafka, event data flowed through a system called Suro, a proprietary pipeline open-sourced in 2013. Kafka entered the picture in early 2015 as part of Keystone V1.5 — initially handling about 30% of event traffic alongside Chukwa for real-time consumers. By December 2015, Keystone V2 was in production with Kafka as the primary ingest layer, replacing Chukwa entirely.

2013: Suro open-sourced — Netflix's pre-Kafka event pipeline
Feb 2015: Keystone V1.5 — Kafka introduced, handling ~30% of event traffic
Dec 2015: Keystone V2 live — Kafka becomes the primary fronting gateway
Apr 2016: 36 clusters, 4,000+ brokers, 700 billion messages per day
Mar 2018: 3,000+ brokers globally, 1 trillion messages per day, 99.99% availability
Sep 2019: Inca published — sampled distributed tracing on a dedicated Kafka cluster
2022: Data Mesh platform: 2 trillion+ events per day, 20,000+ Flink jobs
Nov 2023: Streaming SQL in Data Mesh: 1,200 SQL processors created in the first year
Jul 2025: Tudum CQRS architecture replaced — Kafka retired for one read-heavy use case
Oct 2025: Real-Time Distributed Graph published: up to 1 million messages per second per topic
Dec 2025: Live streaming monitoring stack: 38 million events per second peak capacity

Netflix's Kafka use cases

Netflix uses Kafka across several distinct domains. The use cases range from high-volume, low-latency event ingest to event-driven microservice coordination and real-time member personalisation.

Keystone pipeline — centralised event ingest and routing

The Keystone pipeline is Netflix's primary data highway. Every interaction a member generates on any device flows through Keystone: plays, pauses, searches, row scrolls, title impressions. Kafka serves as the front door, ingesting these events from client applications and routing them to downstream sinks including Amazon S3, Elasticsearch, Apache Cassandra, and secondary Kafka topics.

The routing tier runs Flink and Samza jobs on EC2, processing events in near real time and updating member taste profiles in Cassandra within seconds of the originating interaction.
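
Netflix has not published the router code itself, but the shape of the pattern, a Flink job consuming from a fronting topic and writing routed output onward, can be sketched with the open-source Flink Kafka connector. Topic names, cluster addresses, and the trivial filter below are illustrative, not Netflix's implementation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeystoneStyleRouter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume raw member events from a fronting-tier topic (name is illustrative).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("fronting-kafka:9092")
                .setTopics("member-events")
                .setGroupId("keystone-style-router")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Route play events to a consumer-tier topic; real routing also fans out to S3,
        // Elasticsearch, and Cassandra sinks.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("consumer-kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("play-events")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "member-events")
                .filter(event -> event.contains("\"type\":\"play\""))
                .sinkTo(sink);

        env.execute("keystone-style-router");
    }
}
```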

Stream processing as a service (Keystone SPaaS)

Keystone's SPaaS tier makes Flink-over-Kafka available to internal teams as a managed service. Teams submit stream processing jobs without owning infrastructure. Use cases include generating real-time features for recommendation models (replacing 24-hour batch jobs), aggregating A/B test results, and building near-real-time personalisation pipelines.

Nitin Sharma, who led Content Finance infrastructure at Netflix, described the core value: "We're able to replace batch workflows that ran once a day with streaming pipelines that update model inputs continuously."

Data Mesh — inter-service data movement and processing

The Data Mesh platform, described in a 2022 Netflix TechBlog post, uses Kafka topics as the connective tissue between processing stages. Every upstream Flink Processor writes output to a Kafka topic; every downstream Processor reads from one. The platform manages topic provisioning, schema registration, and job lifecycle.

As of 2022, Netflix operates 20,000-plus Flink jobs connected by thousands of Kafka topics through this platform. A Streaming SQL layer added in 2023 wraps the Flink Table API behind standard SQL, allowing non-infrastructure teams to build pipelines without writing custom Processors. Within the first year of its launch, 1,200 SQL processors had been created by teams outside the data infrastructure organisation.

Content Finance — event-driven microservices

Netflix's Content Finance engineering team (budgeting, talent payments, content accounting, cashflow) migrated from synchronous RPC calls to a Kafka-based event-driven architecture. Services now emit standardised events containing an entity ID, UUID, timestamp, operation type, and optional payload. Flink enriches and reorders these events before routing them to keyed Kafka topics consumed by Spring Boot services.
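
The exact envelope schema is not public. A minimal sketch of the described fields (entity ID, UUID, timestamp, operation type, optional payload), keyed by entity ID so per-entity ordering is preserved, might look like this; the field names, topic, and JSON encoding are assumptions.

```java
import java.time.Instant;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FinanceEventProducer {

    // Standardised envelope with the fields described above (names are assumed).
    record FinanceEvent(String entityId, UUID eventId, Instant timestamp,
                        String operation, String payload) {
        String toJson() {
            return String.format(
                "{\"entityId\":\"%s\",\"eventId\":\"%s\",\"timestamp\":\"%s\",\"operation\":\"%s\",\"payload\":%s}",
                entityId, eventId, timestamp, operation,
                payload == null ? "null" : "\"" + payload + "\"");
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        FinanceEvent event = new FinanceEvent(
                "movie-12345", UUID.randomUUID(), Instant.now(), "BUDGET_UPDATED", null);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by entity ID so all events for one entity land on the same partition,
            // preserving per-entity ordering for the downstream Spring Boot consumers.
            producer.send(new ProducerRecord<>("content-finance-events",
                    event.entityId(), event.toJson()));
        }
    }
}
```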

Before the migration, synchronous dependencies between services created cascading failures and split-brain state when an upstream service became unavailable. After the migration, at-least-once delivery with idempotent consumers eliminated the reconciliation overhead that had required manual intervention.

Monitoring and observability

Netflix uses Kafka at multiple layers of its observability stack.

The Title Health microservice ingests real-time title impression events via Kafka to validate content availability — artwork, recommendations, localisation — across thousands of monthly launches. Collector jobs process these events and surface problems before a title reaches a significant audience.

The live streaming monitoring stack, introduced as Netflix expanded into live events, uses Kafka alongside Mantis, Atlas, and Druid to process up to 38 million events per second and deliver critical playback quality metrics within seconds. This stack was designed to detect degradation at the speed live events require.

Write-Ahead Log — database resilience

Netflix built a Write-Ahead Log (WAL) abstraction that uses Kafka (alongside Amazon SQS) as a pluggable message queue to decouple producers from consumers in database mutation flows. Each WAL namespace gets a dedicated Kafka topic and a dead-letter queue by default. The WAL supports delay queues, cross-region replication, and multi-table atomic mutations. The most common deployment pattern is as a delay queue for scheduled database writes.

Real-Time Distributed Graph — member taste profiles

Netflix's Real-Time Distributed Graph (RDG) uses Kafka to carry member actions — views, ratings, interactions — into Flink jobs that populate a graph database representing member preferences. Kafka provides durable, replayable streams so downstream processors can consume events in real time or replay from an earlier offset if needed. Individual Kafka topics in the RDG system carry up to roughly 1 million messages per second.

Scale and throughput

Netflix's published scale figures show consistent growth over a decade:

2016: 36 clusters, 4,000+ broker instances, 700+ billion messages per day (Netflix TechBlog, "Kafka Inside Keystone Pipeline")
2017: 700+ Kafka topics in production, 450 billion events per day (Shriya Arora, QCon New York 2017)
2018: 3,000+ brokers globally, 1 trillion messages per day, 99.99% availability (Allen Wang, QCon London 2018)
2018: Up to 8 million events per second peak throughput (Netflix TechBlog, "Keystone Real-time Stream Processing Platform")
2019: 2 billion trace messages per day through Inca, on a dedicated Kafka cluster (Allen Wang, Netflix TechBlog)
2022: 2 trillion+ events per day, 20,000+ Flink jobs connected by Kafka topics (Netflix TechBlog, "Data Mesh")
2023: 1,200 SQL processors created in the first year of the Streaming SQL launch (Netflix TechBlog, "Streaming SQL in Data Mesh")
2025: Up to ~1 million messages per second per Kafka topic in the RDG system (Netflix TechBlog, "Real-Time Distributed Graph")
2025: 38 million events per second peak capacity for live streaming monitoring (InfoQ, December 2025)

Retention policies are tailored per topic based on throughput and record size. In the Keystone batch path, Kafka TTL was set to 4-6 hours with raw events backed up in HDFS and S3 for 1-2 days to support replay when Kafka retention had expired. More recent architectures use Apache Iceberg tables as a backfill source for pipelines that need data beyond Kafka's retention window.
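
Per-topic retention like the 4-6 hour Keystone TTL is expressed through standard topic configuration. As a sketch of how such a topic could be declared with the Kafka AdminClient (topic name, partition count, and values are illustrative, not Netflix's settings):

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateShortRetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 6-hour retention on the hot path; longer-term replay comes from S3/Iceberg.
            NewTopic topic = new NewTopic("keystone-events", 48, (short) 3)
                    .configs(Map.of(
                            "retention.ms", String.valueOf(6L * 60 * 60 * 1000),
                            "min.insync.replicas", "2"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```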

Netflix's Kafka architecture

High-level topology

Netflix operates entirely on AWS and chose a multi-cluster topology specifically designed for the cloud's characteristics. Rather than scaling a single large cluster, Netflix provisions many smaller, mostly immutable clusters. This limits blast radius, simplifies broker upgrades, and suits a cloud environment that favours stateless, disposable infrastructure.

The Keystone pipeline uses a two-tier cluster layout:

  • Fronting Kafka clusters receive events from all producers. Their purpose is ingest.
  • Consumer Kafka clusters serve downstream consumers. Their purpose is read throughput.

Separating these tiers prevents consumer fan-out from affecting producer ingest performance. As the number of downstream consumer groups grew, serving them all from a single cluster became impractical.

Custom smart clients wrap the standard Kafka producer and consumer interfaces. Producers write to fronting clusters; consumers read from consumer clusters. The smart client handles routing and failover transparently, so application code interacts with a logical endpoint rather than a specific cluster.
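
The smart clients themselves are internal to Netflix. Purely as an illustration of the idea, resolving a logical stream name to a physical cluster before constructing a standard client, a rough sketch with the discovery lookup stubbed out as a map might look like this; class and stream names are hypothetical.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class SmartProducerFactory {

    // Stand-in for an internal discovery service: maps a logical stream name to the
    // bootstrap servers of the fronting cluster currently serving it.
    private final Map<String, String> logicalToBootstrap;

    public SmartProducerFactory(Map<String, String> logicalToBootstrap) {
        this.logicalToBootstrap = logicalToBootstrap;
    }

    public KafkaProducer<byte[], byte[]> producerFor(String logicalStream) {
        String bootstrap = logicalToBootstrap.get(logicalStream);
        if (bootstrap == null) {
            throw new IllegalArgumentException("Unknown stream: " + logicalStream);
        }
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Application code asks for "member-events" and never sees which of the many
        // small fronting clusters it is actually writing to.
        return new KafkaProducer<>(props);
    }
}
```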

Schema management

All Kafka events at Netflix use Apache Avro serialisation. Before any event can be written to a topic, its Avro schema must be registered in an internal Schema Registry. This enforces schema-at-producer: malformed or undeclared events cannot enter the pipeline. Avro provides roughly 3-5x size reduction versus JSON, which matters at multi-trillion-event-per-day volumes.
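
Netflix's Schema Registry and its client API are internal, so only the Avro side of the contract can be shown here: declaring a schema and producing Avro-encoded bytes, with registration left as a comment. The schema, topic, and field names below are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AvroEventProducer {
    // Illustrative schema for a playback event; real Netflix schemas are internal.
    static final Schema SCHEMA = new Schema.Parser().parse("""
        {"type":"record","name":"PlaybackEvent","fields":[
          {"name":"memberId","type":"string"},
          {"name":"titleId","type":"string"},
          {"name":"eventTimeMs","type":"long"}]}""");

    public static void main(String[] args) throws Exception {
        // In Netflix's pipeline this schema would have to be registered with the
        // internal Schema Registry before any producer could write events that use it.
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("memberId", "member-42");
        event.put("titleId", "title-80100172");
        event.put("eventTimeMs", System.currentTimeMillis());

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("playback-events",
                    event.get("memberId").toString(), out.toByteArray()));
        }
    }
}
```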

Producer architecture

The Keystone ingest layer accepts events via both a Java library embedded in producer applications and an HTTP proxy for services or devices that cannot embed the library. Producers target the fronting tier. Configuration details for acks, batching, and idempotency have not been published in detail in Netflix's public posts, but the shift to rack-aware partition assignment (see Special Techniques below) indicates careful tuning of cross-AZ placement.

Consumer architecture

Consumer groups read from the consumer tier clusters. Offset management follows standard Kafka committed-offset patterns, with offset-based lag monitoring feeding the Atlas metrics platform via a custom Kafka MetricReporter. Netflix derives consumer lag by calculating log end offsets minus committed offsets per topic-partition.
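
The lag calculation itself (log end offset minus committed offset) is expressible with the standard Kafka AdminClient; Netflix's actual reporter is internal, but the arithmetic is the same. Group and bootstrap values below are illustrative.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group, per topic-partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("keystone-router")
                         .partitionsToOffsetAndMetadata().get();

            // Log end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            var endOffsets = admin.listOffsets(latestSpec).all().get();

            // Lag = log end offset minus committed offset, per partition.
            committed.forEach((tp, offset) -> {
                long lag = endOffsets.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```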

A "stuck consumer" detection mechanism flags cases where committed consumer offsets stall within a given window, triggering First Responder, Netflix's auto-remediation system, to relaunch affected stateless jobs.

Stream processing

Flink is the primary stream processing engine throughout Netflix's Kafka infrastructure. In the Keystone SPaaS tier, Flink jobs run in Docker containers on EC2 and are managed by the platform's reconciliation layer. The Data Mesh platform connects thousands of Flink Processors via Kafka topics, with job state stored in Amazon RDS as the authoritative source of truth.

A Streaming SQL layer (added in 2023) wraps the Flink Table API and translates SQL queries into Flink job graphs, eliminating the need for multiple intermediate Processors and Kafka topics for simple transformations.
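
Netflix's Streaming SQL sits on the internal Data Mesh control plane, but the open-source equivalent of "SQL over a Kafka topic via the Flink Table API" looks roughly like the sketch below. Table names, fields, and connector options are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Declare a Kafka topic as a SQL table (names and fields are illustrative).
        tEnv.executeSql("""
            CREATE TABLE plays (
              member_id STRING,
              title_id  STRING,
              event_time TIMESTAMP(3),
              WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
            ) WITH (
              'connector' = 'kafka',
              'topic' = 'play-events',
              'properties.bootstrap.servers' = 'localhost:9092',
              'properties.group.id' = 'streaming-sql-demo',
              'scan.startup.mode' = 'latest-offset',
              'format' = 'json'
            )""");

        // A transformation a product team could express without writing a custom
        // Processor: per-title play counts over one-minute tumbling windows.
        tEnv.executeSql("""
            SELECT title_id,
                   TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
                   COUNT(*) AS plays
            FROM plays
            GROUP BY title_id, TUMBLE(event_time, INTERVAL '1' MINUTE)""").print();
    }
}
```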

Kafka Connect ecosystem

Netflix's public posts do not describe Kafka Connect specifically. The routing layer in Keystone writes to S3 and Elasticsearch using managed Flink/Samza jobs rather than off-the-shelf connectors, suggesting custom integration rather than a standard Connect deployment.

Special techniques and engineering innovations

Rack-aware partition assignment

Allen Wang contributed Kafka's rack-aware partition assignment feature upstream to Apache Kafka. The feature assigns partitions to brokers and consumers in the same AWS availability zone where possible, reducing cross-AZ data transfer costs and latency. At Netflix's scale — thousands of brokers across multiple AZs — the economics of cross-AZ data transfer are material.
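
In open-source Kafka terms, Wang's contribution (rack-aware replica placement) is driven by the broker.rack setting at topic creation. Zone-affine reads via client.rack and follower fetching arrived later in Apache Kafka and are shown below only because they address the same cross-AZ cost concern; this is a generic sketch, not Netflix's internal client code, and the values are illustrative.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RackAwareConsumer {
    public static void main(String[] args) {
        // Broker side (server.properties), per broker:
        //   broker.rack=us-east-1a
        //     -> enables rack-aware replica placement, the feature contributed upstream
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        //     -> later follower-fetching feature, shown for completeness

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "rack-aware-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        // Tell the broker which AZ this consumer runs in so reads can be served from a
        // replica in the same zone, avoiding cross-AZ transfer.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("member-events"));
            // poll loop omitted
        }
    }
}
```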

Immutable, small-cluster design

Netflix deliberately chose many small immutable clusters over one large cluster. In Allen Wang's words from QCon London 2018, the decision reflects the reality that stateful services are harder to operate in the cloud than stateless ones. Smaller clusters limit the blast radius of failures and make rolling upgrades more predictable. The smart client layer abstracts this topology from application code.

External heartbeat monitoring for Kafka

Netflix's monitoring service continuously sends heartbeat messages to each Kafka cluster and simultaneously consumes them from the outside. This verifies both the producer path and the consumer path from an external perspective rather than relying solely on broker-internal metrics. Wang presented this pattern at QCon SF 2019 as part of the broader monitoring and tracing infrastructure.
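
The essence of the pattern is small: produce a timestamped heartbeat into each cluster from outside, consume it back from outside, and alert when the round trip stalls or slows. A minimal sketch, with the topic name and single-cluster loop as assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HeartbeatMonitor {
    public static void main(String[] args) {
        Properties prod = new Properties();
        prod.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        prod.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        prod.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        Properties cons = new Properties();
        cons.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cons.put(ConsumerConfig.GROUP_ID_CONFIG, "heartbeat-monitor");
        cons.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        cons.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cons.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(List.of("cluster-heartbeat"));
            while (true) {
                // Write path: publish a timestamped heartbeat to the cluster under test.
                producer.send(new ProducerRecord<>("cluster-heartbeat", "hb",
                        String.valueOf(System.currentTimeMillis())));
                producer.flush();
                // Read path: consume it back and measure end-to-end latency externally.
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5))) {
                    long latencyMs = System.currentTimeMillis() - Long.parseLong(rec.value());
                    System.out.printf("heartbeat round trip: %d ms%n", latencyMs);
                }
            }
        }
    }
}
```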

Inca — sampled distributed tracing on a dedicated Kafka cluster

After evaluating and rejecting Zipkin, Netflix built Inca, a message tracing and loss detection system. Trace signals propagate through Kafka record headers. All trace messages are stored on a dedicated Kafka cluster with replication factor 3, min ISR 2, and EBS-backed storage.

The stream processing challenge Inca had to solve was unpredictable trace arrival times: trace records for a single request can arrive out of order and at variable delays. Inca uses Flink's GlobalWindow with custom triggers and treats committed consumer offsets as an external signal to determine when to close windows, rather than relying on event-time watermarks alone. The result is a false-positive loss-detection rate of 0.005%, processing 2 billion trace messages per day.
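
Carrying trace signals in record headers uses standard Kafka producer APIs; Inca's actual header layout is not public, so the header names below are made up to show the mechanism only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TracedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("member-events", "member-42", "{\"type\":\"play\"}");
            // Trace signal carried in record headers, alongside the payload rather than
            // inside it; header names here are hypothetical.
            record.headers()
                  .add("trace-id", UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8))
                  .add("trace-sampled", new byte[]{1});
            producer.send(record);
        }
    }
}
```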

Iceberg as a Kafka backfill source

When Kafka retention expires, pipelines that need to replay historical data can fall back to Apache Iceberg data warehouse tables. Kafka topic records are persisted to Iceberg, providing a long-retention complement to Kafka's short-retention window. This pattern is described in the Data Mesh blog post as part of the backfill story.

Dead-letter queues on every WAL namespace

Netflix's WAL abstraction automatically provisions a DLQ alongside every Kafka topic it manages. Transient errors are retried with configurable delays; hard errors are parked for manual inspection. The WAL also supports cross-region replication and delay queues for scheduled database writes, without requiring application code to implement these patterns directly.
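
The WAL's internals are Netflix's own, but the park-on-hard-error pattern it automates is a standard consumer-side idiom. A minimal sketch, assuming a ".dlq" topic naming convention and omitting the delay-queue retry machinery:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DlqConsumer {
    static final String TOPIC = "wal-namespace-orders"; // illustrative namespace topic
    static final String DLQ = TOPIC + ".dlq";            // naming convention is assumed

    public static void main(String[] args) {
        Properties cons = new Properties();
        cons.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cons.put(ConsumerConfig.GROUP_ID_CONFIG, "wal-applier");
        cons.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cons.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties prod = new Properties();
        prod.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        prod.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        prod.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(prod)) {
            consumer.subscribe(List.of(TOPIC));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    try {
                        applyMutation(rec.value()); // e.g. write to the target database
                    } catch (IllegalArgumentException hardError) {
                        // Hard error: park the record on the DLQ for manual inspection.
                        dlqProducer.send(new ProducerRecord<>(DLQ, rec.key(), rec.value()));
                    } catch (RuntimeException transientError) {
                        // Transient error: the real system retries with a configurable
                        // delay; the retry/delay queue is omitted here.
                    }
                }
                consumer.commitSync();
            }
        }
    }

    static void applyMutation(String value) { /* database write goes here */ }
}
```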

Operating Kafka at scale

Deployment model

Netflix runs Kafka entirely on AWS, with brokers on EC2 instances and EBS storage for the clusters that require durability (such as Inca's dedicated tracing cluster). The multi-cluster topology is managed by a control-plane service that handles topic creation, partition metadata, and cluster assignment.

Monitoring and observability

Kafka operational metrics flow to Atlas, Netflix's internal dimensional time-series metrics platform, via a custom Kafka MetricReporter. Consumer lag (log end offset minus committed offset per topic-partition) is the primary consumption health signal. SLAs are expressed as message consumption lag, transfer rates, and cross-region replication metrics.
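
Netflix's Atlas reporter is internal, but Kafka clients expose the same hook through the public MetricsReporter plug-in interface, registered via the metric.reporters client config. A rough sketch of that shape, with the actual forwarding to a metrics backend replaced by a placeholder print:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

/**
 * Kafka hands every registered client metric to this class, which can then forward
 * values to an external metrics system (Atlas in Netflix's case; here, stdout).
 */
public class AtlasStyleMetricsReporter implements MetricsReporter {

    private final Map<String, KafkaMetric> metrics = new ConcurrentHashMap<>();

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void init(List<KafkaMetric> initialMetrics) {
        initialMetrics.forEach(this::metricChange);
    }

    @Override
    public void metricChange(KafkaMetric metric) {
        metrics.put(metric.metricName().group() + "/" + metric.metricName().name(), metric);
    }

    @Override
    public void metricRemoval(KafkaMetric metric) {
        metrics.remove(metric.metricName().group() + "/" + metric.metricName().name());
    }

    /** Called on a timer in a real reporter; publishes current values. */
    public void publish() {
        metrics.forEach((name, metric) ->
                System.out.printf("%s = %s%n", name, metric.metricValue()));
    }

    @Override
    public void close() {
        metrics.clear();
    }
}
```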

For end-to-end pipeline health, Netflix uses watchdog monitors that track event propagation from producer to final sink, not just broker-level metrics.

Auto-remediation

When stuck-consumer alerts fire, First Responder automatically relaunches the affected stateless streaming jobs without human intervention. This reduces on-call toil and accelerates mean time to recovery for the most common class of streaming job failure.

Autoscaling

Capacity for Flink jobs connected to Kafka is calculated using workload predictions (quadratic or linear regression based on historical throughput) to maintain target processing rates during traffic fluctuations. The Keystone SPaaS reconciliation layer continuously compares desired job state (stored in RDS) against actual state (running Flink jobs) and self-heals divergences.

Schema governance

Schema registration is mandatory before any event can be written to a Kafka topic. The internal Schema Registry enforces this at the platform level. Teams cannot publish to Kafka without a registered Avro schema, which prevents malformed events from entering the pipeline and creates a centralised contract for every data stream.

Retention policies

Retention is set per topic based on throughput and record size. The general approach balances data availability for replay against storage cost. For use cases where data needs to outlast Kafka retention, Iceberg tables serve as the long-term store.

Developer experience

The Streaming SQL layer introduced in 2023 was explicitly designed to lower the bar for non-infrastructure teams. Engineers can write SQL queries to describe transformations; the platform translates them into Flink job graphs. Within a year, teams outside the data infrastructure organisation had created 1,200 SQL processors, indicating that the abstraction achieved meaningful adoption without requiring platform expertise.

Challenges and how they solved them

High-level consumer partition loss (2016)

In Keystone V1.5, Netflix used Kafka's high-level consumer API and observed a known bug where the consumer could lose partition ownership and stop consuming some partitions after running stably for a period. The immediate symptom was data loss, not an error. The solution was moving to Keystone V2's managed routing service with direct partition assignment and monitoring, replacing the high-level consumer with a more explicitly managed consumption model.

Problem: Partition ownership silently dropped under long-running high-level consumers.

Root cause: A known Kafka high-level consumer bug at the time.

Solution: Direct partition assignment in the V2 routing layer.

Outcome: Partition ownership became explicit and monitorable.

Data loss from low-durability configurations

Allen Wang identified three categories of data loss in Netflix's production Kafka environment, described at QCon SF 2019:

  1. Replication factor of 2 combined with acks=1 producer configuration. When two brokers hold a partition and the leader fails before replication completes, writes are lost.
  2. Partition leader clock drift causing unexpected log truncation in extreme conditions.
  3. Deployment tooling assigning duplicate consumer group IDs, causing two independent consumer applications to share offsets.

Each was addressed through configuration changes and tooling improvements. The lesson Wang drew: data loss in Kafka rarely comes from Kafka's own logic — it comes from the configuration choices made around it.

Problem: Unexplained data loss in production, hard to attribute to a single cause.

Root cause: Three distinct configuration and tooling mistakes rather than a single failure mode.

Solution: Higher replication factor, acks=all for critical topics, unique consumer group ID enforcement.

Outcome: Data loss incidents reduced; tracing infrastructure (Inca) built to detect any residual loss.
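
In open-source Kafka terms, the durable configuration described above maps to a handful of producer and topic settings. A minimal sketch (broker address and topic-level values are illustrative):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait for all in-sync replicas to acknowledge, instead of the leader alone.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotent producer prevents duplicates introduced by retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        // Topic side (set at creation or via alter-configs, not in producer code):
        //   replication.factor = 3
        //   min.insync.replicas = 2
        // Together with acks=all, a write succeeds only once at least two replicas hold it.

        // Consumer side: derive group IDs from the owning application, so deployment
        // tooling cannot hand two independent applications the same group.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) for critical topics
        }
    }
}
```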

Consumer fan-out at scale

As the number of downstream consumer groups grew, the original single-cluster design could not efficiently serve hundreds of groups with different lag tolerances and throughput requirements. Producers and consumers competed for broker resources.

Problem: Consumer fan-out degraded ingest performance.

Root cause: No isolation between the write path and the read path.

Solution: A two-tier cluster topology, with fronting clusters for producers, consumer clusters for consumers, and smart clients routing transparently between them.

Outcome: Independent scaling of ingest and consumption capacity.

Late-arriving events in stream processing (2017)

When Netflix migrated batch ETL to Flink and Kafka for recommendation model training, late-arriving events — those arriving after their intended event-time window — caused incorrect attribution in the model inputs. The Kafka TTL was 4-6 hours, which was shorter than the arrival delay for some events.

Problem: Late events attributed to the wrong time window, corrupting model training inputs.

Root cause: Event-time windows closed before all relevant events had arrived, and Kafka TTL was shorter than the late-arrival tail.

Solution: Time windowing with a post-processing pass to correctly attribute late events; raw events stored in HDFS for 1-2 days beyond Kafka TTL for replay.

Outcome: Recommendation model training inputs became accurate and replayable. Shriya Arora presented this at QCon New York 2017.
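
Netflix's fix was a post-processing pass over replayed raw events. In open-source Flink, the closest built-in handles for the same problem are allowed lateness plus a side output for events that still miss their window; the sketch below shows that mechanism with in-line sample data standing in for a Kafka source, and is not the Netflix implementation.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateEventHandling {

    // Counts events per key per window.
    static class CountAggregate implements AggregateFunction<Tuple2<String, Long>, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(Tuple2<String, Long> value, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (titleId, eventTimeMillis) pairs; in production these come from Kafka.
        DataStream<Tuple2<String, Long>> plays = env.fromElements(
                Tuple2.of("title-1", 1_000L), Tuple2.of("title-1", 2_000L));

        OutputTag<Tuple2<String, Long>> lateTag =
                new OutputTag<Tuple2<String, Long>>("late-plays") {};

        SingleOutputStreamOperator<Long> counts = plays
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(30))
                                .withTimestampAssigner((event, ts) -> event.f1))
                .keyBy(value -> value.f0)
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                // Keep window state around so moderately late events still update results.
                .allowedLateness(Time.hours(2))
                // Anything later than that goes to a side output for a correction pass.
                .sideOutputLateData(lateTag)
                .aggregate(new CountAggregate());

        // Feed the very-late events to a re-attribution step instead of dropping them.
        counts.getSideOutput(lateTag).print();
        counts.print();

        env.execute("late-event-handling");
    }
}
```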

Microservice dependency brittleness (Content Finance)

Netflix Content Finance services experienced cascading failures and inconsistent state when upstream synchronous dependencies became unavailable. Split-brain state between services required manual reconciliation.

Problem: Synchronous RPC calls created hard dependencies between services.

Root cause: No buffering or retry layer between producers and consumers of data changes.

Solution: Kafka-based event-driven architecture with standardised event envelopes, at-least-once delivery, idempotent consumers, and reconciliation events for replay. Flink enriches and reorders events before routing to keyed topics consumed by Spring Boot services.

Outcome: Services became independently deployable and resilient to upstream failures. Nitin Sharma presented this at Kafka Summit SF 2019.

Tudum cache invalidation lag (2025)

The Tudum fan site used a CQRS pattern with Kafka to decouple CMS ingestion from the Cassandra read database. An ingestion service published read-optimised content to a Kafka topic; a page data service consumed it for storage.

In practice, the CQRS-with-Kafka architecture introduced cache refresh delays. Editors could not preview content updates for up to 60 seconds per key, which created operational friction for a content-heavy site where updates are frequent.

Problem: Cache invalidation lag of up to 60 seconds per key.

Root cause: CQRS overhead was not justified by the write volume — Tudum is a read-heavy, low-write-volume site.

Solution: Replaced Kafka and CQRS with RAW Hollow, Netflix's in-memory object store, which handles the read-heavy pattern more directly.

Outcome: Simplified architecture with lower latency for content updates. This is a notable documented case where Netflix actively chose to move away from Kafka when a simpler alternative was more appropriate for the use case. Eugene Yemelyanau and Jake Grice documented this on the Netflix TechBlog in July 2025.

Full tech stack

Message broker: Apache Kafka. Central message bus for event ingest, inter-service messaging, Data Mesh connective tissue, the WAL queue, and monitoring event transport.
Stream processing: Apache Flink. All stateless and stateful jobs in Keystone SPaaS and Data Mesh; window sizes from seconds to hours.
Serialisation: Apache Avro. Mandatory serialisation format for all Kafka events; roughly 3-5x smaller than JSON.
Schema registry: Internal Netflix Schema Registry. Schema registration required before any event can be written; enforces schema-at-producer.
Storage sinks: Amazon S3, Apache Iceberg. S3 for cold storage and raw event backup; Iceberg for structured backfill when Kafka retention has expired.
Database / profile store: Apache Cassandra. Member taste profile storage, fed by Flink jobs consuming Kafka streams.
Search index: Elasticsearch. Search index sink from Keystone routing; also used in Netflix Studio Search.
Real-time analytics: Apache Druid. Used alongside Kafka in the live streaming monitoring stack.
Operational stream processing: Mantis. Netflix's reactive stream processing platform; used in the live streaming monitoring stack.
Metrics platform: Atlas. Netflix's internal dimensional time-series metrics platform; receives Kafka operational metrics via a custom MetricReporter.
Distributed tracing: Inca. Netflix-built message tracing and loss detection on a dedicated Kafka cluster; 0.005% false-positive detection rate.
Job state / reconciliation: Amazon RDS. Single source of truth for Keystone SPaaS desired job state; drives the reconciliation protocol.
Metadata / coordination: Apache ZooKeeper. Kafka cluster metadata management in the Keystone era.
Deployment: Spinnaker. Deployment orchestration for Flink jobs and Kafka-connected services.
Alternative message queue: Amazon SQS. Used alongside Kafka in the WAL system for specific namespace configurations.
Consumer application framework: Spring Boot, Spring Cloud Kafka. Consumer application framework used in Content Finance engineering.
In-memory object store: RAW Hollow. Replaced Kafka-based CQRS for Tudum; better suited to read-heavy, low-write workloads.

Key contributors

Allen Wang (Architect, Real Time Data Infrastructure): architected the multi-cluster Kafka infrastructure; contributed rack-aware partition assignment to Apache Kafka; spoke at Surge 2016, Kafka Summit SF 2017, QCon London 2018, and QCon SF 2019; authored the Inca blog post (2019).
Monal Daxini (Lead, Stream Processing, Real Time Data Infrastructure): led Keystone SPaaS; presented at Flink Forward 2016 and 2017 and AWS re:Invent 2017; co-authored "Evolution of the Netflix Data Pipeline" (2016).
Shriya Arora (Senior Data Engineer): presented "Migrating Batch ETL to Stream Processing" at QCon New York 2017; described late-arriving event handling and the Kafka/Flink migration rationale.
Nitin Sharma (Content Finance Infrastructure): authored "How Netflix Uses Kafka for Distributed Streaming" (Confluent blog, 2020); presented "Eventing Things — A Netflix Original!" at Kafka Summit SF 2019.
Zhenzhong Xu (founding engineer, Real Time Data Infrastructure; later led Stream Processing Engines): joined in 2015; co-authored "Evolution of the Netflix Data Pipeline" (2016); published "The Four Innovation Phases of Netflix's Trillions Scale Real-time Data Infrastructure" (Medium, 2022).
Prudhviraj Karumanchi and Vidhya Arvind (Staff Software Engineers, Data Platform): co-authored "Building a Resilient Data Platform with Write-Ahead Log at Netflix" (September 2025); presented at QCon SF 2024.
Eugene Yemelyanau and Jake Grice (Technology Evangelist / Staff Engineer): co-authored "Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow" (July 2025).

Key takeaways for your own Kafka implementation

Netflix's Kafka architecture offers several decisions worth considering when you are planning or scaling your own implementation.

  • Separate your ingest and consumption clusters early. Netflix's split between fronting clusters (producers) and consumer clusters (consumers) addressed a real problem: consumer fan-out degraded producer ingest. If your consumer group count is growing, this separation is worth evaluating before it becomes a performance issue rather than after.
  • Treat configuration choices as a first-class durability concern. Allen Wang's account of data loss at Netflix is instructive because none of the three root causes were Kafka bugs. They were configuration and tooling decisions: low replication factor, weak ack settings, and duplicate consumer group IDs. Auditing these is lower-effort than building additional reliability infrastructure.
  • Enforce schema-at-producer, not schema-at-consumer. Requiring all events to use registered Avro schemas before they reach a topic prevents malformed data from entering the pipeline. Discovering schema issues downstream is significantly more expensive than preventing them at the source.
  • Build external health checks, not only internal ones. Netflix's heartbeat monitoring, which writes and reads from every Kafka cluster via an external service, verifies both the producer and consumer paths independently of broker metrics. Broker health and pipeline health are not the same thing.
  • Revisit architectural choices when use case assumptions change. Netflix's decision to replace Kafka-based CQRS on Tudum with an in-memory store is a useful reminder that Kafka is not always the right tool. The team documented that CQRS overhead was not justified by the actual write volume. Matching the architecture to the access pattern is more important than consistency for its own sake.


Kpow

If you are operating Kafka at scale and want full visibility into consumer lag, topic throughput, and broker health from a single interface, give Kpow a try. You can connect it to any Kafka cluster in minutes and get a free 30-day trial with full access — deploy via Docker, Helm, or JAR.