
How Pinterest uses Apache Kafka in production
Pinterest runs one of the largest Kafka deployments in production: more than 3,000 brokers across 50+ clusters, handling 40 million messages per second at inbound peak and over 1.2 petabytes of data per day. What makes the story worth studying is not the scale alone, but the engineering that Pinterest built around it — a custom management platform, a byte-level replication enhancement, a broker-decoupled tiered storage sidecar, and eventually a second PubSub system that uses Kafka as its own internal notification queue.
Kafka sits at the center of Pinterest's data infrastructure: it carries user activity events from hundreds of millions of users, drives real-time advertising budget enforcement, feeds machine learning signal pipelines spanning a corpus of 300 billion ideas, and now underpins a next-generation CDC ingestion system that reduced data latency from 24 hours to under 15 minutes.
Company overview
Pinterest is a visual discovery platform with hundreds of millions of active users who save, search, and engage with ideas across categories ranging from home design to food to fashion. The platform's core data challenge is connecting users to relevant content at scale, which requires near-real-time understanding of what users are doing across a corpus of 300 billion ideas.
Pinterest began building its Kafka-based logging infrastructure around 2014, when the engineering team evaluated open-source logging alternatives and started work on Singer, an internal logging agent that would eventually ship over one trillion messages per day to Kafka. The first public description of the Kafka pipeline appeared in February 2015, describing a chain from Singer through Kafka into Secor (a log persistence service) and then to S3 for offline analytics via Spark.
By November 2018, Pinterest was running 2,000+ brokers, handling 800 billion messages per day at a peak rate of 15 million messages per second. By December 2020, those figures had grown to 3,000+ brokers across 50+ clusters with a peak inbound throughput of 40 million messages per second.
Pinterest's Kafka use cases
Kafka at Pinterest spans data transport, real-time analytics, CDC, ML signal pipelines, and spam detection. Different engineering teams own different parts of the pipeline, which gives a clear picture of how deeply Kafka is embedded across the organization.
Event streaming and data warehouse transport
The foundational use case is shipping user activity events (impressions, clicks, close-ups, repins) from application servers into the data warehouse. Singer, Pinterest's open-source logging agent, runs as a daemonset on tens of thousands of hosts and writes over one trillion messages per day to Kafka with at-least-once delivery and sub-5ms upload latency. From Kafka, Secor persists events to S3 for offline analytics and ML training pipelines. This is the base layer that most other pipelines build on.
Real-time advertising budget computation
The ads platform uses a Kafka-backed streaming pipeline to compute and enforce advertiser spend limits in real time. This requires low latency and strict ordering guarantees, making Kafka the natural fit rather than the cost-optimized MemQ system Pinterest later built for ML data.
Fresh content indexing and recommendation
Kafka events are consumed by Flink and Spark Streaming to keep the content index up to date with newly published Pins. This drives the recommendation system's awareness of new content as it is created.
Spam detection (Guardian)
The Trust and Safety team taps user activity events from Kafka and passes them through Guardian, a custom real-time analytics and rules engine built on a columnar query model. A separate Flink-based pipeline also processes these events for spam filtering.
Visual signals pipeline (Kappa architecture)
The Content Acquisition and Media Platform team rebuilt their visual signals infrastructure around Kafka and Flink in a Kappa architecture. The system spans 50 ingestion pipelines covering 300 billion ideas. Apache Flink does stateful stream processing over image and video signals, replacing the previous Lambda architecture that maintained both batch and streaming code paths.
Real-time experiment analytics
Pinterest runs hundreds of concurrent product experiments. Previously, detecting a statistically significant drop in an experiment metric required 10+ hours of batch processing. A Kafka-based Flink pipeline — described as Pinterest's first Flink application in production when it launched in October 2019 — brings detection time down to minutes. The pipeline reads from two Kafka topics (filtered_events and filtered_experiment_activations) and supports 200-300 concurrent experiment groups.
Change data capture (CDC)
Two generations of CDC pipelines use Kafka as the transport layer. An earlier system uses Maxwell to capture incremental database changes from MySQL and publish them to Kafka for downstream ad systems. A more recent system, published in February 2026, captures changes from MySQL, TiDB, and KVStore using Debezium and TiCDC, writes them to Kafka with under-one-second latency, and then processes them through Flink into Apache Iceberg tables on S3. Base tables refresh every 15 minutes to 1 hour, compared to 24+ hours with the previous full-table batch approach.
ML training data transport (MemQ)
For ML training data, where latency requirements are relaxed, Pinterest built MemQ: a cloud-native PubSub system that uses S3 micro-batching and achieves up to 90% cost savings compared to an equivalent Kafka deployment. MemQ handles the high-volume, latency-tolerant workloads that Kafka handles less efficiently. Notably, MemQ uses Kafka internally as its own notification queue.
Scale and throughput
As of the December 2020 snapshot published by the Logging Platform team, the fleet stood at:
- 3,000+ brokers across 50+ clusters
- 40 million messages per second at inbound peak
- over 1.2 petabytes of data moved per day
Between November 2018 and December 2020, inbound peak throughput grew from 15 million to 40 million messages per second, driven by the addition of visual signal pipelines, the real-time experiment analytics system, and ongoing user growth.
Pinterest's Kafka architecture
Deployment model
Pinterest runs Kafka entirely on Amazon Web Services, self-managed across three regions: us-east-1 (the primary region, carrying the majority of brokers), us-east-2, and eu-west-1. The cluster fleet runs on EC2 instances. As of the 2018 snapshot, the default instance type was d2.2xlarge with magnetic storage; high-fanout read clusters used d2.8xlarge. By 2020, the team had migrated to SSD-backed instances after identifying magnetic disk I/O wait as a bottleneck during broker replacement events.
Brokers in each cluster are spread across three availability zones, with topics assigned to specific clusters (the "brokerset" concept described below) to limit the blast radius of a cluster failure.
Brokerset: virtual clusters within physical clusters
Rather than mapping topics to individual brokers, Pinterest uses a "brokerset" concept: a static partition assignment group that functions as a virtual Kafka cluster within a physical cluster. Topics are assigned to a brokerset, and partition leaders are placed using a stride-based algorithm that ensures balanced distribution across availability zones. This means that a single physical cluster can host multiple isolated workloads, and a failure in one brokerset does not affect topics assigned to others.
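As an illustration, a stride-based placement of partition replicas across AZ-grouped brokers might look like the following sketch. The function name and data shapes are assumptions for illustration, not Pinterest's actual implementation:

```python
def stride_assign(brokers_by_az, num_partitions, replication_factor=3):
    """Hypothetical stride-based replica placement: walk the AZ list with a
    stride so each partition's replicas land in distinct availability zones
    and load spreads evenly across the brokers of a brokerset."""
    azs = sorted(brokers_by_az)
    assignment = {}
    for p in range(num_partitions):
        replicas = []
        for r in range(replication_factor):
            az = azs[(p + r) % len(azs)]          # stride across AZs
            brokers = brokers_by_az[az]
            replicas.append(brokers[(p // len(azs)) % len(brokers)])
        assignment[p] = replicas                   # replicas[0] = preferred leader
    return assignment
```

With three AZs and replication factor 3, every partition gets one replica per zone, and the preferred leader (the first replica) rotates zone by zone, which is the balance property the brokerset relies on.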
The maximum cluster size is capped at 200 brokers. Clusters beyond a certain size become difficult to operate, so Pinterest provisions new clusters rather than growing existing ones past this limit.
Data pipeline flow
The canonical pipeline runs: application servers → Singer (logging agent, running as a daemonset on each host) → Kafka → downstream consumers (Apache Flink, Spark Streaming, Kafka Streams) and S3 via Secor. Schema formats include Apache Thrift and Protocol Buffers.
The CDC pipeline takes a different path: MySQL, TiDB, and KVStore databases → Debezium or TiCDC connectors → Kafka (sub-one-second write latency) → Apache Flink → Apache Iceberg tables on S3 → Apache Spark (upsert Merge Into operations every 15 minutes to 1 hour).
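The periodic upsert step can be pictured as a Spark SQL MERGE INTO over the Iceberg base table. The table, column, and operation-flag names below are illustrative, not Pinterest's schema:

```python
# Illustrative Spark SQL upsert of a CDC batch into an Iceberg base table.
# Assumes the CDC batch carries an 'op' column marking deletes ('d').
MERGE_SQL = """
MERGE INTO warehouse.db.base_table AS t
USING warehouse.db.cdc_batch AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```

Running a statement of this shape every 15 minutes to 1 hour is what turns the stream of row-level changes into refreshed base tables.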
Cross-region replication
Pinterest uses MirrorMaker v1 to replicate data between the three AWS regions. They extended MirrorMaker with a technique they call Shallow Mirror, which skips decompression and recompression by operating at the RecordBatch level. This reduces CPU load on the MirrorMaker fleet by over 80% at peak. Pinterest proposed Shallow Mirror to the Apache Kafka community as KIP-712.
Management platform: Orion
From around 2021, Pinterest replaced its earlier tooling (DoctorKafka and the Kafka Manager CMAK) with Orion, an open-source cluster management and automation platform (available at pinterest/orion on GitHub). Orion handles automated broker replacement, rolling restarts, configuration updates, cluster upgrades, topic creation via a managed wizard with an approval workflow, and end-to-end data lineage tracking. It provides configurable sensor and operator pipelines that abstract operational tasks into composable automations.
PubSub abstraction: PSC
Pinterest introduced PSC (PubSub Client) in 2022 as a unified Java client library that abstracts Kafka and MemQ behind a Topic URI scheme in the format: protocol:/rn:service:environment:cloud_region:cluster:topic. Applications do not reference broker hostnames directly; service discovery is handled by PSC at runtime. By November 2023, over 90% of Java applications and 100% of Flink jobs had migrated to PSC. The platform team uses PSC to push standardized client configurations and to perform automated error remediation across the fleet.
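A minimal parser for the URI scheme described above might look like this; the field names follow the format string in the text, and the example URI in the usage note is invented:

```python
from typing import NamedTuple

class TopicUri(NamedTuple):
    protocol: str
    service: str
    environment: str
    region: str
    cluster: str
    topic: str

def parse_topic_uri(uri: str) -> TopicUri:
    """Split a PSC-style topic URI of the form
    protocol:/rn:service:environment:cloud_region:cluster:topic."""
    protocol, rest = uri.split(":/", 1)
    rn, service, environment, region, cluster, topic = rest.split(":")
    if rn != "rn":
        raise ValueError(f"not a topic URI: {uri}")
    return TopicUri(protocol, service, environment, region, cluster, topic)
```

For example, `parse_topic_uri("secure:/rn:kafka:prod:aws_us-east-1:logcluster01:user_events")` yields a structured address the client library can resolve to real brokers at runtime, which is what decouples applications from broker hostnames.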
Producer architecture
Producers are required to compress at source. Pinterest observed that using a single-partition partitioner (rather than round-robin) before compression increases compression ratios approximately 2x by improving data locality within each batch. The team standardized inter-broker protocol and log message format versions across all clusters to eliminate unnecessary format conversions.
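The effect of batch locality on compression can be demonstrated with a toy experiment: compressing one large batch versus the same records split round-robin across eight partitions. The record format and counts here are synthetic, chosen only to show the direction of the effect:

```python
import zlib

def total_compressed_size(batches):
    """Compress each batch independently, as a producer would per partition."""
    return sum(len(zlib.compress(b"".join(batch))) for batch in batches)

# Synthetic log records with heavy cross-record redundancy.
records = [
    f"event,user={i % 100},action=click,ts={1_700_000_000 + i}\n".encode()
    for i in range(800)
]

round_robin = [records[p::8] for p in range(8)]  # 8 small per-partition batches
single_partition = [records]                     # one large batch

# Larger batches give the compressor more redundancy to exploit.
assert total_compressed_size(single_partition) < total_compressed_size(round_robin)
```

The magnitude of the gain depends on the data, but concentrating a producer's records into fewer, larger batches reliably improves the ratio, which is the intuition behind the single-partition partitioner.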
Consumer architecture
Consumer group management is handled through Orion's topic governance workflow. The real-time experiment analytics Flink job, as one example, runs at parallelism 256 across 8 master nodes and 50 worker nodes (EC2 c5d.9xlarge instances), checkpointing 100 GB of state to HDFS every 5 seconds. PSC handles consumer offset management across both Kafka and MemQ topics through the same interface.
Stream processing
Apache Flink is Pinterest's primary stream processing engine for Kafka, used across visual signal pipelines, experiment analytics, CDC ingestion, and spam detection. Apache Spark Streaming also consumes from Kafka for certain workloads. Kafka Streams is used for monetization and metrics use cases.
Kafka Connect ecosystem
CDC ingestion uses Debezium (for MySQL) and TiCDC (for TiDB) as Kafka source connectors. An earlier CDC system used Maxwell for MySQL changelog capture. Secor functions as a custom Kafka-to-S3 sink for log persistence.
Special techniques and engineering innovations
AZ-aware partitioning
Pinterest built a custom Kafka partitioner that directs producer messages and consumer reads to partition leaders located in the same AWS availability zone as the client. The partitioner uses the EC2 Metadata API to determine the client's AZ, then cross-references Kafka's rack awareness metadata to identify local partition leaders. If AZ metadata is unavailable, it falls back to selecting from all available partitions. The result is a 25% reduction in cross-AZ data transfer costs. A companion custom S3 transporter partitioner applies the same logic for Secor-based log writes.
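In outline, the selection logic amounts to filtering partitions by the rack of their leader and falling back to all partitions when AZ information is missing. The names below are assumptions, not Pinterest's implementation:

```python
def az_local_partitions(partition_leaders, broker_rack, client_az):
    """Return the partitions whose leader broker shares the client's AZ;
    fall back to every partition when no local leader (or no AZ) is known."""
    if client_az is None:                         # EC2 metadata unavailable
        return list(partition_leaders)
    local = [
        partition
        for partition, leader in partition_leaders.items()
        if broker_rack.get(leader) == client_az   # Kafka rack id == AWS AZ
    ]
    return local or list(partition_leaders)
```

The fallback matters: correctness never depends on AZ metadata being present, so the optimization degrades to stock behavior rather than failing.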
Shallow Mirror
Standard MirrorMaker v1 deserializes and decompresses every RecordBatch, then recompresses it before forwarding, which drives high CPU utilization and causes out-of-memory errors at Pinterest's peak traffic. Shallow Mirror bypasses this by iterating over RecordBatches as byte buffer pointers rather than deserializing them. The implementation required solving three problems: correctly updating the BaseOffset field in each batch without full deserialization, handling the "first batch problem" where MirrorMaker crops the first batch in a fetch response, and maintaining throughput for small-batch high-frequency producers. Deployed in production in 2020, Shallow Mirror reduced MirrorMaker CPU load by over 80% at peak and was proposed to the Apache Kafka community as KIP-712.
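The first of those problems can be illustrated at the byte level: a serialized Kafka RecordBatch begins with an 8-byte big-endian baseOffset, so that field can be rewritten without touching the compressed records that follow. This sketch patches only that field:

```python
import struct

def patch_base_offset(batch: bytes, new_base_offset: int) -> bytes:
    """Rewrite the leading 8-byte baseOffset of a serialized RecordBatch,
    leaving the (still-compressed) record payload untouched."""
    return struct.pack(">q", new_base_offset) + batch[8:]
```

This rewrite is cheap partly because the RecordBatch CRC deliberately excludes the baseOffset field (along with batchLength, partitionLeaderEpoch, and magic), so patching the offset does not invalidate the checksum.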
Graph-algorithm-based leader rebalancing
When partition leadership becomes imbalanced across brokers, naively swapping leaders can still leave some brokers overloaded. Pinterest's Logging Platform team published a two-part solution. Part 1 uses breadth-first search to find single-step leader swap sequences that improve balance. Part 2 models the rebalancing problem as a maximum-flow network to compute optimal swap sequences for more complex imbalances. Critically, both approaches move only partition leaders, not partition data, so the rebalancing operation has a very low cost compared to partition reassignment.
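The Part 1 idea can be sketched as a BFS over brokers, where an edge a → b exists whenever broker a leads a partition that also keeps a replica on broker b (so leadership of that partition can hop from a to b). The function and variable names are invented for illustration:

```python
from collections import deque

def find_relief_path(leaders, replicas, overloaded, underloaded):
    """BFS for a chain of leader moves that shifts one unit of leader load
    from the overloaded broker to the underloaded one. Returns the ordered
    list of partitions whose leadership to move, or None if no chain exists."""
    adjacency = {}
    for partition, leader in leaders.items():
        for broker in replicas[partition]:
            if broker != leader:
                adjacency.setdefault(leader, []).append((broker, partition))

    parent = {overloaded: None}
    queue = deque([overloaded])
    while queue:
        broker = queue.popleft()
        if broker == underloaded:
            path = []
            while parent[broker] is not None:
                broker, partition = parent[broker]
                path.append(partition)
            return list(reversed(path))
        for neighbor, partition in adjacency.get(broker, []):
            if neighbor not in parent:
                parent[neighbor] = (broker, partition)
                queue.append(neighbor)
    return None
```

Because each step in the returned chain is a preferred-leader change rather than a data move, executing the whole path costs only metadata updates, which is why this rebalancing is so much cheaper than partition reassignment.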
Broker-decoupled tiered storage
Rather than adding tiered storage inside the broker process (the approach taken by KIP-405, the native Kafka tiered storage implementation), Pinterest implemented a sidecar architecture. A Segment Uploader process runs alongside each broker, watching the broker's log directories via filesystem watchers. When a segment is finalized, the Segment Uploader uploads it to S3 and tracks progress via offset.wm files. ZooKeeper is used for partition leadership detection to avoid duplicate uploads. S3 object keys use MD5 hash-based prefix entropy to distribute request load and avoid S3 hotspots.
Consumers can be configured in four modes: Remote Only, Kafka Only, Remote Preferred, and Kafka Preferred. The team extended log.segment.delete.delay.ms from the default 60 seconds to 5 minutes for low-traffic topics to avoid missed segment uploads during retention cleanup. Since May 2024, the system has processed approximately 200 TB of data per day across 20+ onboarded topics.
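The MD5 prefix-entropy scheme mentioned above can be sketched as follows; the key layout is an assumption for illustration, not the production format:

```python
import hashlib

def tiered_storage_key(topic: str, partition: int, base_offset: int) -> str:
    """Prepend a short MD5-derived prefix so segment objects for hot
    topic-partitions scatter across the S3 key space instead of piling
    onto one prefix (an S3 request hotspot)."""
    suffix = f"{topic}-{partition}/{base_offset:020d}.log"
    prefix = hashlib.md5(suffix.encode()).hexdigest()[:8]
    return f"{prefix}/{suffix}"
```

The prefix is derived from the key itself, so both the uploader and any remote-reading consumer can compute the same object key independently, with no lookup table required.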
Static membership (KIP-345)
In cloud environments, Kafka Streams consumers restart frequently for reasons unrelated to actual consumer failure: host restarts, rolling deployments, spot instance preemptions. Each restart triggers a full group rebalance and RocksDB state rebuild. Pinterest engineer Liquan Pei collaborated with Confluent's Boyang Chen on the static membership protocol, presented at Kafka Summit SF 2019 and merged into Kafka as KIP-345. Static membership lets consumers re-join a group without triggering a full rebalance by using a persistent instance ID.
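In client terms, opting in to static membership is a configuration change: give each consumer instance a stable `group.instance.id` and widen the session timeout so short-lived restarts complete within it. The values below are illustrative, not Pinterest's settings:

```python
# Illustrative consumer settings for static membership (KIP-345).
static_member_config = {
    "group.id": "streams-app",
    # Stable per-instance ID: a restarted consumer that rejoins with the
    # same ID reclaims its old partitions without a full group rebalance.
    "group.instance.id": "streams-app-host-17",
    # Session timeout must outlive a typical restart/redeploy window,
    # or the broker will evict the instance and rebalance anyway.
    "session.timeout.ms": 300_000,
}
```

The trade-off is detection latency: a genuinely dead static member is only noticed after the (now longer) session timeout expires.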
Upstream Kafka contributions
Pinterest's Logging Platform team has contributed several KIPs and patches to the Apache Kafka project: KIP-91 (the delivery.timeout.ms producer configuration), KIP-245 (passing Properties objects to the KafkaStreams constructor), KIP-276 and KIP-300 (precursors to KIP-345), KAFKA-6896 (producer and consumer metrics in Kafka Streams), and KAFKA-7023 / KAFKA-7103 (RocksDB bulk loading optimizations for Kafka Streams state stores). Vahid Hashemian from the Logging Platform team is an Apache Kafka Committer and PMC Member.
Operating Kafka at scale
Automated cluster healing: DoctorKafka (2017-2020)
The first automation Pinterest built was DoctorKafka (open-sourced in August 2017 at pinterest/DoctorK), which monitored broker metrics via a dedicated Kafka topic, detected broker failures, and automatically reassigned workload. Broker failures were handled almost daily at Pinterest's scale; DoctorKafka reduced failure-related alerts by over 95%. To prevent cascading replica loss when multiple brokers failed simultaneously (given a replication factor of 3), DoctorKafka rate-limited broker replacements to one per time period. The system also analyzed 24-48 hours of historical broker statistics and treated network bandwidth as the primary resource metric for rebalancing decisions.
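The rate-limiting idea is simple enough to sketch: permit at most one replacement per cooldown window, so concurrent broker failures queue up rather than outpacing replication. Class and parameter names here are invented:

```python
class ReplacementRateLimiter:
    """Allow at most one broker replacement per cooldown window, so that
    simultaneous failures cannot strip all replicas of a partition at once."""

    def __init__(self, cooldown_seconds: float, clock):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                # injectable for testing
        self._last_replacement = None

    def try_replace(self) -> bool:
        now = self.clock()
        if (self._last_replacement is None
                or now - self._last_replacement >= self.cooldown_seconds):
            self._last_replacement = now
            return True                   # proceed with this replacement
        return False                      # defer: a replacement is in-window
```

With replication factor 3, deferring the second and third concurrent replacements is what keeps at least one live replica per partition while the first replacement catches up.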
Orion (2020 onward)
Orion replaced DoctorKafka and CMAK as Pinterest's primary cluster management interface. It provides automated broker replacement, rolling upgrades, configuration updates, and cluster provisioning. Topic creation goes through a managed wizard with an approval workflow inside Orion, and lineage is tracked end-to-end from producer to consumer. Orion is open-sourced at pinterest/orion.
Client configuration standardization via PSC
Before PSC, applications maintained direct Kafka client dependencies with hardcoded broker hostnames and SSL passwords scattered across configurations. This created outage exposure during maintenance events. PSC introduced URI-based service discovery, standardized client configurations delivered by the platform team, and automated error remediation for common remediable exceptions. The result was over an 80% reduction in Flink application restarts caused by remediable client exceptions, and approximately 275 fewer FTE hours per year in keep-the-lights-on work.
Kafka version management
As of 2018, Pinterest maintained a monthly cadence for tracking open-source Kafka release branches. By 2020, the fleet ran on version 2.3.1 with cherry-picked fixes. A key operational improvement was standardizing the inter-broker protocol version and log message format version across all clusters, which eliminated the overhead of unnecessary in-flight message format conversions during rolling upgrades.
Broker configuration
Broker heap size is set to 8 GB, increased from 4 GB after TLS was enabled (TLS adds approximately 122 KB of memory per SSL KafkaChannel at scale). Pinterest evaluated EBS st1 volumes as a storage alternative to local disks but found d2 local storage superior for their access patterns; later, they migrated from magnetic to SSD-backed instances after identifying I/O wait as the bottleneck during broker replacement.
Upgrade validation
Before rolling out a Kafka version upgrade, Pinterest constructs a parallel test pipeline that mirrors the topology of the target production pipeline. Double publishing sends data through both old and new infrastructure simultaneously, and the team compares throughput, latency, and error metrics before promoting the new version to production.
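The comparison step reduces to diffing metric snapshots from the two pipelines; this sketch flags any metric that drifts beyond a relative tolerance (the metric names and the 5% tolerance are assumptions):

```python
def find_regressions(old_metrics, new_metrics, tolerance=0.05):
    """Compare per-metric values from the old and new pipelines and
    return the metrics whose relative drift exceeds the tolerance."""
    regressions = {}
    for name, old_value in old_metrics.items():
        new_value = new_metrics[name]
        budget = tolerance * max(abs(old_value), 1e-9)  # avoid zero-division
        if abs(new_value - old_value) > budget:
            regressions[name] = (old_value, new_value)
    return regressions
```

An empty result is the promotion gate: only when throughput, latency, and error metrics all stay within budget does the new version replace the old.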
Challenges and how they solved them
Daily broker failures at cloud scale (2017)
At 1,000+ brokers, hardware failures were routine. Partition reassignment alone was not fast enough to handle degraded brokers, and manual intervention did not scale. The solution was DoctorKafka: automated detection of broker degradation via metrics published to a Kafka topic, followed by automatic workload reassignment. Rate-limiting was critical: without it, simultaneous failures of multiple brokers could exceed the replication factor, causing data loss. The result was over a 95% reduction in failure-related operational alerts.
Cross-AZ data transfer costs (2019)
Default Kafka producer and consumer behavior assigns clients to any available partition leader regardless of availability zone. At Pinterest's scale, this generated substantial cross-AZ network transfer costs. The team built a custom partitioner using the EC2 Metadata API and Kafka rack awareness to route producers and consumers to partition leaders in their own AZ. The custom S3 transporter partitioner applied the same logic for Secor writes. Result: 25% reduction in cross-AZ transfer costs.
MirrorMaker CPU spikes during cross-region replication (2020)
Standard MirrorMaker v1 decompresses and recompresses every RecordBatch during cross-region replication. At Pinterest's throughput, this caused sustained CPU spikes and out-of-memory errors during peak traffic. The Shallow Mirror implementation bypassed the decompress/recompress cycle by treating RecordBatches as opaque byte buffers, requiring custom handling for BaseOffset modification and the batch-cropping issue introduced by MirrorMaker's fetch response handling. The result was over an 80% reduction in MirrorMaker CPU load.
Imbalanced partition leadership causing broker overload (2020)
As the cluster fleet grew, partition leadership distribution became uneven, overloading individual brokers. Naive leader swap approaches could move leaders without materially improving the overall balance. Pinterest modeled the rebalancing problem using graph algorithms: BFS for single-step swaps, and maximum-flow networks for general multi-step rebalancing. Both approaches operate only on leader assignments, not on partition data, keeping the operation lightweight.
Magnetic disk I/O bottleneck during broker replacement (pre-2020)
On d2.2xlarge instances with magnetic storage, brokers undergoing replacement showed 10-30% CPU I/O wait during the recovery process. Migrating to SSD-backed instances reduced I/O wait to under 0.1% at the p100 level.
Client misconfiguration and tight coupling to broker endpoints (pre-2022)
Before PSC, applications embedded Kafka broker hostnames, SSL passwords, and client configurations directly. Maintenance events that changed broker endpoints required coordinated client restarts. PSC addressed this by introducing URI-based topic addressing with server-side service discovery, platform-managed client configurations, and automated remediation for known failure modes.
Kafka unsuitable for ML training data transport (2018-2020)
Kafka's strict ordering guarantees, partition rigidity, rebalancing overhead, and replication costs made it an expensive fit for high-volume ML training data pipelines, where sub-second latency is not required. The solution was MemQ: a cloud-native PubSub system using S3 micro-batching that achieves up to 90% cost reduction versus an equivalent Kafka deployment. MemQ handles latency-tolerant ML workloads while Kafka continues to handle all use cases that require strict ordering and low latency. MemQ uses Kafka as its internal notification queue.
Legacy batch database ingestion taking 24+ hours (pre-2026)
The legacy full-table batch approach reprocessed over 95% of unchanged data on every cycle, creating a 24-hour data latency and unnecessary compute cost. The next-generation CDC pipeline (Debezium/TiCDC → Kafka → Flink → Apache Iceberg) processes only the roughly 5% of records that change in a given day. CDC events land in Kafka within under one second. Base tables on S3 refresh every 15 minutes to 1 hour. Bucket partitioning by primary key hash, combined with bucket joins in downstream Spark queries, delivered a 40%+ reduction in compute cost.
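The bucketing scheme can be sketched as hashing each primary key into a fixed number of buckets. Iceberg's actual bucket transform uses murmur3; the MD5 here is purely illustrative:

```python
import hashlib

def bucket_of(primary_key, num_buckets: int = 32) -> int:
    """Map a primary key to a stable bucket; rows with the same key always
    land in the same bucket, so downstream joins on the key can proceed
    bucket-by-bucket without a full shuffle."""
    digest = hashlib.md5(str(primary_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets
```

Because both sides of a join are partitioned by the same deterministic function, Spark can join bucket i of one table against bucket i of the other, which is where the compute savings come from.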
Key takeaways for your own Kafka implementation
Design for cluster isolation from the start. Pinterest's brokerset concept — virtual clusters with stride-based partition assignment within a physical cluster — gives you topic-level failure isolation without the operational overhead of provisioning a separate physical cluster for every workload. The key is capping cluster size (Pinterest uses 200 brokers) and using the brokerset as the unit of risk management.
AZ-aware routing pays for itself at scale. Default Kafka clients are AZ-agnostic, and at a large enough fleet the cross-AZ data transfer costs are non-trivial. Pinterest built a custom partitioner using EC2 Metadata and Kafka rack awareness that routes producers and consumers to local partition leaders, achieving a 25% reduction in transfer costs. This is a relatively low-effort change for the savings it produces in cloud environments.
Replication overhead is often decompression, not network. Pinterest found that MirrorMaker's CPU cost was dominated by decompression and recompression, not by network I/O. The Shallow Mirror approach — treating RecordBatches as opaque byte buffers during replication — reduced MirrorMaker CPU load by over 80%. Before investing in replication hardware, it is worth profiling whether the bottleneck is actually the wire or the processing.
Broker-side tiered storage is not the only option. Pinterest deliberately chose a sidecar architecture for tiered storage rather than waiting for KIP-405's in-broker implementation. The sidecar Segment Uploader is operationally independent of the broker process, which simplifies upgrades and means a broker restart does not interrupt the upload pipeline. If tiered storage is a priority, evaluate whether broker-coupled or broker-decoupled implementations better fit your upgrade cadence and failure domains.
A unified client abstraction reduces operational debt significantly. PSC gave Pinterest's platform team a single control point for client configurations, service discovery, and automated error remediation across both Kafka and MemQ. The result was over 80% fewer Flink restarts from client errors and an estimated 275 fewer FTE hours per year in KTLO work. A thin abstraction layer that separates application code from broker topology is worth the investment when you operate at more than a handful of clusters.
Sources and further reading
- How Pinterest runs Kafka at scale - Henry Cai, Shawn Nguyen, Yi Yin, Liquan Pei et al., Pinterest Engineering Blog, 2018
- Real-time analytics at Pinterest - Krishna Gade, Pinterest Engineering Blog, 2015
- Running Kafka at scale at Pinterest - Eric Lopez, Heng Zhang, Henry Cai, Jeff Xiang et al., Confluent Engineering Blog, 2021
- Pinterest tiered storage for Apache Kafka: a broker-decoupled approach - Jeff Xiang, Vahid Hashemian, Pinterest Engineering Blog, 2024
- Open-sourcing DoctorKafka: Kafka cluster healing and workload balancing - Yu Yang, Pinterest Engineering Blog, 2017
- Using graph algorithms to optimize Kafka operations, Part 1 - Ping-Min Lin, Pinterest Engineering Blog, 2020
- Shallow Mirror - Henry Cai, Pinterest Engineering Blog, 2021
- Fighting spam with Guardian: a real-time analytics and rules engine - Hongkai Pan, Pinterest Engineering Blog, 2021
- Pinterest visual signals infrastructure: evolution from Lambda to Kappa architecture - Ankit Patel, Pinterest Engineering Blog, 2020
- Real-time experiment analytics at Pinterest using Apache Flink - Parag Kesar, Ben Liu, Pinterest Engineering Blog, 2019
- MemQ: an efficient, scalable cloud-native PubSub system - Ambud Sharma, Pinterest Engineering Blog, 2021
- Unified PubSub Client at Pinterest - Vahid Hashemian, Jeff Xiang, Pinterest Engineering Blog, 2022
- Open-sourcing Singer: Pinterest's performant and reliable logging agent - Ambud Sharma, Indy Prentice, Henry Cai, Shawn Nguyen et al., Pinterest Engineering Blog, 2019
- pinterest/orion on GitHub - Pinterest Engineering
- Next-generation DB ingestion at Pinterest - Liang Mou, Yisheng Zhou, Elizabeth Nguyen, Owen Zhang, Pinterest Engineering Blog, 2026
- Running Unified PubSub Client in production at Pinterest - Jeff Xiang, Vahid Hashemian, Jesus Zuniga, Pinterest Engineering Blog, 2023
- Optimizing Kafka for the cloud - Eric Lopez, Henry Cai, Heng Zhang, Ping-Min Lin et al., Pinterest Engineering Blog, 2019
- Static membership rebalance strategy designed for the cloud (Kafka Summit SF 2019) - Liquan Pei (Pinterest), Boyang Chen (Confluent), Kafka Summit SF 2019
If you are managing a Kafka deployment at scale, Kpow gives you a single interface for monitoring consumer lag, inspecting topics, and managing cluster operations across multiple clusters. A free 30-day trial is available and connects to any Kafka cluster in minutes via Docker, Helm, or JAR.