How Pinterest uses Apache Kafka in production

Factor House
May 16th, 2026

Pinterest runs one of the largest Kafka deployments in production: more than 3,000 brokers across 50+ clusters, handling 40 million messages per second at inbound peak and over 1.2 petabytes of data per day. What makes the story worth studying is not the scale alone, but the engineering that Pinterest built around it — a custom management platform, a byte-level replication enhancement, a broker-decoupled tiered storage sidecar, and eventually a second PubSub system that uses Kafka as its own internal notification queue.

Kafka sits at the center of Pinterest's data infrastructure: it carries user activity events from hundreds of millions of users, drives real-time advertising budget enforcement, feeds machine learning signal pipelines covering 300 billion ideas, and now underpins a next-generation CDC ingestion system that reduced data latency from more than 24 hours to as little as 15 minutes.

Company overview

Pinterest is a visual discovery platform with hundreds of millions of active users who save, search, and engage with ideas across categories ranging from home design to food to fashion. The platform's core data challenge is connecting users to relevant content at scale, which requires near-real-time understanding of what users are doing across a corpus of 300 billion ideas.

Pinterest began building its Kafka-based logging infrastructure around 2014, when the engineering team evaluated open-source logging alternatives and started work on Singer, an internal logging agent that would eventually ship over one trillion messages per day to Kafka. The first public description of the Kafka pipeline appeared in February 2015, describing a chain from Singer through Kafka into Secor (a log persistence service) and then to S3 for offline analytics via Spark.

By November 2018, Pinterest was running 2,000+ brokers, handling 800 billion messages per day at a peak rate of 15 million messages per second. By December 2020, those figures had grown to 3,000+ brokers across 50+ clusters with a peak inbound throughput of 40 million messages per second.

Date | Event
2014 | Pinterest begins evaluating open-source logging alternatives; Singer development starts
2015-02 | First public description of Kafka-based real-time analytics pipeline (Singer → Kafka → Secor → S3 → Spark/MemSQL)
2017-08 | DoctorKafka open-sourced (pinterest/DoctorK); managing 1,000+ brokers at the time
2018-11 | Scale snapshot: 2,000+ brokers, 800 billion messages/day, 1.2 PB/day, 15 million messages/sec peak
2019-04 | AZ-aware partitioning deployed, delivering a 25% reduction in cross-AZ transfer costs
2019-06 | Singer open-sourced (pinterest/singer)
2019-10 | Real-time experiment analytics pipeline with Flink reaches production — described as Pinterest's first Flink application in production
2019 | Formal evaluation of PubSub scalability alternatives begins; leads to the MemQ design
2020 | Shallow Mirror deployed in production; major Kafka upgrade to 2.3.1 completed; broker fleet migrated from magnetic to SSD storage
2020-mid | MemQ enters production
2020-12 | Scale snapshot: 50+ clusters, 3,000+ brokers, 3,000+ topics, ~500,000 partitions, 40 million+ messages/sec inbound peak
2021-11 | MemQ announced publicly: up to 90% cheaper than an equivalent Kafka deployment for ML data transport
2022-03 | Unified PubSub Client (PSC) announced; Kafka and MemQ abstracted behind Topic URIs
2023-11 | PSC in production at scale: over 90% Java app migration, 100% Flink migration, over 80% reduction in Flink restarts
2024-05 | Broker-decoupled tiered storage enters production (20+ topics, ~200 TB/day offloaded to S3)
2026-02 | Next-generation CDC ingestion pipeline published: Debezium/TiCDC → Kafka → Flink → Iceberg reduces data latency from 24+ hours to 15 minutes across thousands of pipelines

Pinterest's Kafka use cases

Kafka at Pinterest spans data transport, real-time analytics, CDC, ML signal pipelines, and spam detection. Different engineering teams own different parts of the pipeline, which gives a clear picture of how deeply Kafka is embedded across the organization.

Event streaming and data warehouse transport

The foundational use case is shipping user activity events (impressions, clicks, close-ups, repins) from application servers into the data warehouse. Singer, Pinterest's open-source logging agent, runs as a daemonset on tens of thousands of hosts and writes over one trillion messages per day to Kafka with at-least-once delivery and sub-5ms upload latency. From Kafka, Secor persists events to S3 for offline analytics and ML training pipelines. This is the base layer that most other pipelines build on.

Real-time advertising budget computation

The ads platform uses a Kafka-backed streaming pipeline to compute and enforce advertiser spend limits in real time. This requires low latency and strict ordering guarantees, making Kafka the natural fit rather than the cost-optimized MemQ system Pinterest later built for ML data.

Fresh content indexing and recommendation

Kafka events are consumed by Flink and Spark Streaming to keep the content index up to date with newly published Pins. This drives the recommendation system's awareness of new content as it is created.

Spam detection (Guardian)

The Trust and Safety team taps user activity events from Kafka and passes them through Guardian, a custom real-time analytics and rules engine built on a columnar query model. A separate Flink-based pipeline also processes these events for spam filtering.

Visual signals pipeline (Kappa architecture)

The Content Acquisition and Media Platform team rebuilt its visual signals infrastructure around Kafka and Flink in a Kappa architecture. The system spans 50 ingestion pipelines covering 300 billion ideas. Apache Flink does stateful stream processing over image and video signals, replacing the previous Lambda architecture that maintained both batch and streaming code paths.

Real-time experiment analytics

Pinterest runs hundreds of concurrent product experiments. Previously, detecting a statistically significant drop in an experiment metric required 10+ hours of batch processing. A Kafka-based Flink pipeline — described as Pinterest's first Flink application in production when it launched in October 2019 — brings detection time down to minutes. The pipeline reads from two Kafka topics (filtered_events and filtered_experiment_activations) and supports 200-300 concurrent experiment groups.

Change data capture (CDC)

Two generations of CDC pipelines use Kafka as the transport layer. An earlier system uses Maxwell to capture incremental database changes from MySQL and publish them to Kafka for downstream ad systems. A more recent system, published in February 2026, captures changes from MySQL, TiDB, and KVStore using Debezium and TiCDC, writes them to Kafka with under-one-second latency, and then processes them through Flink into Apache Iceberg tables on S3. Base tables refresh every 15 minutes to 1 hour, compared to 24+ hours with the previous full-table batch approach.

ML training data transport (MemQ)

For ML training data, where latency requirements are relaxed, Pinterest built MemQ: a cloud-native PubSub system that uses S3 micro-batching and achieves up to 90% cost savings compared to an equivalent Kafka deployment. MemQ handles the high-volume, latency-tolerant workloads that Kafka handles less efficiently. Notably, MemQ uses Kafka internally as its own notification queue.

Scale and throughput

As of the December 2020 snapshot published by the Logging Platform team:

Metric | Value
Production clusters | 50+
Brokers | 3,000+
Topics | 3,000+
Partitions | ~500,000
Inbound peak throughput | ~25 GB/s (40 million+ messages/second)
Outbound peak throughput | ~50 GB/s
Daily messages (from Singer alone) | 1 trillion+
Max brokers per cluster | 200
Replication factor | 3 (across three availability zones)
AWS regions | 3 (us-east-1 primary, us-east-2, eu-west-1)
Tiered storage (since May 2024) | ~200 TB/day offloaded from broker disk to S3 across 20+ onboarded topics

Between November 2018 and December 2020, inbound peak throughput grew from 15 million to 40 million messages per second, driven by the addition of visual signal pipelines, the real-time experiment analytics system, and ongoing user growth.

Pinterest's Kafka architecture

Deployment model

Pinterest runs Kafka entirely on Amazon Web Services, self-managed across three regions: us-east-1 (the primary region, carrying the majority of brokers), us-east-2, and eu-west-1. The cluster fleet runs on EC2 instances. As of the 2018 snapshot, the default instance type was d2.2xlarge with magnetic storage; high-fanout read clusters used d2.8xlarge. By 2020, the team had migrated to SSD-backed instances after identifying magnetic disk I/O wait as a bottleneck during broker replacement events.

Brokers in each cluster are spread across three availability zones, with topics assigned to specific clusters (the "brokerset" concept described below) to limit the blast radius of a cluster failure.

Brokerset: virtual clusters within physical clusters

Rather than mapping topics to individual brokers, Pinterest uses a "brokerset" concept: a static partition assignment group that functions as a virtual Kafka cluster within a physical cluster. Topics are assigned to a brokerset, and partition leaders are placed using a stride-based algorithm that ensures balanced distribution across availability zones. This means that a single physical cluster can host multiple isolated workloads, and a failure in one brokerset does not affect topics assigned to others.
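
To make the idea concrete, here is a minimal sketch of what stride-based placement inside a brokerset could look like. The class, parameter names, and arithmetic are invented for illustration and are a simplification; Pinterest's actual assignment logic is not public in this form.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stride-based replica placement inside a "brokerset".
// Broker IDs are assumed to be ordered so that consecutive entries rotate
// through availability zones (az1, az2, az3, az1, ...). Names and layout are
// invented for this sketch.
public class StrideAssignmentSketch {

    /** For each partition, returns the ordered broker IDs hosting its replicas (leader first). */
    static List<List<Integer>> assign(List<Integer> brokerset, int partitions,
                                      int replicationFactor, int stride) {
        int size = brokerset.size();
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // The stride spreads a partition's replicas across the brokerset
                // (and, with AZ-ordered broker IDs, across zones), while consecutive
                // partitions get different leaders so leadership stays balanced.
                replicas.add(brokerset.get((p + r * stride) % size));
            }
            assignment.add(replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // A 6-broker brokerset carved out of a larger physical cluster.
        List<Integer> brokerset = List.of(101, 102, 103, 104, 105, 106);
        assign(brokerset, 12, 3, 2).forEach(System.out::println);
    }
}
```

With AZ-ordered broker IDs, the stride keeps each partition's replicas in different zones while leadership stays evenly spread across the brokerset.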

The maximum cluster size is capped at 200 brokers. Clusters beyond a certain size become difficult to operate, so Pinterest provisions new clusters rather than growing existing ones past this limit.

Data pipeline flow

The canonical pipeline runs: application servers → Singer (logging agent, running as a daemonset on each host) → Kafka → downstream consumers (Apache Flink, Spark Streaming, Kafka Streams) and S3 via Secor. Schema formats include Apache Thrift and Protocol Buffers.

The CDC pipeline takes a different path: MySQL, TiDB, and KVStore databases → Debezium or TiCDC connectors → Kafka (sub-one-second write latency) → Apache Flink → Apache Iceberg tables on S3 → Apache Spark (upsert Merge Into operations every 15 minutes to 1 hour).

Cross-region replication

Pinterest uses MirrorMaker v1 to replicate data between the three AWS regions. They extended MirrorMaker with a technique they call Shallow Mirror, which skips decompression and recompression by operating at the RecordBatch level. This reduces CPU load on the MirrorMaker fleet by over 80% at peak. Pinterest proposed Shallow Mirror to the Apache Kafka community as KIP-712.

Management platform: Orion

From around 2020, Pinterest replaced its earlier tooling (DoctorKafka and the Kafka Manager CMAK) with Orion, an open-source cluster management and automation platform (available at pinterest/orion on GitHub). Orion handles automated broker replacement, rolling restarts, configuration updates, cluster upgrades, topic creation via a managed wizard with an approval workflow, and end-to-end data lineage tracking. It provides configurable sensor and operator pipelines that abstract operational tasks into composable automations.

PubSub abstraction: PSC

Pinterest introduced PSC (PubSub Client) in 2022 as a unified Java client library that abstracts Kafka and MemQ behind a Topic URI scheme in the format: protocol:/rn:service:environment:cloud_region:cluster:topic. Applications do not reference broker hostnames directly; service discovery is handled by PSC at runtime. By November 2023, over 90% of Java applications and 100% of Flink jobs had migrated to PSC. The platform team uses PSC to push standardized client configurations and to perform automated error remediation across the fleet.
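
The Topic URI shape is easiest to see with a small example. The parser below is purely illustrative: the URI value is made up, and the real PSC library (pinterest/psc) resolves these URIs through its own service discovery rather than simple string splitting.

```java
// Purely illustrative parser for the Topic URI shape described above. The real
// PSC client (pinterest/psc) resolves these URIs to clusters through its own
// service discovery; the example URI below is made up.
public class TopicUriSketch {

    record TopicUri(String protocol, String service, String environment,
                    String region, String cluster, String topic) {}

    static TopicUri parse(String uri) {
        // protocol:/rn:service:environment:cloud_region:cluster:topic
        String[] head = uri.split(":/rn:", 2);
        String[] rest = head[1].split(":", 5);
        return new TopicUri(head[0], rest[0], rest[1], rest[2], rest[3], rest[4]);
    }

    public static void main(String[] args) {
        // Hypothetical URI; applications never see broker hostnames.
        System.out.println(parse("plaintext:/rn:kafka:prod:aws_us-east-1:logging-cluster-01:user_events"));
    }
}
```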

Producer architecture

Producers are required to compress at source. Pinterest observed that using a single-partition partitioner (rather than round-robin) before compression increases compression ratios approximately 2x by improving data locality within each batch. The team standardized inter-broker protocol and log message format versions across all clusters to eliminate unnecessary format conversions.
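
A minimal sketch of a producer configured in this spirit is shown below: compression happens at the source, and linger/batch settings give each batch enough data to compress well. The endpoint and values are placeholders, not Pinterest's production settings; note that recent Kafka clients already batch unkeyed records onto a single partition at a time (the sticky partitioner), which approximates the single-partition behavior described above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

// Illustrative "compress at source" producer configuration. Endpoint, codec,
// and sizes are placeholders, not Pinterest's production values.
public class CompressingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");               // compress on the producer
        props.put(ProducerConfig.LINGER_MS_CONFIG, "50");                        // wait briefly to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(256 * 1024)); // larger batches compress better
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // send() calls go here; unkeyed records are batched per partition,
            // which keeps each compressed batch local to one partition.
        }
    }
}
```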

Consumer architecture

Consumer group management is handled through Orion's topic governance workflow. The real-time experiment analytics Flink job, as one example, runs at parallelism 256 across 8 master nodes and 50 worker nodes (EC2 c5d.9xlarge instances), checkpointing 100 GB of state to HDFS every 5 seconds. PSC handles consumer offset management across both Kafka and MemQ topics through the same interface.

Stream processing

Apache Flink is Pinterest's primary stream processing engine for Kafka, used across visual signal pipelines, experiment analytics, CDC ingestion, and spam detection. Apache Spark Streaming also consumes from Kafka for certain workloads. Kafka Streams is used for monetisation and metrics use cases.

Kafka Connect ecosystem

CDC ingestion uses Debezium (for MySQL) and TiCDC (for TiDB) as Kafka source connectors. An earlier CDC system used Maxwell for MySQL changelog capture. Secor functions as a custom Kafka-to-S3 sink for log persistence.

Special techniques and engineering innovations

AZ-aware partitioning

Pinterest built a custom Kafka partitioner that directs producer messages and consumer reads to partition leaders located in the same AWS availability zone as the client. The partitioner uses the EC2 Metadata API to determine the client's AZ, then cross-references Kafka's rack awareness metadata to identify local partition leaders. If AZ metadata is unavailable, it falls back to selecting from all available partitions. The result is a 25% reduction in cross-AZ data transfer costs. A companion custom S3 transporter partitioner applies the same logic for Secor-based log writes.
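
The sketch below shows the general shape of such a partitioner, assuming brokers advertise their AZ via broker.rack and the client's AZ has been read once from the EC2 metadata endpoint (http://169.254.169.254/latest/meta-data/placement/availability-zone). The config key and fallback policy are assumptions for illustration; Pinterest's implementation is not public.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

// Sketch of an AZ-aware producer partitioner. Assumes broker.rack is set to
// each broker's availability zone and the client's own AZ is supplied via a
// hypothetical config key. Not Pinterest's implementation, only the idea.
public class AzAwarePartitionerSketch implements Partitioner {
    private String clientAz;

    @Override
    public void configure(Map<String, ?> configs) {
        Object az = configs.get("client.availability.zone"); // hypothetical config key
        this.clientAz = az == null ? null : az.toString();
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.availablePartitionsForTopic(topic);
        if (partitions.isEmpty()) {
            partitions = cluster.partitionsForTopic(topic);
        }
        if (clientAz != null) {
            // Prefer partitions whose current leader sits in the same AZ as the client.
            List<PartitionInfo> local = partitions.stream()
                    .filter(p -> p.leader() != null && clientAz.equals(p.leader().rack()))
                    .toList();
            if (!local.isEmpty()) {
                partitions = local;
            }
        }
        // Fall back to any available partition when no local leader exists.
        return partitions.get(ThreadLocalRandom.current().nextInt(partitions.size())).partition();
    }

    @Override
    public void close() {}
}
```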

Shallow Mirror

Standard MirrorMaker v1 deserializes every RecordBatch and recompresses it before forwarding, which drives high CPU utilization and causes out-of-memory errors at Pinterest's peak traffic. Shallow Mirror bypasses this by iterating over RecordBatches as byte buffer pointers rather than deserializing them. The implementation required solving three problems: correctly updating the BaseOffset field in each batch without full deserialization, handling the "first batch problem" where MirrorMaker crops the first batch in a fetch response, and maintaining throughput for small-batch high-frequency producers. Deployed in production in 2020, Shallow Mirror reduced MirrorMaker CPU load by over 80% at peak and was proposed to the Apache Kafka community as KIP-712.
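
A rough sketch of the byte-level idea, using Kafka's internal record classes: batches are walked via their headers only, and the compressed payload is never opened. The offset rewriting and first-batch handling that KIP-712 describes are deliberately omitted here.

```java
import java.nio.ByteBuffer;
import org.apache.kafka.common.record.MemoryRecords;
import org.apache.kafka.common.record.MutableRecordBatch;

// Sketch of the core Shallow Mirror idea: iterate RecordBatches as raw byte
// spans instead of decompressing individual records. The buffer stands in for
// one partition's batch bytes from a fetch response; KIP-712's base offset
// rewriting and cropped-first-batch handling are omitted.
public class ShallowBatchScanSketch {
    static void scan(ByteBuffer fetchedPartitionBytes) {
        MemoryRecords records = MemoryRecords.readableRecords(fetchedPartitionBytes);
        for (MutableRecordBatch batch : records.batches()) {
            // Only header fields are touched; the compressed payload stays opaque.
            System.out.printf("offsets %d..%d, %d bytes, compression=%s%n",
                    batch.baseOffset(), batch.lastOffset(),
                    batch.sizeInBytes(), batch.compressionType());
        }
    }
}
```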

Graph-algorithm-based leader rebalancing

When partition leadership becomes imbalanced across brokers, naively swapping leaders can still leave some brokers overloaded. Pinterest's Logging Platform team published a two-part solution. Part 1 uses breadth-first search to find single-step leader swap sequences that improve balance. Part 2 models the rebalancing problem as a maximum-flow network to compute optimal swap sequences for more complex imbalances. Critically, both approaches move only partition leaders, not partition data, so the rebalancing operation has a very low cost compared to partition reassignment.
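
As a toy illustration of the Part 1 idea, the sketch below runs a BFS over brokers, where an edge a → b means some partition led by a has b as a follower, so leadership can move from a to b with a metadata-only change. The real algorithm weighs per-partition traffic and chooses swap sequences more carefully; everything here is simplified.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy BFS in the spirit of Part 1: leadership (never data) moves from an
// overloaded broker toward an underloaded one. This is a simplification for
// illustration, not Pinterest's published code.
public class LeaderSwapBfsSketch {

    /** replicaSets: per partition, the replica broker IDs with the current leader first. */
    static List<Integer> findReliefPath(int overloaded, Set<Integer> underloaded,
                                        List<List<Integer>> replicaSets) {
        // Build edges: leader -> each of its followers (a legal leadership move).
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (List<Integer> replicas : replicaSets) {
            int leader = replicas.get(0);
            for (int follower : replicas.subList(1, replicas.size())) {
                edges.computeIfAbsent(leader, k -> new HashSet<>()).add(follower);
            }
        }
        Map<Integer, Integer> parent = new HashMap<>();
        ArrayDeque<Integer> queue = new ArrayDeque<>(List.of(overloaded));
        parent.put(overloaded, null);
        while (!queue.isEmpty()) {
            int broker = queue.poll();
            if (underloaded.contains(broker)) {
                // Rebuild the broker chain; each hop is one leadership move.
                List<Integer> path = new ArrayList<>();
                for (Integer b = broker; b != null; b = parent.get(b)) {
                    path.add(0, b);
                }
                return path;
            }
            for (int next : edges.getOrDefault(broker, Set.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, broker);
                    queue.add(next);
                }
            }
        }
        return List.of(); // no relieving swap sequence exists
    }
}
```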

Broker-decoupled tiered storage

Rather than adding tiered storage inside the broker process (the approach taken by KIP-405, the native Kafka tiered storage implementation), Pinterest implemented a sidecar architecture. A Segment Uploader process runs alongside each broker, watching the broker's log directories via filesystem watchers. When a segment is finalized, the Segment Uploader uploads it to S3 and tracks progress via offset.wm files. ZooKeeper is used for partition leadership detection to avoid duplicate uploads. S3 object keys use MD5 hash-based prefix entropy to distribute request load and avoid S3 hotspots.
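
The hash-prefix trick is small enough to show directly. The key layout below is hypothetical; the point is only that an MD5-derived prefix spreads segment uploads across the S3 key space instead of piling them under one topic prefix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative construction of an S3 object key with MD5-based prefix entropy,
// so uploads for many topic-partitions spread across the S3 key space instead
// of hot-spotting one prefix. The key layout shown here is made up.
public class SegmentKeySketch {
    static String segmentKey(String topic, int partition, long baseOffset)
            throws NoSuchAlgorithmException {
        String logicalPath = String.format("%s/%d/%020d.log", topic, partition, baseOffset);
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(logicalPath.getBytes(StandardCharsets.UTF_8));
        // Use the first two digest bytes as a four-hex-character prefix.
        String prefix = String.format("%02x%02x", md5[0] & 0xff, md5[1] & 0xff);
        return prefix + "/" + logicalPath;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(segmentKey("user_events", 42, 123456789L));
    }
}
```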

Consumers can be configured in four modes: Remote Only, Kafka Only, Remote Preferred, and Kafka Preferred. The team extended log.segment.delete.delay.ms from the default 60 seconds to 5 minutes for low-traffic topics to avoid missed segment uploads during retention cleanup. Since May 2024, the system has processed approximately 200 TB of data per day across 20+ onboarded topics.

Static membership (KIP-345)

In cloud environments, Kafka Streams consumers restart frequently for reasons unrelated to actual consumer failure: host restarts, rolling deployments, spot instance preemptions. Each restart triggers a full group rebalance and RocksDB state rebuild. Pinterest engineer Liquan Pei collaborated with Confluent's Boyang Chen on the static membership protocol, presented at Kafka Summit SF 2019 and merged into Kafka as KIP-345. Static membership lets consumers re-join a group without triggering a full rebalance by using a persistent instance ID.
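
In client terms, static membership is a single configuration change: give each instance a stable group.instance.id and a session timeout long enough to cover a restart. The values below are illustrative; Kafka Streams applications pass the same setting through their embedded consumer configuration.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Minimal static-membership consumer: a stable group.instance.id lets this
// instance rejoin after a restart without triggering a full group rebalance.
// IDs, topic, and timeouts are illustrative.
public class StaticMembershipSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "signal-aggregator");
        // Stable per-instance identity, e.g. derived from the host name.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "signal-aggregator-host-17");
        // The group coordinator waits this long for a static member to return
        // before reassigning its partitions.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "120000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user_events"));
            // poll loop omitted
        }
    }
}
```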

Upstream Kafka contributions

Pinterest's Logging Platform team has contributed several KIPs and patches to the Apache Kafka project: KIP-91 (the delivery.timeout.ms producer configuration), KIP-245 (passing Properties objects to the KafkaStreams constructor), KIP-276 and KIP-300 (precursors to KIP-345), KAFKA-6896 (producer and consumer metrics in Kafka Streams), and KAFKA-7023 / KAFKA-7103 (RocksDB bulk loading optimizations for Kafka Streams state stores). Vahid Hashemian from the Logging Platform team is an Apache Kafka Committer and PMC Member.

Operating Kafka at scale

Automated cluster healing: DoctorKafka (2017-2020)

The first automation Pinterest built was DoctorKafka (open-sourced in August 2017 at pinterest/DoctorK), which monitored broker metrics via a dedicated Kafka topic, detected broker failures, and automatically reassigned workload. Broker failures were handled almost daily at Pinterest's scale; DoctorKafka reduced failure-related alerts by over 95%. To prevent cascading replica loss when multiple brokers failed simultaneously (given a replication factor of 3), DoctorKafka rate-limited broker replacements to one per time period. The system also analyzed 24-48 hours of historical broker statistics and treated network bandwidth as the primary resource metric for rebalancing decisions.

Orion (2020 onward)

Orion replaced DoctorKafka and CMAK as Pinterest's primary cluster management interface. It provides automated broker replacement, rolling upgrades, configuration updates, and cluster provisioning. Topic creation goes through a managed wizard with an approval workflow inside Orion, and lineage is tracked end-to-end from producer to consumer. Orion is open-sourced at pinterest/orion.

Client configuration standardization via PSC

Before PSC, applications maintained direct Kafka client dependencies with hardcoded broker hostnames and SSL passwords scattered across configurations. This created outage exposure during maintenance events. PSC introduced URI-based service discovery, standardized client configurations delivered by the platform team, and automated error remediation for common remediable exceptions. The result was over an 80% reduction in Flink application restarts caused by remediable client exceptions, and approximately 275 fewer FTE hours per year in keep-the-lights-on work.

Kafka version management

As of 2018, Pinterest maintained a monthly cadence for tracking open-source Kafka release branches. By 2020, the fleet ran on version 2.3.1 with cherry-picked fixes. A key operational improvement was standardizing the inter-broker protocol version and log message format version across all clusters, which eliminated the overhead of unnecessary in-flight message format conversions during rolling upgrades.

Broker configuration

Broker heap size is set to 8 GB, increased from 4 GB after TLS was enabled (each SSL KafkaChannel consumes approximately 122 KB of memory, which adds up across thousands of connections). Pinterest evaluated EBS st1 volumes as a storage alternative to local disks but found d2 local storage superior for their access patterns; later, they migrated from magnetic to SSD-backed instances after identifying I/O wait as the bottleneck during broker replacement.

Upgrade validation

Before rolling out a Kafka version upgrade, Pinterest constructs a parallel test pipeline that mirrors the topology of the target production pipeline. Double publishing sends data through both old and new infrastructure simultaneously, and the team compares throughput, latency, and error metrics before promoting the new version to production.

Challenges and how they solved them

Daily broker failures at cloud scale (2017)

At 1,000+ brokers, hardware failures were routine. Partition reassignment alone was not fast enough to handle degraded brokers, and manual intervention did not scale. The solution was DoctorKafka: automated detection of broker degradation via metrics published to a Kafka topic, followed by automatic workload reassignment. Rate-limiting replacements was critical: without it, concurrent replacement of multiple brokers could take every replica of a partition offline at once (with a replication factor of 3), causing data loss. The result was over a 95% reduction in failure-related operational alerts.

Cross-AZ data transfer costs (2019)

Default Kafka producer and consumer behavior assigns clients to any available partition leader regardless of availability zone. At Pinterest's scale, this generated substantial cross-AZ network transfer costs. The team built a custom partitioner using the EC2 Metadata API and Kafka rack awareness to route producers and consumers to partition leaders in their own AZ. The custom S3 transporter partitioner applied the same logic for Secor writes. Result: 25% reduction in cross-AZ transfer costs.

MirrorMaker CPU spikes during cross-region replication (2020)

Standard MirrorMaker v1 deserializes and recompresses every RecordBatch during cross-region replication. At Pinterest's throughput, this caused sustained CPU spikes and out-of-memory errors during peak traffic. The Shallow Mirror implementation bypassed serialization by treating RecordBatches as opaque byte buffers, requiring custom handling for BaseOffset modification and the batch-cropping issue introduced by MirrorMaker's fetch response handling. The result was over an 80% reduction in MirrorMaker CPU load.

Imbalanced partition leadership causing broker overload (2020)

As the cluster fleet grew, partition leadership distribution became uneven, overloading individual brokers. Naive leader swap approaches could move leaders without materially improving the overall balance. Pinterest modeled the rebalancing problem using graph algorithms: BFS for single-step swaps, and maximum-flow networks for general multi-step rebalancing. Both approaches operate only on leader assignments, not on partition data, keeping the operation lightweight.

Magnetic disk I/O bottleneck during broker replacement (pre-2020)

On d2.2xlarge instances with magnetic storage, brokers undergoing replacement showed 10-30% CPU I/O wait during the recovery process. Migrating to SSD-backed instances reduced I/O wait to under 0.1% at the p100 level.

Client misconfiguration and tight coupling to broker endpoints (pre-2022)

Before PSC, applications embedded Kafka broker hostnames, SSL passwords, and client configurations directly. Maintenance events that changed broker endpoints required coordinated client restarts. PSC addressed this by introducing URI-based topic addressing with server-side service discovery, platform-managed client configurations, and automated remediation for known failure modes.

Kafka unsuitable for ML training data transport (2018-2020)

Kafka's strict ordering guarantees, partition rigidity, rebalancing overhead, and replication costs made it an expensive fit for high-volume ML training data pipelines, where sub-second latency is not required. The solution was MemQ: a cloud-native PubSub system using S3 micro-batching that achieves up to 90% cost reduction versus an equivalent Kafka deployment. MemQ handles latency-tolerant ML workloads while Kafka continues to handle all use cases that require strict ordering and low latency. MemQ uses Kafka as its internal notification queue.

Legacy batch database ingestion taking 24+ hours (pre-2026)

The legacy full-table batch approach reprocessed over 95% of unchanged data on every cycle, creating a 24-hour data latency and unnecessary compute cost. The next-generation CDC pipeline (Debezium/TiCDC → Kafka → Flink → Apache Iceberg) processes only the roughly 5% of records that change in a given day. CDC events land in Kafka in under one second, and base tables on S3 refresh every 15 minutes to 1 hour. Bucket partitioning by primary key hash, combined with bucket joins in downstream Spark queries, delivered a 40%+ reduction in compute cost.
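
The periodic upsert step can be pictured as an Iceberg MERGE INTO run by Spark. The table and column names below are invented for illustration (the op = 'd' delete marker follows Debezium's convention); the real pipeline's schemas and job orchestration differ.

```java
import org.apache.spark.sql.SparkSession;

// Sketch of the periodic upsert: a Spark job merges the latest CDC changes
// into an Iceberg base table. All identifiers are hypothetical.
public class CdcMergeSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cdc-merge-sketch")
                .getOrCreate();

        // Only changed rows (a few percent of the table per day) are touched.
        spark.sql(
            "MERGE INTO warehouse.db.base_table t " +
            "USING warehouse.db.cdc_changes s " +
            "ON t.id = s.id " +
            "WHEN MATCHED AND s.op = 'd' THEN DELETE " +
            "WHEN MATCHED THEN UPDATE SET * " +
            "WHEN NOT MATCHED AND s.op <> 'd' THEN INSERT *");

        spark.stop();
    }
}
```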

Full tech stack

Category | Technology | Role
Message broker | Apache Kafka | Central message bus and event streaming platform for all latency-sensitive data transport
Message broker | MemQ (pinterest/memq) | Cloud-native PubSub system for latency-tolerant ML training data transport; uses Kafka internally as its notification queue; up to 90% cheaper than equivalent Kafka
Log shipping | Singer (pinterest/singer) | Open-source logging agent running on tens of thousands of hosts; ships 1 trillion+ messages/day to Kafka with at-least-once delivery and sub-5ms latency
Stream processing | Apache Flink | Stateful stream processing for real-time experiment analytics, visual signal pipelines (Kappa architecture), and CDC database ingestion
Stream processing | Apache Spark Streaming | Stream processing consumer of Kafka topics; also handles periodic Merge Into operations on Iceberg tables in the CDC pipeline
Stream processing | Kafka Streams | In-process stream processing for monetisation and metrics use cases
State backend | RocksDB | State backend for Kafka Streams jobs; Pinterest contributed bulk loading optimizations (KAFKA-7023, KAFKA-7103)
Cross-region replication | MirrorMaker v1 + Shallow Mirror | Cross-region Kafka replication; Pinterest's Shallow Mirror extension bypasses decompression for 80%+ CPU reduction
Cluster management | Orion (pinterest/orion) | Open-source management and automation platform; broker replacement, rolling upgrades, topic governance, lineage tracking
Client abstraction | PSC (PubSub Client) | Unified Java client library abstracting Kafka and MemQ behind Topic URI-based service discovery; 100% Flink adoption, 90%+ Java app adoption
CDC connectors | Debezium | MySQL CDC source connector writing change events to Kafka with sub-one-second latency
CDC connectors | TiCDC | TiDB CDC source connector writing change events to Kafka
CDC connectors | Maxwell | Earlier MySQL CDC connector used for ad system database changelog capture
Table format | Apache Iceberg | Table format for CDC and base tables on S3 in the next-generation ingestion pipeline; uses Merge-on-Read strategy
Log persistence | Secor | Pinterest's Kafka-to-S3 log persistence service
Object storage | Amazon S3 | Remote storage for tiered storage segments, Iceberg tables, Flink checkpoints (CSV snapshots), and MemQ message storage
Checkpoint storage | HDFS | Flink checkpoint storage for experiment analytics jobs
Coordination | ZooKeeper | Kafka coordination; also used by the tiered storage sidecar for partition leadership detection
Compute | Amazon EC2 | Kafka broker hosting (d2.2xlarge default, d2.8xlarge for high-fanout, later SSD instances); Flink worker nodes (c5d.9xlarge for experiment analytics)
Serialization | Apache Thrift / Protocol Buffers | Event serialization formats for Singer and Kafka producers
Client libraries | librdkafka, kafka-python, confluent-kafka-python | Kafka clients for non-Java (C++) and Python services
Source databases | MySQL, TiDB, KVStore | CDC source databases feeding the next-generation ingestion pipeline via Debezium and TiCDC

Key contributors

Name | Title / team | Contribution
Ambud Sharma | Tech Lead and Engineering Manager, Logging Platform | Led MemQ design and open-source release; co-authored AZ-aware partitioning and Singer open-source posts; overall Logging Platform leadership
Vahid Hashemian | Staff Software Engineer, Logging Platform (Apache Kafka Committer and PMC Member) | Co-authored "How Pinterest runs Kafka at scale" (2018) and the Confluent blog post (2021); co-authored PSC articles (2022, 2023); upstream Kafka contributions including KIP-91 and KIP-245
Henry Cai | Software Engineer, Data Engineering | Co-authored "How Pinterest runs Kafka at scale" (2018); lead author of the Shallow Mirror article (2021); co-authored AZ-aware partitioning and Singer open-source posts
Jeff Xiang | Senior Software Engineer, Logging Platform | Lead author of the tiered storage article (2024); co-authored PSC articles (2022, 2023)
Ping-Min Lin | Software Engineer, Logging Platform | Authored graph algorithms for Kafka operations (Parts 1 and 2, 2020); MemQ performance optimization; co-authored AZ-aware partitioning post
Liquan Pei | Software Engineer | Co-authored "How Pinterest runs Kafka at scale" (2018); co-presented KIP-345 static membership at Kafka Summit SF 2019
Yu Yang | Data Engineering | Authored DoctorKafka open-source announcement (2017)
Ankit Patel | Software Engineer, Content Acquisition and Media Platform | Authored visual signals Lambda-to-Kappa architecture article (2020)
Parag Kesar, Ben Liu | Software Engineers, Data Engineering | Co-authored real-time experiment analytics with Flink article (2019)
Liang Mou, Yisheng Zhou, Elizabeth Nguyen, Owen Zhang | Logging Platform | Co-authored next-generation DB ingestion article (2026)

Key takeaways for your own Kafka implementation

Design for cluster isolation from the start. Pinterest's brokerset concept — virtual clusters with stride-based partition assignment within a physical cluster — gives you topic-level failure isolation without the operational overhead of provisioning a separate physical cluster for every workload. The key is capping cluster size (Pinterest uses 200 brokers) and using the brokerset as the unit of risk management.

AZ-aware routing pays for itself at scale. Default Kafka clients are AZ-agnostic, and at a large enough fleet the cross-AZ data transfer costs are non-trivial. Pinterest built a custom partitioner using EC2 Metadata and Kafka rack awareness that routes producers and consumers to local partition leaders, achieving a 25% reduction in transfer costs. This is a relatively low-effort change for the savings it produces in cloud environments.

Replication overhead is often decompression, not network. Pinterest found that MirrorMaker's CPU cost was dominated by decompression and recompression, not by network I/O. The Shallow Mirror approach — treating RecordBatches as opaque byte buffers during replication — reduced MirrorMaker CPU load by over 80%. Before investing in replication hardware, it is worth profiling whether the bottleneck is actually the wire or the processing.

Broker-side tiered storage is not the only option. Pinterest deliberately chose a sidecar architecture for tiered storage rather than waiting for KIP-405's in-broker implementation. The sidecar Segment Uploader is operationally independent of the broker process, which simplifies upgrades and means a broker restart does not interrupt the upload pipeline. If tiered storage is a priority, evaluate whether broker-coupled or broker-decoupled implementations better fit your upgrade cadence and failure domains.

A unified client abstraction reduces operational debt significantly. PSC gave Pinterest's platform team a single control point for client configurations, service discovery, and automated error remediation across both Kafka and MemQ. The result was over 80% fewer Flink restarts from client errors and an estimated 275 fewer FTE hours per year in KTLO work. A thin abstraction layer that separates application code from broker topology is worth the investment when you operate at more than a handful of clusters.

Sources and further reading

  1. How Pinterest runs Kafka at scale - Henry Cai, Shawn Nguyen, Yi Yin, Liquan Pei et al., Pinterest Engineering Blog, 2018
  2. Real-time analytics at Pinterest - Krishna Gade, Pinterest Engineering Blog, 2015
  3. Running Kafka at scale at Pinterest - Eric Lopez, Heng Zhang, Henry Cai, Jeff Xiang et al., Confluent Engineering Blog, 2021
  4. Pinterest tiered storage for Apache Kafka: a broker-decoupled approach - Jeff Xiang, Vahid Hashemian, Pinterest Engineering Blog, 2024
  5. Open-sourcing DoctorKafka: Kafka cluster healing and workload balancing - Yu Yang, Pinterest Engineering Blog, 2017
  6. Using graph algorithms to optimize Kafka operations, Part 1 - Ping-Min Lin, Pinterest Engineering Blog, 2020
  7. Shallow Mirror - Henry Cai, Pinterest Engineering Blog, 2021
  8. Fighting spam with Guardian: a real-time analytics and rules engine - Hongkai Pan, Pinterest Engineering Blog, 2021
  9. Pinterest visual signals infrastructure: evolution from Lambda to Kappa architecture - Ankit Patel, Pinterest Engineering Blog, 2020
  10. Real-time experiment analytics at Pinterest using Apache Flink - Parag Kesar, Ben Liu, Pinterest Engineering Blog, 2019
  11. MemQ: an efficient, scalable cloud-native PubSub system - Ambud Sharma, Pinterest Engineering Blog, 2021
  12. Unified PubSub Client at Pinterest - Vahid Hashemian, Jeff Xiang, Pinterest Engineering Blog, 2022
  13. Open-sourcing Singer: Pinterest's performant and reliable logging agent - Ambud Sharma, Indy Prentice, Henry Cai, Shawn Nguyen et al., Pinterest Engineering Blog, 2019
  14. pinterest/orion on GitHub - Pinterest Engineering
  15. Next-generation DB ingestion at Pinterest - Liang Mou, Yisheng Zhou, Elizabeth Nguyen, Owen Zhang, Pinterest Engineering Blog, 2026
  16. Running Unified PubSub Client in production at Pinterest - Jeff Xiang, Vahid Hashemian, Jesus Zuniga, Pinterest Engineering Blog, 2023
  17. Optimizing Kafka for the cloud - Eric Lopez, Henry Cai, Heng Zhang, Ping-Min Lin et al., Pinterest Engineering Blog, 2019
  18. Static membership rebalance strategy designed for the cloud (Kafka Summit SF 2019) - Liquan Pei (Pinterest), Boyang Chen (Confluent), Kafka Summit SF 2019

If you are managing a Kafka deployment at scale, Kpow gives you a single interface for monitoring consumer lag, inspecting topics, and managing cluster operations across multiple clusters. A free 30-day trial is available and connects to any Kafka cluster in minutes via Docker, Helm, or JAR.