
How PayPal uses Apache Kafka in production
PayPal's Apache Kafka deployment sits at a scale that most engineering teams will only read about: more than 85 clusters, over 1,500 brokers, and 1.3 trillion messages per day during the 2022 Black Friday peak, with traffic reaching roughly 21 million messages per second. Kafka underpins nearly every data pipeline at PayPal, from real-time fraud scoring to clickstream ingestion to database synchronisation, and the engineering challenges involved in running it reliably at that volume have shaped both internal tooling and the open-source project itself.
Company overview
PayPal is a global payments platform operating in more than 200 markets. At the time of writing, it processes billions of transactions per year across consumer, merchant, and financial-services products including PayPal, Venmo, Braintree, and Xoom.
The company adopted Kafka in 2015, starting with a handful of isolated clusters for specific pipelines. Over the following years, payment volume and product surface area grew steadily, and Kafka evolved from a collection of point-to-point pipelines into a centralised event-streaming fabric used across dozens of teams and use cases.
Key Kafka milestones:
- 2015: Kafka introduced as isolated clusters for specific use cases
- 2016: Lambda architecture documented, combining Kafka with Spark Streaming for near-real-time analytics
- 2018: 400 billion messages per day across 40+ clusters in three geographically distributed data centres
- 2020: Approaching 1 trillion messages per day; multi-tenant fleet management formalised
- 2021: Zero-downtime migration of 20+ clusters (1,000+ broker and ZooKeeper nodes, 60+ MirrorMaker groups) across data centres
- 2022: 1.3 trillion messages per day at the Black Friday peak (21 million messages per second)
- 2023: Fleet documented at 85+ clusters, 1,500+ brokers, 20,000+ topics, approximately 2,000 MirrorMaker nodes, 99.99% availability
PayPal's Kafka use cases
PayPal uses Kafka across seven primary use cases, each processing more than 100 billion messages per day.
First-party user behaviour tracking is one of the busiest pipelines. All clickstream and interaction events from PayPal's web and mobile surfaces are ingested via Kafka and enriched before being distributed to downstream analytics and personalisation systems. At the time of its reactive architecture rewrite in 2018, this pipeline alone was processing 18.8 billion messages per day, up from roughly 9 billion beforehand.
Application health metrics are streamed through Kafka from brokers, ZooKeeper nodes, MirrorMakers, and Kafka clients themselves. A custom metrics library registers telemetry via Micrometer and forwards it to the SignalFx observability backend.
Database synchronisation uses Kafka as a change-data-capture transport, streaming operational database changes to downstream consumers for near-real-time replication across services.
Application log aggregation consolidates logs from across PayPal's service estate via Kafka, feeding log-analysis and compliance tooling.
Risk detection and management depends on Kafka to move payment events to fraud-scoring and anomaly-detection models with low latency. Spark Streaming jobs consume directly from Kafka topics to support real-time risk decisions at transaction time.
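PayPal has not published its Spark job code, but the shape of a Kafka-fed risk pipeline is easy to sketch. The snippet below is a minimal illustration using Spark Structured Streaming's Kafka source; the topic name, bootstrap servers, and console sink are placeholders, and the actual jobs may well use the older DStream API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class RiskEventStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("risk-event-stream")
                .getOrCreate();

        // Subscribe to a hypothetical payment-events topic; in PayPal's setup the
        // bootstrap servers would come from the Config Service rather than being hard-coded.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                .option("subscribe", "payment-events")
                .option("startingOffsets", "latest")
                .load();

        // Kafka records arrive as key/value byte arrays; cast the value to a string before scoring.
        Dataset<Row> payloads = events.selectExpr("CAST(value AS STRING) AS json");

        // Placeholder sink: a real job would feed a fraud-scoring model instead of the console.
        StreamingQuery query = payloads.writeStream()
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```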
Analytics and compliance pipelines use Kafka as the backbone for several analytical systems, including a consumer application that ingests 30 to 35 billion daily events from Kafka and publishes them to BigQuery on Google Cloud. Before this pipeline existed, the reporting latency for the underlying data was 12 hours; after the Kafka-based rewrite, it dropped to seconds.
Batch processing supplements the real-time pipelines. Event streams also feed macro-batch Spark jobs used for end-of-day reconciliation and reporting.
Scale and throughput
The growth trajectory is as informative as the peak numbers. PayPal started with isolated clusters in 2015, reached 400 billion messages per day by 2018, crossed 1 trillion by 2021, and then handled a 30% increase on top of that during the 2022 Black Friday peak. Quarter-on-quarter traffic growth of around 30% was noted by Maulin Vasavada at Kafka Summit 2020, which helps explain why PayPal's engineering approach has consistently prioritised operational automation over manual intervention.
PayPal's Kafka architecture
Deployment model
PayPal runs Kafka brokers on bare metal for I/O performance reasons. ZooKeeper and MirrorMaker instances run on virtual machines. The production Kafka fleet spans multiple geographically distributed data centres, with clusters segmented by security zone within each data centre. Cluster placement is governed by data classification requirements and business criticality rather than a single flat topology.
The QA environment is hosted on Google Cloud Platform, with brokers spread across multiple GCP availability zones. The GCP QA environment maps directly to production cluster topology, follows the same security standards, and costs 75% less than the previous on-premises QA setup while delivering 40% better performance.
Kafka Config Service
One of the more consequential internal tools at PayPal is the Kafka Config Service: a highly available, stateless service that pushes bootstrap server addresses and standardised client configurations to applications. Rather than hard-coding broker IPs in application configs, each application queries the Config Service at startup.
This decouples application deployments from broker topology changes. When clusters are rebalanced, scaled, or reorganised, the Config Service absorbs the change and applications continue connecting without redeployment. It also reduces the operational burden on the SRE team, since broker address updates do not require coordinating application-side config changes across hundreds of teams.
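The Config Service API is internal and not publicly documented, but the pattern it enables is straightforward to sketch. The example below assumes a hypothetical HTTP endpoint that returns standard Java properties (including bootstrap.servers) for a given topic, which the application merges into its client configuration at startup.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConfigServiceBootstrap {

    // Hypothetical endpoint: returns properties text, including bootstrap.servers
    // for whichever cluster hosts the requested topic.
    private static final String CONFIG_SERVICE_URL =
            "https://kafka-config.internal.example/v1/clients/%s/config";

    public static KafkaProducer<String, String> producerFor(String topic) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(String.format(CONFIG_SERVICE_URL, topic))).GET().build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Standardised client settings come from the Config Service, so the
        // application never hard-codes broker addresses.
        Properties props = new Properties();
        props.load(new StringReader(body));
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```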
Security and access control
SASL-based authentication is enforced across all clusters. Applications must authenticate and declare producer or consumer intent before connecting. Each topic onboarding request generates a unique authentication token scoped to that topic. The plaintext port that existed in the earlier deployment has been removed.
This replaced a configuration where applications could connect without identifying themselves. The shift to ACLs gave the Kafka team visibility into exactly which producers and consumers were using each topic, which both improved security posture and simplified capacity planning.
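PayPal confirms SASL authentication and topic-scoped tokens but does not publish the mechanism or token format. Purely for illustration, the sketch below shows a client configured for SASL over TLS using the PLAIN login module, with the topic token standing in as the credential; the real deployment may use a different mechanism entirely.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;

public class SaslClientSettings {

    // The mechanism and credential format here are assumptions for illustration only;
    // PayPal's write-up says SASL is enforced but does not publish the details.
    public static Properties withSasl(Properties props, String appId, String topicToken) {
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"" + appId + "\" password=\"" + topicToken + "\";");
        return props;
    }
}
```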
Topic onboarding
Application teams submit onboarding requests through an internal Onboarding Dashboard. A capacity analysis tool integrated into the workflow evaluates cluster placement before the topic is provisioned. MirrorMaker groups for any cross-zone replication requirements are set up as part of the same workflow.
Cross-cluster replication with MirrorMaker
PayPal uses approximately 2,000 MirrorMaker nodes to replicate events between clusters. Replication serves two purposes: disaster recovery (maintaining a mirror in a secondary data centre) and inter-security-zone communication (moving data from a high-classification zone to a lower one where consuming services operate).
In 2021, Lei Huang and Na Yang presented the data centre migration at Kafka Summit Americas. The team migrated 20+ clusters (1,000+ broker and ZooKeeper nodes, 60+ MirrorMaker groups) across data centres with zero service outage, zero message loss, and no application-side changes required. MirrorMaker groups provided the traffic cut-over path, with traffic gradually shifting from source clusters to destination clusters through the MirrorMaker pipeline.
Producer architecture
The BigQuery pipeline producer uses a batch size of 250,000 bytes and a linger time of 5 seconds to optimise throughput over latency. For the user behaviour tracking pipeline, a Chronicle Queue (memory-mapped file store) sits between the HTTP ingestion layer and the Kafka producer. When a broker restart or partition rebalance leaves the producer temporarily unable to send, events buffer to disk rather than being dropped. Once the cluster is available again, the Chronicle Queue drains into the Kafka producer in order.
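The batch size and linger values translate directly into standard producer configuration. In the sketch below, only batch.size and linger.ms come from the article; the bootstrap servers, serializers, and topic name are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // from the Config Service in practice
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Values reported for the BigQuery pipeline producer: favour larger batches
        // over per-record latency.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 250_000); // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5_000);    // wait up to 5 s to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("analytics-events", "key", "value")); // hypothetical topic
        }
    }
}
```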
Consumer architecture
The BigQuery consumer application runs on 75 production machines (2-core CPU, 8 GB RAM each) and is built on Project Reactor (reactive streams). Key configuration values: max.poll.records=1000, fetch.min.bytes of approximately 100 KB, and max.partition.fetch.bytes of approximately 1 MB. The application manages 300 partitions in production. GC was switched from Concurrent Mark Sweep to G1GC during benchmarking, which improved stability under high throughput. Each instance achieved sustained throughput of approximately 950,000 events per minute.
Session timeout is set to 60 seconds, with a heartbeat interval of 20 seconds.
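Expressed as standard consumer configuration, the reported values look like the sketch below. The group id, topic, and bootstrap servers are placeholders; the polling, fetch, session, and heartbeat settings are the ones cited above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BigQueryPipelineConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // from the Config Service in practice
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bigquery-pipeline");     // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Values reported in the consumer benchmarking write-up.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 100 * 1024);            // ~100 KB
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024); // ~1 MB
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60_000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 20_000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("analytics-events")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.value()));
        }
    }
}
```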
Stream processing
Spark Streaming jobs consume Kafka topics directly for risk detection and metrics aggregation pipelines. The user behaviour tracking pipeline uses Akka Streams (via the Squbs reactive framework) for in-process stream processing and fan-in from multiple HTTP sources. The MergeHub Akka primitive is used to dynamically merge HTTP connection streams into the downstream enrichment flow, so each new HTTP connection becomes its own sub-stream without reconfiguring the pipeline.
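A compact way to see the MergeHub pattern is the Akka Streams Java DSL sketch below. The enrichment step and console sink are stand-ins for PayPal's actual enrichment flow and Kafka producer stage; the point is that the hub materialises a single Sink that any number of per-connection sub-streams can attach to at runtime.

```java
import java.util.List;

import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.MergeHub;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class TrackingFanIn {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("tracking");

        // Stand-in enrichment step; the real pipeline enriches clickstream events
        // before handing them to the Kafka producer stage.
        Flow<String, String, NotUsed> enrich = Flow.of(String.class).map(e -> e + " [enriched]");

        // MergeHub materialises a Sink that any number of sub-streams can attach to;
        // everything they emit is merged into the single downstream flow.
        Sink<String, NotUsed> fanIn =
                MergeHub.of(String.class)
                        .via(enrich)
                        .to(Sink.foreach(System.out::println)) // stand-in for the Kafka producer sink
                        .run(system);

        // Each new HTTP connection would materialise its own source and attach it to the hub.
        Source.from(List.of("click-1", "click-2")).runWith(fanIn, system);
    }
}
```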
Client library ecosystem
PayPal maintains three internal client libraries:
- A connectivity library for resilient connection to the Config Service and brokers
- A monitoring library that registers Kafka client metrics via Micrometer and forwards them to SignalFx
- A security library that manages SSL certificate lifecycle
Client libraries are available for Java, Python, Go, Scala, and Node.js, as well as for applications built on the internal Squbs framework.
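The monitoring library itself is internal, but the open-source building blocks it relies on (Micrometer's Kafka client binder and the SignalFx registry) are publicly available. A rough sketch of the wiring, with the access token and registry defaults as assumptions, might look like this:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaClientMetrics;
import io.micrometer.signalfx.SignalFxConfig;
import io.micrometer.signalfx.SignalFxMeterRegistry;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClientMetricsSetup {

    public static MeterRegistry signalFxRegistry(String accessToken) {
        SignalFxConfig config = new SignalFxConfig() {
            @Override public String accessToken() { return accessToken; }
            @Override public String get(String key) { return null; } // use defaults for everything else
        };
        return new SignalFxMeterRegistry(config, Clock.SYSTEM);
    }

    // Binds the consumer's internal Kafka metrics (fetch rates, lag, etc.) to the registry,
    // which then ships them to SignalFx on its publishing interval.
    public static void instrument(KafkaConsumer<?, ?> consumer, MeterRegistry registry) {
        new KafkaClientMetrics(consumer).bindTo(registry);
    }
}
```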
Special techniques and engineering innovations
Kafka Config Service is the most operationally significant internal tool. It removes the dependency between broker topology and application configuration, enabling infrastructure changes without application redeployment. At a scale of 85+ clusters, the operational leverage this provides is substantial.
Chronicle Queue buffering addresses a specific failure mode in the user behaviour tracking pipeline. Broker restarts and partition rebalances can leave the producer briefly unable to send. Without a durable buffer, events arriving during that window are either queued in memory (at risk of loss on restart) or dropped. Chronicle Queue writes to memory-mapped files on disk, providing durability at near-memory speeds. This pattern is not standard Kafka tooling; PayPal built it to meet the data loss requirements of a pipeline processing tens of billions of events per day.
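PayPal has not published the buffering code, but the open-source Chronicle Queue API makes the pattern easy to sketch. In the illustration below, the queue path, topic name, and drain loop are assumptions; error handling and redelivery are omitted for brevity.

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerBuffer {

    private final ChronicleQueue queue =
            ChronicleQueue.singleBuilder("/var/tmp/tracking-buffer").build(); // illustrative path
    private final ExcerptAppender appender = queue.acquireAppender();
    private final ExcerptTailer tailer = queue.createTailer("kafka-drain");
    private final KafkaProducer<String, String> producer;

    public DurableProducerBuffer(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Events are always written to the memory-mapped queue first, so nothing is
    // lost if Kafka is briefly unreachable or the process restarts.
    public void accept(String event) {
        appender.writeText(event);
    }

    // A background thread drains the queue into the Kafka producer in arrival order;
    // a production version would track send results before advancing past an event.
    public void drainOnce() {
        String event;
        while ((event = tailer.readText()) != null) {
            producer.send(new ProducerRecord<>("user-behavior", event)); // hypothetical topic
        }
    }
}
```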
Smart patching plugin checks under-replicated partition (URP) counts before initiating any broker patch. A patch proceeds only when the affected broker's partitions are fully replicated. This gating condition enables patching multiple clusters in parallel, with single-broker restarts proceeding in sequence within each cluster. Before this plugin, patching at 1,500+ broker scale stretched maintenance windows to multiple days.
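PayPal's plugin is internal and reads broker metrics directly, but an equivalent gate can be approximated from the outside with the standard Admin client (the allTopicNames call below requires Kafka 3.1+). The sketch checks that every partition's in-sync replica set matches its full replica set before allowing a patch to proceed.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PatchGate {

    // Returns true only if every partition on the cluster has its full replica set in sync,
    // i.e. the under-replicated partition (URP) count is zero.
    public static boolean safeToPatch(String bootstrapServers)
            throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(props)) {
            Collection<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).allTopicNames().get();
            return descriptions.values().stream()
                    .flatMap(d -> d.partitions().stream())
                    .allMatch(p -> p.isr().size() == p.replicas().size());
        }
    }
}
```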
Targeted partition reassignment modifies the default Kafka reassignment behaviour. The default algorithm reassigns all partitions on a broker, which at PayPal's partition counts produces very long rebalancing windows. PayPal's modification restricts reassignment to only the under-replicated partitions on the affected broker, which dramatically shortens the rebalancing duration.
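The selection step of that modification can be approximated with the same Admin client calls as above: gather only the under-replicated partitions hosted on the affected broker, rather than everything the broker holds, and hand just that set to the reassignment tooling (for example via Admin#alterPartitionReassignments). The sketch below is an outside illustration, not PayPal's implementation.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class TargetedReassignment {

    // Collect only the partitions on the given broker that are currently under-replicated,
    // rather than every partition the broker hosts.
    public static List<TopicPartition> underReplicatedOn(String bootstrapServers, int brokerId)
            throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(props)) {
            Collection<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).allTopicNames().get();
            return descriptions.values().stream()
                    .flatMap(d -> d.partitions().stream()
                            .filter(p -> p.isr().size() < p.replicas().size())
                            .filter(p -> p.replicas().stream().anyMatch(n -> n.id() == brokerId))
                            .map(p -> new TopicPartition(d.name(), p.partition())))
                    .collect(Collectors.toList());
        }
    }
}
```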
Upstream open-source contributions reflect the depth of operational experience the team has developed. PayPal authored three Kafka Improvement Proposals:
- KIP-351: adds a --under-min-isr flag to kafka-topics.sh for identifying partitions below minimum in-sync replicas
- KIP-427: adds an AtMinIsr partition category metric to the broker
- KIP-517: adds consumer polling behaviour metrics for observing max.poll.interval.ms compliance
All three address monitoring gaps that PayPal encountered running Kafka at scale before these metrics existed natively.
Operating Kafka at scale
Deployment model: Self-managed, on bare metal (brokers) and VMs (ZooKeeper, MirrorMaker), with QA on Google Cloud Platform.
Monitoring and alerting: The Kafka Metrics library collects metrics from brokers, ZooKeeper nodes, MirrorMakers, and Kafka clients. Metrics are registered via Micrometer and forwarded to SignalFx. Alert thresholds are tuned to fire on actionable conditions rather than informational noise. The team made deliberate decisions about which metrics to surface, discarding many default metrics that did not correlate reliably with user-facing issues.
Capacity management: Infrastructure is scaled ahead of peak periods. The topic onboarding workflow includes a capacity analysis step so that new topics are placed on clusters with headroom. Black Friday planning involves expanding broker capacity ahead of the expected traffic surge, which reached 21 million messages per second at the 2022 peak.
Developer experience: The Onboarding Dashboard and Config Service form the main interface between application teams and the Kafka platform. Application teams do not need to know which cluster a topic lives on or manage their own broker bootstrap configurations; the platform handles that. This abstraction has enabled broad adoption across teams without requiring each team to develop deep Kafka operational knowledge.
Upgrade and migration strategy: The 2021 data centre migration demonstrated PayPal's approach to large-scale infrastructure changes: use MirrorMaker pipelines to shift traffic gradually, maintain application transparency throughout, and validate at each step before cutting over. The same principles apply to broker version upgrades, where the URP-checking plugin provides the gate condition for rolling restarts.
Key takeaways for your own Kafka implementation
- Decouple application config from broker topology. PayPal's Kafka Config Service, which pushes bootstrap addresses and client config to applications at runtime, is one of the most operationally significant tools in their stack. At any meaningful scale, hard-coded broker IPs become a maintenance liability. A centralised config distribution layer makes broker changes transparent to applications.
- Add a durable buffer ahead of your producer if message loss during broker unavailability is unacceptable. Partition rebalances, leader elections, and broker restarts are not exceptional events; they happen routinely as deployments roll and clusters scale. If your producer pipeline has no durable buffer, events arriving during those windows are at risk. Chronicle Queue (or an equivalent mechanism) can absorb the gap with minimal latency overhead.
- Gate your patching and reassignment operations on under-replicated partition counts. PayPal's URP-checking plugin and targeted reassignment modification both reflect the same principle: do not proceed with infrastructure changes while replication is degraded. Building this check into your automation, rather than relying on manual observation, is what enables parallel patching at scale.
- Invest in your QA environment's fidelity early. PayPal's migration of QA to GCP delivered a 75% cost reduction and 40% performance improvement. A QA environment that accurately reflects production topology is the foundation for safe rollouts, and the cost of maintaining it does not have to scale linearly with production.
- Contribute upstream when you find gaps in observability. KIP-351, KIP-427, and KIP-517 all addressed monitoring blind spots that PayPal encountered before these metrics existed in Kafka natively. If you are running Kafka at scale and hitting the edges of what the built-in metrics tell you, the upstream KIP process is a viable path to a permanent fix.
Sources and further reading
Primary sources:
- Monish Koppa, "Scaling Kafka to Support PayPal's Data Growth," PayPal developer blog, September 2023 — https://developer.paypal.com/community/blog/scaling-kafka-to-support-paypals-data-growth/
- Monish Koppa, "Scaling Kafka to Support PayPal's Data Growth," The PayPal Technology Blog, Medium — https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab
- Maulin Vasavada, "Marching Toward a Trillion Kafka Messages per Day," Kafka Summit 2020, Confluent — https://www.confluent.io/resources/kafka-summit-2020/marching-toward-a-trillion-kafka-messages-per-day-running-kafka-at-scale-at-paypal/
- Lei Huang and Na Yang, "How Did We Move the Mountain? Migrating 1 Trillion+ Messages Per Day Across Data Centers at PayPal," Kafka Summit Americas 2021, Confluent — https://www.confluent.io/events/kafka-summit-americas-2021/how-did-we-move-the-mountain-migrating-1-trillion-messages-per-day-across/
- Kevin Lu, Maulin Vasavada, Na Yang, "Kafka at PayPal: Enabling 400 Billion Messages a Day," Strata Data Conference NY 2018 — https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/69459.html
- Archit Agarwal et al., "Scaling Kafka Consumer for Billions of Events," The PayPal Technology Blog, Medium — https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000
- Sakshi Ganeriwal, "Tracking User Behavior at Scale with Streaming Reactive Big Data Systems," The PayPal Technology Blog, Medium, September 2018 — https://medium.com/paypal-tech/https-medium-com-paypal-engineering-tracking-user-behavior-at-scale-f0c584c4ddd4
- Michael Zeltser, "From Big Data to Fast Data in Four Weeks — Part 2," The PayPal Technology Blog, Medium, November 2016 — https://medium.com/paypal-tech/from-big-data-to-fast-data-in-four-weeks-or-how-reactive-programming-is-changing-the-world-part-2-29a9f7d48318
Kafka Improvement Proposals authored by PayPal:
- KIP-351 — Add --under-min-isr option to describe topics command
- KIP-427 — Add AtMinIsr topic partition category (new metric & TopicCommand option)
- KIP-517 — Add consumer metrics to observe user poll behavior
If you are running Kafka in production and want deeper visibility into your brokers, consumer lag, and topic health, Kpow gives you a full-featured monitoring and management UI that connects to any Kafka cluster in minutes. You can try it free for 30 days.