
How PayPal uses Apache Kafka in production
PayPal's Apache Kafka deployment sits at a scale that most engineering teams will only read about: more than 85 clusters, over 1,500 brokers, and 1.3 trillion messages per day during the 2022 Black Friday peak, with traffic reaching roughly 21 million messages per second. Kafka underpins nearly every data pipeline at PayPal, from real-time fraud scoring to clickstream ingestion to database synchronisation, and the engineering challenges involved in running it reliably at that volume have shaped both internal tooling and the open-source project itself.
Company overview
PayPal is a global payments platform operating in more than 200 markets. At the time of writing, it processes billions of transactions per year across consumer, merchant, and financial-services products including PayPal, Venmo, Braintree, and Xoom.
The company adopted Kafka in 2015, starting with a handful of isolated clusters for specific pipelines. Over the following years, payment volume and product surface area grew steadily, and Kafka evolved from a collection of point-to-point pipelines into a centralised event-streaming fabric used across dozens of teams and use cases.
Key Kafka milestones:
- 2015: Kafka introduced as isolated clusters for specific use cases
- 2016: Lambda architecture documented, combining Kafka with Spark Streaming for near-real-time analytics
- 2018: 400 billion messages per day across 40+ clusters in three geographically distributed data centres
- 2020: Approaching 1 trillion messages per day; multi-tenant fleet management formalised
- 2021: Zero-downtime migration of 20+ clusters (1,000+ broker and ZooKeeper nodes, 60+ MirrorMaker groups) across data centres
- 2022: 1.3 trillion messages per day at the Black Friday peak (21 million messages per second)
- 2023: Fleet documented at 85+ clusters, 1,500+ brokers, 20,000+ topics, approximately 2,000 MirrorMaker nodes, 99.99% availability
PayPal's Kafka use cases
PayPal uses Kafka across seven primary use cases, each processing more than 100 billion messages per day.
First-party user behaviour tracking is one of the busiest pipelines. All clickstream and interaction events from PayPal's web and mobile surfaces are ingested via Kafka and enriched before being distributed to downstream analytics and personalisation systems. At the time of its reactive architecture rewrite in 2018, this pipeline alone was processing 18.8 billion messages per day, up from roughly 9 billion beforehand.
Application health metrics are streamed through Kafka from brokers, ZooKeeper nodes, MirrorMakers, and Kafka clients themselves. A custom metrics library registers telemetry via Micrometer and forwards it to the SignalFx observability backend.
Database synchronisation uses Kafka as a change-data-capture transport, streaming operational database changes to downstream consumers for near-real-time replication across services.
Application log aggregation consolidates logs from across PayPal's service estate via Kafka, feeding log-analysis and compliance tooling.
Risk detection and management depends on Kafka to move payment events to fraud-scoring and anomaly-detection models with low latency. Spark Streaming jobs consume directly from Kafka topics to support real-time risk decisions at transaction time.
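PayPal has not published its Spark job code, but the shape of a Kafka-fed risk pipeline is easy to sketch. The snippet below is a minimal illustration using Spark Structured Streaming's Kafka source; the topic name, bootstrap servers, and console sink are placeholders, and the actual jobs may well use the older DStream API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class RiskEventStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("risk-event-stream")
                .getOrCreate();

        // Subscribe to a hypothetical payment-events topic; in PayPal's setup the
        // bootstrap servers would come from the Config Service rather than being hard-coded.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                .option("subscribe", "payment-events")
                .option("startingOffsets", "latest")
                .load();

        // Kafka records arrive as key/value byte arrays; cast the value to a string before scoring.
        Dataset<Row> payloads = events.selectExpr("CAST(value AS STRING) AS json");

        // Placeholder sink: a real job would feed a fraud-scoring model instead of the console.
        StreamingQuery query = payloads.writeStream()
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```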
Analytics and compliance pipelines use Kafka as the backbone for several analytical systems, including a consumer application that ingests 30 to 35 billion daily events from Kafka and publishes them to BigQuery on Google Cloud. Before this pipeline existed, the reporting latency for the underlying data was 12 hours; after the Kafka-based rewrite, it dropped to seconds.
Batch processing supplements the real-time pipelines. Event streams also feed macro-batch Spark jobs used for end-of-day reconciliation and reporting.
Scale and throughput
The growth trajectory is as informative as the peak numbers. PayPal started with isolated clusters in 2015, reached 400 billion messages per day by 2018, crossed 1 trillion by 2021, and then handled a 30% increase on top of that during the 2022 Black Friday peak. Quarter-on-quarter traffic growth of around 30% was noted by Maulin Vasavada at Kafka Summit 2020, which helps explain why PayPal's engineering approach has consistently prioritised operational automation over manual intervention.
PayPal's Kafka architecture
Deployment model
PayPal runs Kafka brokers on bare metal for I/O performance reasons. ZooKeeper and MirrorMaker instances run on virtual machines. The production Kafka fleet spans multiple geographically distributed data centres, with clusters segmented by security zone within each data centre. Cluster placement is governed by data classification requirements and business criticality rather than a single flat topology.
The QA environment is hosted on Google Cloud Platform, with brokers spread across multiple GCP availability zones. The GCP QA environment maps directly to production cluster topology, follows the same security standards, and costs 75% less than the previous on-premises QA setup while delivering 40% better performance.
Kafka Config Service
One of the more consequential internal tools at PayPal is the Kafka Config Service: a highly available, stateless service that pushes bootstrap server addresses and standardised client configurations to applications. Rather than hard-coding broker IPs in application configs, each application queries the Config Service at startup.
This decouples application deployments from broker topology changes. When clusters are rebalanced, scaled, or reorganised, the Config Service absorbs the change and applications continue connecting without redeployment. It also reduces the operational burden on the SRE team, since broker address updates do not require coordinating application-side config changes across hundreds of teams.
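The Config Service API is internal and not publicly documented, but the pattern it enables is straightforward to sketch. The example below assumes a hypothetical HTTP endpoint that returns standard Java properties (including bootstrap.servers) for a given topic, which the application merges into its client configuration at startup.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConfigServiceBootstrap {

    // Hypothetical endpoint: returns properties text, including bootstrap.servers
    // for whichever cluster hosts the requested topic.
    private static final String CONFIG_SERVICE_URL =
            "https://kafka-config.internal.example/v1/clients/%s/config";

    public static KafkaProducer<String, String> producerFor(String topic) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(String.format(CONFIG_SERVICE_URL, topic))).GET().build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Standardised client settings come from the Config Service, so the
        // application never hard-codes broker addresses.
        Properties props = new Properties();
        props.load(new StringReader(body));
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```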
Security and access control
SASL-based authentication is enforced across all clusters. Applications must authenticate and declare producer or consumer intent before connecting. Each topic onboarding request generates a unique authentication token scoped to that topic. The plaintext port that existed in the earlier deployment has been removed.
This replaced a configuration where applications could connect without identifying themselves. The shift to ACLs gave the Kafka team visibility into exactly which producers and consumers were using each topic, which both improved security posture and simplified capacity planning.
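PayPal confirms SASL authentication and topic-scoped tokens but does not publish the mechanism or token format. Purely for illustration, the sketch below shows a client configured for SASL over TLS using the PLAIN login module, with the topic token standing in as the credential; the real deployment may use a different mechanism entirely.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;

public class SaslClientSettings {

    // The mechanism and credential format here are assumptions for illustration only;
    // PayPal's write-up says SASL is enforced but does not publish the details.
    public static Properties withSasl(Properties props, String appId, String topicToken) {
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"" + appId + "\" password=\"" + topicToken + "\";");
        return props;
    }
}
```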
Topic onboarding
Application teams submit onboarding requests through an internal Onboarding Dashboard. A capacity analysis tool integrated into the workflow evaluates cluster placement before the topic is provisioned. MirrorMaker groups for any cross-zone replication requirements are set up as part of the same workflow.
Cross-cluster replication with MirrorMaker
PayPal uses approximately 2,000 MirrorMaker nodes to replicate events between clusters. Replication serves two purposes: disaster recovery (maintaining a mirror in a secondary data centre) and inter-security-zone communication (moving data from a high-classification zone to a lower one where consuming services operate).
In 2021, Lei Huang and Na Yang presented the data centre migration at Kafka Summit Americas. The team migrated 20+ clusters (1,000+ broker and ZooKeeper nodes, 60+ MirrorMaker groups) across data centres with zero service outage, zero message loss, and no application-side changes required. MirrorMaker groups provided the traffic cut-over path, with traffic gradually shifting from source clusters to destination clusters through the MirrorMaker pipeline.
Producer architecture
The BigQuery pipeline producer uses a batch size of 250,000 bytes and a linger time of 5 seconds to optimise throughput over latency. For the user behaviour tracking pipeline, a Chronicle Queue (memory-mapped file store) sits between the HTTP ingestion layer and the Kafka producer. When a broker restart or partition rebalance leaves the producer temporarily unable to send, events buffer to disk rather than being dropped. Once the cluster is available again, the Chronicle Queue drains into the Kafka producer in order.
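The batch size and linger values translate directly into standard producer configuration. In the sketch below, only batch.size and linger.ms come from the article; the bootstrap servers, serializers, and topic name are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // from the Config Service in practice
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Values reported for the BigQuery pipeline producer: favour larger batches
        // over per-record latency.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 250_000); // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5_000);    // wait up to 5 s to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("analytics-events", "key", "value")); // hypothetical topic
        }
    }
}
```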
Consumer architecture
The BigQuery consumer application runs on 75 production machines (2-core CPU, 8 GB RAM each) and is built on Project Reactor (reactive streams). Key configuration values: max.poll.records=1000, fetch.min.bytes of approximately 100 KB, and max.partition.fetch.bytes of approximately 1 MB. The application manages 300 partitions in production. GC was switched from Concurrent Mark Sweep to G1GC during benchmarking, which improved stability under high throughput. Each instance achieved sustained throughput of approximately 950,000 events per minute.
Session timeout is set to 60 seconds, with a heartbeat interval of 20 seconds.
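Expressed as standard consumer configuration, the reported values look like the sketch below. The group id, topic, and bootstrap servers are placeholders; the polling, fetch, session, and heartbeat settings are the ones cited above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BigQueryPipelineConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // from the Config Service in practice
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bigquery-pipeline");     // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Values reported in the consumer benchmarking write-up.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 100 * 1024);            // ~100 KB
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024); // ~1 MB
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60_000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 20_000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("analytics-events")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.value()));
        }
    }
}
```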
Stream processing
Spark Streaming jobs consume Kafka topics directly for risk detection and metrics aggregation pipelines. The user behaviour tracking pipeline uses Akka Streams (via the Squbs reactive framework) for in-process stream processing and fan-in from multiple HTTP sources. The MergeHub Akka primitive is used to dynamically merge HTTP connection streams into the downstream enrichment flow, so each new HTTP connection becomes its own sub-stream without reconfiguring the pipeline.
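A compact way to see the MergeHub pattern is the Akka Streams Java DSL sketch below. The enrichment step and console sink are stand-ins for PayPal's actual enrichment flow and Kafka producer stage; the point is that the hub materialises a single Sink that any number of per-connection sub-streams can attach to at runtime.

```java
import java.util.List;

import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.MergeHub;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class TrackingFanIn {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("tracking");

        // Stand-in enrichment step; the real pipeline enriches clickstream events
        // before handing them to the Kafka producer stage.
        Flow<String, String, NotUsed> enrich = Flow.of(String.class).map(e -> e + " [enriched]");

        // MergeHub materialises a Sink that any number of sub-streams can attach to;
        // everything they emit is merged into the single downstream flow.
        Sink<String, NotUsed> fanIn =
                MergeHub.of(String.class)
                        .via(enrich)
                        .to(Sink.foreach(System.out::println)) // stand-in for the Kafka producer sink
                        .run(system);

        // Each new HTTP connection would materialise its own source and attach it to the hub.
        Source.from(List.of("click-1", "click-2")).runWith(fanIn, system);
    }
}
```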
Client library ecosystem
PayPal maintains three internal client libraries:
- A connectivity library for resilient connection to the Config Service and brokers
- A monitoring library that registers Kafka client metrics via Micrometer and forwards them to SignalFx
- A security library that manages SSL certificate lifecycle
Client libraries are available for Java, Python, Go, Scala, and Node.js, as well as for applications built on the internal Squbs framework.
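The monitoring library itself is internal, but the open-source building blocks it relies on (Micrometer's Kafka client binder and the SignalFx registry) are publicly available. A rough sketch of the wiring, with the access token and registry defaults as assumptions, might look like this:

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaClientMetrics;
import io.micrometer.signalfx.SignalFxConfig;
import io.micrometer.signalfx.SignalFxMeterRegistry;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClientMetricsSetup {

    public static MeterRegistry signalFxRegistry(String accessToken) {
        SignalFxConfig config = new SignalFxConfig() {
            @Override public String accessToken() { return accessToken; }
            @Override public String get(String key) { return null; } // use defaults for everything else
        };
        return new SignalFxMeterRegistry(config, Clock.SYSTEM);
    }

    // Binds the consumer's internal Kafka metrics (fetch rates, lag, etc.) to the registry,
    // which then ships them to SignalFx on its publishing interval.
    public static void instrument(KafkaConsumer<?, ?> consumer, MeterRegistry registry) {
        new KafkaClientMetrics(consumer).bindTo(registry);
    }
}
```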
Special techniques and engineering innovations
Kafka Config Service is the most operationally significant internal tool. It removes the dependency between broker topology and application configuration, enabling infrastructure changes without application redeployment. At a scale of 85+ clusters, the operational leverage this provides is substantial.
Chronicle Queue buffering addresses a specific failure mode in the user behaviour tracking pipeline. Broker restarts and partition rebalances can leave the producer briefly unable to send. Without a durable buffer, events arriving during that window are either queued in memory (at risk of loss on restart) or dropped. Chronicle Queue writes to memory-mapped files on disk, providing durability at near-memory speeds. This pattern is not standard Kafka tooling; PayPal built it to meet the data loss requirements of a pipeline processing tens of billions of events per day.
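PayPal has not published the buffering code, but the open-source Chronicle Queue API makes the pattern easy to sketch. In the illustration below, the queue path, topic name, and drain loop are assumptions; error handling and redelivery are omitted for brevity.

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerBuffer {

    private final ChronicleQueue queue =
            ChronicleQueue.singleBuilder("/var/tmp/tracking-buffer").build(); // illustrative path
    private final ExcerptAppender appender = queue.acquireAppender();
    private final ExcerptTailer tailer = queue.createTailer("kafka-drain");
    private final KafkaProducer<String, String> producer;

    public DurableProducerBuffer(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Events are always written to the memory-mapped queue first, so nothing is
    // lost if Kafka is briefly unreachable or the process restarts.
    public void accept(String event) {
        appender.writeText(event);
    }

    // A background thread drains the queue into the Kafka producer in arrival order;
    // a production version would track send results before advancing past an event.
    public void drainOnce() {
        String event;
        while ((event = tailer.readText()) != null) {
            producer.send(new ProducerRecord<>("user-behavior", event)); // hypothetical topic
        }
    }
}
```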
Smart patching plugin checks under-replicated partition (URP) counts before initiating any broker patch. A patch proceeds only when the affected broker's partitions are fully replicated. This gating condition enables patching multiple clusters in parallel, with single-broker restarts proceeding in sequence within each cluster. Before this plugin, patching at 1,500+ broker scale stretched maintenance windows to multiple days.
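PayPal's plugin is internal and reads broker metrics directly, but an equivalent gate can be approximated from the outside with the standard Admin client (the allTopicNames call below requires Kafka 3.1+). The sketch checks that every partition's in-sync replica set matches its full replica set before allowing a patch to proceed.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PatchGate {

    // Returns true only if every partition on the cluster has its full replica set in sync,
    // i.e. the under-replicated partition (URP) count is zero.
    public static boolean safeToPatch(String bootstrapServers)
            throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(props)) {
            Collection<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).allTopicNames().get();
            return descriptions.values().stream()
                    .flatMap(d -> d.partitions().stream())
                    .allMatch(p -> p.isr().size() == p.replicas().size());
        }
    }
}
```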
Targeted partition reassignment modifies the default Kafka reassignment behaviour. The default algorithm reassigns all partitions on a broker, which at PayPal's partition counts produces very long rebalancing windows. PayPal's modification restricts reassignment to only the under-replicated partitions on the affected broker, which dramatically shortens the rebalancing duration.
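The selection step of that modification can be approximated with the same Admin client calls as above: gather only the under-replicated partitions hosted on the affected broker, rather than everything the broker holds, and hand just that set to the reassignment tooling (for example via Admin#alterPartitionReassignments). The sketch below is an outside illustration, not PayPal's implementation.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class TargetedReassignment {

    // Collect only the partitions on the given broker that are currently under-replicated,
    // rather than every partition the broker hosts.
    public static List<TopicPartition> underReplicatedOn(String bootstrapServers, int brokerId)
            throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(props)) {
            Collection<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).allTopicNames().get();
            return descriptions.values().stream()
                    .flatMap(d -> d.partitions().stream()
                            .filter(p -> p.isr().size() < p.replicas().size())
                            .filter(p -> p.replicas().stream().anyMatch(n -> n.id() == brokerId))
                            .map(p -> new TopicPartition(d.name(), p.partition())))
                    .collect(Collectors.toList());
        }
    }
}
```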
Upstream open-source contributions reflect the depth of operational experience the team has developed. PayPal authored three Kafka Improvement Proposals:
- KIP-351: adds a --under-min-isr flag to kafka-topics.sh for identifying partitions below minimum in-sync replicas
- KIP-427: adds an AtMinIsr partition category metric to the broker
- KIP-517: adds consumer polling behaviour metrics for observing max.poll.interval.ms compliance
All three address monitoring gaps that PayPal encountered running Kafka at scale before these metrics existed natively.
Operating Kafka at scale
Deployment model: Self-managed, on bare metal (brokers) and VMs (ZooKeeper, MirrorMaker), with QA on Google Cloud Platform.
Monitoring and alerting: The Kafka Metrics library collects metrics from brokers, ZooKeeper nodes, MirrorMakers, and Kafka clients. Metrics are registered via Micrometer and forwarded to SignalFx. Alert thresholds are tuned to fire on actionable conditions rather than informational noise. The team made deliberate decisions about which metrics to surface, discarding many default metrics that did not correlate reliably with user-facing issues.
Capacity management: Infrastructure is scaled ahead of peak periods. The topic onboarding workflow includes a capacity analysis step so that new topics are placed on clusters with headroom. Black Friday planning involves expanding broker capacity ahead of the expected traffic surge, which reached 21 million messages per second at the 2022 peak.
Developer experience: The Onboarding Dashboard and Config Service form the main interface between application teams and the Kafka platform. Application teams do not need to know which cluster a topic lives on or manage their own broker bootstrap configurations; the platform handles that. This abstraction has enabled broad adoption across teams without requiring each team to develop deep Kafka operational knowledge.
Upgrade and migration strategy: The 2021 data centre migration demonstrated PayPal's approach to large-scale infrastructure changes: use MirrorMaker pipelines to shift traffic gradually, maintain application transparency throughout, and validate at each step before cutting over. The same principles apply to broker version upgrades, where the URP-checking plugin provides the gate condition for rolling restarts.
Key takeaways for your own Kafka implementation
- Decouple application config from broker topology. PayPal's Kafka Config Service, which pushes bootstrap addresses and client config to applications at runtime, is one of the most operationally significant tools in their stack. At any meaningful scale, hard-coded broker IPs become a maintenance liability. A centralised config distribution layer makes broker changes transparent to applications.
- Add a durable buffer ahead of your producer if message loss during broker unavailability is unacceptable. Partition rebalances, leader elections, and broker restarts are not exceptional events; they happen routinely as deployments roll and clusters scale. If your producer pipeline has no durable buffer, events arriving during those windows are at risk. Chronicle Queue (or an equivalent mechanism) can absorb the gap with minimal latency overhead.
- Gate your patching and reassignment operations on under-replicated partition counts. PayPal's URP-checking plugin and targeted reassignment modification both reflect the same principle: do not proceed with infrastructure changes while replication is degraded. Building this check into your automation, rather than relying on manual observation, is what enables parallel patching at scale.
- Invest in your QA environment's fidelity early. PayPal's migration of QA to GCP delivered a 75% cost reduction and 40% performance improvement. A QA environment that accurately reflects production topology is the foundation for safe rollouts, and the cost of maintaining it does not have to scale linearly with production.
- Contribute upstream when you find gaps in observability. KIP-351, KIP-427, and KIP-517 all addressed monitoring blind spots that PayPal encountered before these metrics existed in Kafka natively. If you are running Kafka at scale and hitting the edges of what the built-in metrics tell you, the upstream KIP process is a viable path to a permanent fix.
Sources and further reading
Primary sources:
- Monish Koppa, "Scaling Kafka to Support PayPal's Data Growth," PayPal developer blog, September 2023 — https://developer.paypal.com/community/blog/scaling-kafka-to-support-paypals-data-growth/
- Monish Koppa, "Scaling Kafka to Support PayPal's Data Growth," The PayPal Technology Blog, Medium — https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab
- Maulin Vasavada, "Marching Toward a Trillion Kafka Messages per Day," Kafka Summit 2020, Confluent — https://www.confluent.io/resources/kafka-summit-2020/marching-toward-a-trillion-kafka-messages-per-day-running-kafka-at-scale-at-paypal/
- Lei Huang and Na Yang, "How Did We Move the Mountain? Migrating 1 Trillion+ Messages Per Day Across Data Centers at PayPal," Kafka Summit Americas 2021, Confluent — https://www.confluent.io/events/kafka-summit-americas-2021/how-did-we-move-the-mountain-migrating-1-trillion-messages-per-day-across/
- Kevin Lu, Maulin Vasavada, Na Yang, "Kafka at PayPal: Enabling 400 Billion Messages a Day," Strata Data Conference NY 2018 — https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/69459.html
- Archit Agarwal et al., "Scaling Kafka Consumer for Billions of Events," The PayPal Technology Blog, Medium — https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000
- Sakshi Ganeriwal, "Tracking User Behavior at Scale with Streaming Reactive Big Data Systems," The PayPal Technology Blog, Medium, September 2018 — https://medium.com/paypal-tech/https-medium-com-paypal-engineering-tracking-user-behavior-at-scale-f0c584c4ddd4
- Michael Zeltser, "From Big Data to Fast Data in Four Weeks — Part 2," The PayPal Technology Blog, Medium, November 2016 — https://medium.com/paypal-tech/from-big-data-to-fast-data-in-four-weeks-or-how-reactive-programming-is-changing-the-world-part-2-29a9f7d48318
Kafka Improvement Proposals authored by PayPal:
- KIP-351 — Add --under-min-isr option to describe topics command
- KIP-427 — Add AtMinIsr topic partition category (new metric & TopicCommand option)
- KIP-517 — Add consumer metrics to observe user poll behavior
If you are running Kafka in production and want deeper visibility into your brokers, consumer lag, and topic health, Kpow gives you a full-featured monitoring and management UI that connects to any Kafka cluster in minutes. You can try it free for 30 days.