
How Cash App uses Apache Kafka in production
Cash App runs Apache Kafka as the primary eventing backbone for its distributed services, operated by an internal PubSub platform team that also provides SQS for job queues and NATS for real-time client updates. Two aspects of Cash App's Kafka story are worth studying in detail. The first is the evently-cloud bridge service, which absorbs Kafka client initialisation overhead at a proxy layer and shrank from 100 pods to 15 without SLA regression. The second is a data-centric encryption model that assigns a distinct AWS KMS-backed key to each Kafka topic rather than to each service, and that encrypts more than 8 TB of data per day across Kafka and gRPC transport.
Company overview
Cash App is a consumer financial services product developed by Block Inc., offering peer-to-peer payments, banking, investing, and Bitcoin services to tens of millions of users primarily in the United States and the United Kingdom. Engineering at Cash App operates independently from Block's other brands, with shared infrastructure provided by dedicated platform teams.
Kafka serves as Cash App's central asynchronous message bus. The PubSub platform team owns the Kafka cluster and offers it as a managed internal service alongside complementary messaging primitives: Kafka for durable event streaming, SQS for job queues, and NATS for low-latency real-time client updates. The team also operates Kpow as the Kafka management UI.
Key Kafka milestones:
- Ongoing (pre-2022): Cash App runs Kafka for asynchronous event pub/sub; the evently-cloud Kotlin bridge service serves legacy services that cannot manage Kafka client connections directly.
- Post-2022 (Block acquisition of Afterpay): Cash App's Delta Lake on Databricks adoption enables Spark Declarative Pipelines that read streaming data directly from Kafka for machine learning and data science workloads.
- December 2024: Yoav Amit, Gelareh Taban, and Matthew Miller publish "Encryption using data-specific keys," documenting the shift from service-centric to data-centric encryption across Kafka and gRPC transport. Cash App is encrypting more than 8 TB/day at this point.
Cash App's Kafka use cases
Asynchronous event pub/sub
The primary use of Kafka at Cash App is as the durable pub/sub backbone for distributed services. Engineering teams across Cash App produce and consume events through a managed platform offering rather than owning Kafka client connections themselves. This keeps operational overhead concentrated in the PubSub platform team while providing consistent eventing semantics to product teams.
Legacy service integration via evently-cloud
Not all Cash App services were built to manage long-lived Kafka client connections. The evently-cloud service fills this gap: it is a Kotlin service that exposes a REST API for fetching events from Kafka topics at specified offsets. Legacy services interact with Kafka through HTTP calls to evently-cloud, which handles client lifecycle, caching, and offset management internally. Alec Holmes documented this architecture on the Cash App Code Blog.
Machine learning and data science pipelines
Cash App uses Databricks Spark Declarative Pipelines to read streaming data from Kafka, feeding machine learning and data science workloads. The data lake uses a medallion architecture — bronze, silver, and gold tiers — on Delta Lake, governed by Unity Catalog and AWS Glue. This allows ML teams to work with Kafka-sourced event data without managing stream processing infrastructure directly.
Encrypted data transport
Cash App treats Kafka as part of its data transport infrastructure alongside gRPC. Both are subject to the same application-layer encryption model: encryption keys are scoped per Kafka topic rather than per service, and any producer or consumer must hold the relevant topic key via AWS IAM. This model was documented by Yoav Amit, Gelareh Taban, and Matthew Miller in December 2024.
Scale and throughput
- Encryption volume: More than 8 TB of data per day encrypted at the application layer across Kafka and gRPC transport (Yoav Amit, Gelareh Taban, Matthew Miller — December 2024).
- evently-cloud pod count: Reduced from 100 pods to 15 pods after request-affinity optimisation, with no SLA regression (Alec Holmes).
- Total fleet size: Topic count, partition count, and cluster count are not publicly disclosed by Cash App.
Cash App's Kafka architecture
Deployment
Cash App's Kafka cluster runs on AWS. The evently-cloud service and its surrounding infrastructure run on Kubernetes using Istio as the service mesh. Before Istio, each service used a dedicated Elastic Load Balancer; after migration, load balancing is handled client-side via Envoy sidecars.
evently-cloud: Kafka bridge service
Kafka clients take several seconds to initialise and seek to specific topic offsets. In a service that creates a new client per request, this latency is incurred on every API call. evently-cloud avoids this by maintaining one long-lived cached Kafka client per topic consumer. When a request arrives, the service checks its cache; if a client already exists for that topic, the request is handled immediately without initialisation overhead.
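The caching idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not evently-cloud's code (which is a Kotlin service): `FakeKafkaClient`, `ClientCache`, and the topic names are hypothetical, and the fake client stands in for a real Kafka consumer so the example runs without a broker.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeKafkaClient:
    """Hypothetical stand-in for a real Kafka consumer (not a library class)."""
    topic: str
    init_cost_s: float = 0.0  # a real client takes seconds; 0 keeps the demo fast

    def __post_init__(self) -> None:
        time.sleep(self.init_cost_s)  # simulates connection and seek overhead

    def fetch(self, offset: int, limit: int) -> list[str]:
        # A real client would seek to `offset` and poll the broker.
        return [f"{self.topic}@{o}" for o in range(offset, offset + limit)]


class ClientCache:
    """Keeps one long-lived client per topic for the lifetime of the pod."""

    def __init__(self) -> None:
        self._clients: dict[str, FakeKafkaClient] = {}
        self.initialisations = 0

    def get(self, topic: str) -> FakeKafkaClient:
        client = self._clients.get(topic)
        if client is None:  # cache miss: pay the initialisation cost once
            client = FakeKafkaClient(topic)
            self._clients[topic] = client
            self.initialisations += 1
        return client


cache = ClientCache()
for offset in (0, 10, 20):
    events = cache.get("payments").fetch(offset, limit=2)
cache.get("refunds").fetch(0, limit=1)
print(cache.initialisations)  # one initialisation per distinct topic
```

Four requests across two topics trigger only two initialisations. The cache is per pod, so the benefit depends on requests for a given topic reaching a pod that already holds its client.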
The caching strategy only works if requests for the same topic consistently reach the same pod. Cash App solved this with Istio-based request affinity: Kafka topic identifiers are passed as HTTP headers, and Istio's consistent-hashing load balancing routes requests for the same topic to the same pod. The affinity rule is topic-level, not session-level, which means the benefit applies across all consumers of a given topic regardless of which service made the request.
After deploying request affinity, the evently-cloud deployment was reduced from 100 pods to 15 pods. Cache hit rates improved, and the SLA was unaffected.
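To make the routing behaviour concrete, here is a minimal hash-ring sketch in Python. It is a generic consistent-hashing illustration, not Envoy's or Istio's implementation; the `HashRing` class, the pod names, the virtual-node count, and the topic names are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash so routing is identical across processes.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring keyed by a header value (illustrative)."""

    def __init__(self, pods: list[str], vnodes: int = 100) -> None:
        # Each pod owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, topic_header: str) -> str:
        # Pick the first ring point clockwise from the header's hash.
        idx = bisect.bisect(self._points, _hash(topic_header)) % len(self._ring)
        return self._ring[idx][1]


pods = [f"evently-{i}" for i in range(15)]
ring = HashRing(pods)
topics = [f"topic-{i}" for i in range(200)]
before = {topic: ring.route(topic) for topic in topics}

# Scale down by one pod and see which topics change owner.
smaller = HashRing(pods[:-1])
moved = [topic for topic in topics if smaller.route(topic) != before[topic]]
print(f"{len(moved)} of {len(topics)} topics moved after removing one pod")
```

Every request carrying the same topic value maps to the same pod, and removing a pod only remaps the topics that pod owned, so the other pods keep their warm clients.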
Per-topic encryption at application layer
Cash App's encryption model shifted from a service-centric design — where services held keys tied to their own identity — to a data-centric design, where encryption keys are associated with the data itself. For Kafka, this means a distinct Tink keyset per topic. Any workload that produces or consumes from a topic must have explicit access to that topic's key via AWS IAM policies and role chaining.
Keys are stored as Tink keysets in S3 buckets, backed by AWS KMS Customer Managed Keys. The practical effect is that a compromised service cannot decrypt data from topics outside its own access grants, even if it can reach the Kafka broker. The same model applies to gRPC transport, making it a consistent encryption boundary across Cash App's data infrastructure. Cash App was encrypting more than 8 TB/day across both channels as of December 2024.
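The access pattern can be sketched as follows. This is a toy model under loud assumptions: `TopicKeyStore` is a hypothetical in-memory stand-in for keysets in S3 guarded by IAM, the role and topic names are invented, and the cipher is a standard-library toy used only so the example runs without dependencies. It is not Tink and must not be used to protect real data.

```python
import hashlib
import hmac
import os


class TopicKeyStore:
    """Hypothetical stand-in for per-topic keysets guarded by IAM grants."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}
        self._grants: dict[str, set[str]] = {}

    def create_topic_key(self, topic: str, allowed_roles: set[str]) -> None:
        self._keys[topic] = os.urandom(32)
        self._grants[topic] = set(allowed_roles)

    def key_for(self, topic: str, role: str) -> bytes:
        # The grant is checked per topic, not per service identity.
        if role not in self._grants.get(topic, set()):
            raise PermissionError(f"{role} has no grant for topic {topic}")
        return self._keys[topic]


def _subkey(key: bytes, label: bytes) -> bytes:
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy authenticated encryption (illustration only, NOT Tink AEAD)."""
    nonce = os.urandom(16)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def decrypt(key: bytes, message: bytes) -> bytes:
    nonce, body, tag = message[:16], message[16:-32], message[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered message")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


store = TopicKeyStore()
store.create_topic_key("payments", {"ledger-service"})
store.create_topic_key("support-chat", {"support-service"})

record = encrypt(store.key_for("payments", "ledger-service"), b"amount=42")
print(decrypt(store.key_for("payments", "ledger-service"), record))

try:
    store.key_for("payments", "support-service")
except PermissionError as err:
    print("denied:", err)
```

A role without a grant never obtains the topic key, and a key from a different topic fails authentication, which mirrors the boundary described above.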
Producer and consumer architecture
Producers and consumers in Cash App's primary Kafka workloads use Kafka clients managed either directly by services with the capability to do so, or indirectly through evently-cloud for services that cannot. In the encrypted pipeline, Tink keysets are loaded by producers before serialisation and by consumers before deserialisation, with the key lookup resolved via AWS KMS at access time.
Special techniques and engineering innovations
Topic-level client caching with Istio affinity
The combination of Kafka client caching in evently-cloud and Istio consistent-hashing affinity at the load balancer layer is an effective pattern for bridging Kafka's stateful client model with a stateless HTTP API. The Kafka client's initialisation cost is paid once per topic per pod lifecycle, not once per request. The Istio affinity rule ensures the cached client is reused rather than abandoned when new requests arrive. The 85% pod reduction (100 to 15) is a concrete outcome of applying this pattern.
Data-centric encryption keys scoped to Kafka topics
Most application-layer encryption in distributed systems is service-centric: a service holds a key and uses it for all its interactions. Cash App's model is data-centric: the key belongs to the topic (or the gRPC endpoint), not the service. The implication for Kafka is that topic access control is enforced at the key level as well as at the broker ACL level. Removing a service's grant on a topic's keyset and rotating the key cuts that service off from anything encrypted under the new key, although data already encrypted under the old key stays readable to any party that still holds it. This is a meaningful security boundary in a financial services context where data sensitivity varies significantly by topic.
Operating Kafka at scale
Deployment model: Cash App's Kafka cluster runs on AWS and appears to be self-managed: there is no public statement from Cash App about using a managed Kafka service such as Amazon MSK or Confluent Cloud for its primary cluster. The PubSub platform team operates the cluster alongside SQS and NATS as a unified internal messaging platform.
Kafka management UI: The PubSub platform team operates Kpow as the Kafka management interface. This is the same team responsible for provisioning and maintaining the Kafka cluster.
Encryption key management: Tink keysets are stored in S3 and backed by AWS KMS Customer Managed Keys. AWS IAM policies govern which workloads can access which topic keys. The data safety levels framework, documented separately by Block's security team, governs classification and retention policies for data that flows through Kafka.
Developer experience: The evently-cloud service reduces the barrier to Kafka adoption for engineering teams that are not equipped to manage persistent Kafka client connections. Teams that can manage clients directly do so; those that cannot use the REST API. The PubSub platform team maintains both paths.
Service mesh: Istio (Envoy sidecars) handles service-to-service communication for evently-cloud and the surrounding Kubernetes workloads. The migration from per-service ELBs to Istio client-side load balancing was what enabled the request-affinity optimisation that reduced evently-cloud's pod footprint.
Challenges and how they solved them
Kafka client initialisation latency in evently-cloud
Kafka clients take seconds to initialise and seek to specific topic offsets. For a service that creates a new client per request, this overhead dominates response time. The first approach — running more pods so each one handled fewer requests — did not solve the root cause. The actual fix was to cache clients at the pod level and use Istio consistent-hashing to ensure requests for the same topic always route to the same pod. Once each pod's client cache is warm, initialisation overhead disappears from the hot path. Pod count fell from 100 to 15 without SLA regression.
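A small simulation shows why adding pods without affinity does not remove the cost. The pod counts (100 and 15) come from the figures above; the topic count, request count, and the modulo-hash routing are assumptions chosen for illustration, and the routing is a simplification of consistent hashing.

```python
import hashlib
import random


def cold_starts(route, pods: int, requests: list[str]) -> int:
    """Count client initialisations: one per (pod, topic) pair first seen."""
    warm: set[tuple[int, str]] = set()
    misses = 0
    for topic in requests:
        pod = route(topic, pods)
        if (pod, topic) not in warm:
            warm.add((pod, topic))
            misses += 1
    return misses


rng = random.Random(7)  # fixed seed so the run is repeatable


def random_route(topic: str, pods: int) -> int:
    # No affinity: any pod may receive any topic.
    return rng.randrange(pods)


def affinity_route(topic: str, pods: int) -> int:
    # Topic affinity: the same topic always maps to the same pod.
    digest = hashlib.sha256(topic.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pods


topics = [f"topic-{i}" for i in range(50)]          # assumed topic count
requests = [rng.choice(topics) for _ in range(5_000)]  # assumed traffic

random_misses = cold_starts(random_route, 100, requests)
affinity_misses = cold_starts(affinity_route, 15, requests)
print("cold starts without affinity, 100 pods:", random_misses)
print("cold starts with affinity, 15 pods:", affinity_misses)
```

With affinity, each topic is initialised once in total. Without it, every pod eventually needs its own client for every topic it happens to receive, so more pods mean more cold starts rather than fewer.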
Service-centric encryption creating implicit cross-topic access
In a service-centric encryption model, a service that holds a key can decrypt any data encrypted with that key, regardless of which Kafka topic it came from. If a service's key is compromised, or if the service is misconfigured to consume from a topic it was not intended to read, it can access data it should not. Cash App's response was to move to per-topic keys under a data-centric model. Each Kafka topic has its own Tink keyset; access to that keyset is granted explicitly via AWS IAM. A service can only decrypt data from the topics it has been explicitly granted access to, regardless of what other topics it can reach at the network level.
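The difference in exposure can be shown with plain sets. This is one simplified reading of the two models, not Cash App's implementation: it assumes that under the service-centric model a consumer holding a producer's key can decrypt everything that producer writes, and all service and topic names are hypothetical.

```python
# Which topics each producer writes to (hypothetical).
writes = {
    "payments-service": {"payments", "kyc-documents"},
    "chat-service": {"support-chat"},
}

# Service-centric: consumers hold keys tied to producer identities.
held_service_keys = {"analytics": {"payments-service"}}

# Data-centric: consumers hold explicit grants on individual topic keys.
topic_grants = {"analytics": {"payments"}}


def decryptable_service_centric(consumer: str) -> set[str]:
    # Holding a producer's key exposes every topic that producer writes.
    producers = held_service_keys.get(consumer, set())
    return set().union(*(writes[p] for p in producers))


def decryptable_data_centric(consumer: str) -> set[str]:
    # Only explicitly granted topics can be decrypted.
    return set(topic_grants.get(consumer, set()))


print(sorted(decryptable_service_centric("analytics")))
print(sorted(decryptable_data_centric("analytics")))
```

In this toy setup the analytics consumer only needs the payments topic, yet the service-centric key also exposes kyc-documents; the per-topic grant limits it to what was explicitly allowed.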
Key contributors
- Alec Holmes (with Ryan Hall and Jan Zantinge) — authored "Request Affinity with Istio," covering the evently-cloud architecture and the optimisation that reduced pod count from 100 to 15.
- Yoav Amit, Gelareh Taban, Matthew Miller — authored "Encryption using data-specific keys" (December 2024), documenting the per-topic encryption model across Kafka and gRPC transport.
Key takeaways for your own Kafka implementation
- Kafka client initialisation overhead can be absorbed at a bridge layer. If you have services that need Kafka access but cannot manage persistent client connections, a proxy service with topic-level client caching and request affinity is worth considering. The operational win at Cash App was an 85% reduction in pod count for the same workload.
- Istio consistent-hashing is a practical way to implement topic affinity without a dedicated load balancer per service. Passing the Kafka topic name as an HTTP header and configuring a consistent-hash affinity rule gives you stateful routing in a stateless HTTP layer with minimal infrastructure overhead.
- Per-topic encryption keys are more granular than per-service keys, and the operational overhead is manageable at scale. Cash App encrypts more than 8 TB/day under this model. The main requirement is an IAM structure that grants topic key access explicitly rather than implicitly through service identity.
- Separating Kafka, job queues, and real-time push into distinct primitives simplifies service design. Cash App's PubSub platform offers Kafka for durable streaming, SQS for queue-based job processing, and NATS for sub-second client updates as separate tools with distinct semantics. This avoids the anti-pattern of using Kafka for workloads where its durability guarantees are unnecessary overhead.
Sources and further reading
Primary sources
- Alec Holmes, "Request Affinity with Istio" (Cash App Code Blog): https://code.cash.app/request-affinity-with-istio
- Yoav Amit, Gelareh Taban, Matthew Miller, "Encryption using data-specific keys" (December 2024): https://code.cash.app/encryption-using-data-keys