
How JPMorgan uses Apache Kafka in production
JPMorgan Chase runs one of the largest disclosed Kafka deployments in financial services: 102 clusters, 510 nodes, 13,000 topics, and 1.5 PB of configured storage — figures the firm's engineers shared publicly at Kafka Summit San Francisco in 2019. By 2021, that platform was ingesting 400 billion events per day in production.
The engineering problem at the centre of the deployment is coordination at firm scale: tens of thousands of applications across business lines, geographies, and regulatory boundaries, all needing reliable, low-latency event exchange without each team managing its own messaging infrastructure.
Company overview
JPMorgan Chase & Co. is a global financial services firm operating across investment banking, commercial banking, financial transaction processing, asset management, and retail banking. The firm employs more than 300,000 people and processes trillions of dollars in transactions annually, making data consistency, latency, and auditability first-order engineering constraints.
Apache Kafka entered JPMorgan Chase's platform as part of a broader shift toward microservices and event-driven architecture. The firm's CTO, Andrew J. Lang, described the goal as creating "a digital nervous system" connecting disconnected and siloed systems at scale. In January 2019, the firm inducted Confluent into its Hall of Innovation programme, confirming Confluent Platform as the foundation of its streaming infrastructure.
JPMorgan Chase's Kafka use cases
JPMorgan Chase's Kafka adoption started with straightforward infrastructure concerns and expanded into a firm-wide event backbone as the platform matured.
Log aggregation and metrics transport were the earliest use cases. Before the platform broadened in scope, teams used Kafka to consolidate operational logs and carry metadata and monitoring signals across services — a pattern that provided enough operational value to justify the investment before more complex streaming architectures were in place.
Microservices event bus is now the primary use case at firm scale. JPMorgan Chase operates approximately 8,000 microservices built on an internal framework called Photon. Kafka acts as the communications layer between those services, with Photon providing client-side resiliency on top of the open-source Kafka drivers: automatic consumer failover and a "phone-home" mechanism that reports runtime configurations (including batch size) to simplify production diagnosis.
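Photon itself is internal to JPMorgan Chase and not publicly available, but the "phone-home" idea is easy to sketch. The following is an illustrative Python sketch, not Photon's actual API; the `ConfigReporter` name and the choice of reported settings are assumptions.

```python
# Illustrative sketch of a "phone-home" wrapper in the spirit of Photon.
# The class name and reported fields are hypothetical; the real framework
# reports runtime configuration over Kafka for production diagnosis.

class ConfigReporter:
    """Collects client configuration at startup so operators can inspect
    live settings without restarting the service."""

    def __init__(self, service_name, sink):
        self.service_name = service_name
        self.sink = sink  # stand-in for a Kafka topic or HTTP endpoint

    def report(self, client_config):
        # Surface the settings that commonly matter in diagnosis,
        # such as batch size and consumer group membership.
        snapshot = {
            "service": self.service_name,
            "batch.size": client_config.get("batch.size"),
            "group.id": client_config.get("group.id"),
        }
        self.sink.append(snapshot)  # stand-in for publishing the snapshot
        return snapshot
```

The value of the pattern is that batch size and similar tuning knobs become visible to operators at diagnosis time, rather than requiring a redeploy with extra logging.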
Near real-time stream processing and data moving pipelines expanded Kafka's role beyond service-to-service messaging. The firm uses Kafka to move data across cloud environments and datacentres in near real-time, supporting fan-in (many producers, centralised processing) and fan-out (one source, many downstream consumers) patterns. By 2021, this also encompassed data mesh architecture, with Kafka Connect serving as the data movement layer between domain-owned data stores.
Tenant-level mirroring extends Kafka's reach across datacentres for teams that need their topics replicated geographically, either for disaster recovery or to serve consumers in different regions.
Scale and throughput
The scale figures JPMorgan Chase shared at Kafka Summit 2019 give a concrete picture of what "enterprise Kafka" looks like in a regulated financial institution:
- 102 Kafka clusters in total, 40 of them in production
- 510 nodes
- 13,000 topics, 1,300 of them in production
- 1.5 PB of configured storage
The ratio of total to production clusters (102 to 40) reflects the firm's multi-environment discipline: non-production clusters for development, testing, and staging are tracked and governed under the same platform rather than being managed ad hoc by individual teams. The gap between total topics (13,000) and production topics (1,300) tells a similar story: provisioning is self-service and relatively unconstrained, but production promotion is controlled.
JPMorgan Chase's Kafka architecture
Cluster topology and multi-availability-zone design
Each Kafka cluster runs on five nodes with a replication factor of four. This configuration tolerates the simultaneous failure of two nodes, which matters in a regulated environment where cluster availability affects downstream financial processes. Each cluster also runs its own dedicated ZooKeeper ensemble alongside the Kafka brokers, Schema Registry instances, and monitoring agents, keeping coordination local to the cluster rather than shared across the estate.
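The "two simultaneous failures" figure follows from Kafka's write-availability arithmetic: with `acks=all`, a partition keeps accepting writes while at least `min.insync.replicas` replicas are alive. The talk did not state the min ISR for these clusters, but a value of 2 (an assumption) matches the figure:

```python
# With acks=all, the number of broker failures a cluster tolerates while
# still accepting writes is rf - min_isr. JPMorgan Chase's five-node
# clusters use rf=4; min.insync.replicas=2 is assumed here, not stated
# publicly.

def write_failures_tolerated(replication_factor: int, min_isr: int) -> int:
    if min_isr > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_isr

print(write_failures_tolerated(4, 2))  # 2 broker failures, writes still accepted
```

Note that durability and write availability diverge here: rf=4 means data survives the loss of three replicas, but writes stall earlier, once the in-sync set drops below the minimum.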
For clusters requiring higher fault tolerance, JPMorgan Chase uses a 2.5 AZ stretch cluster configuration: brokers are organised into four logical racks across two availability zones (two racks per AZ), with replication factor four and a minimum in-sync replica count of three. ZooKeeper quorum is placed in a separate region to ensure it remains available even if one AZ is lost. This setup provides continued write availability during a single AZ failure without requiring manual intervention.
Multi-tenancy and the control plane
Running 100+ clusters for 400 applications across a large enterprise requires a systematic approach to isolation and governance. JPMorgan Chase addresses this through logical namespace-based multi-tenancy: each application team gets a namespace that abstracts the underlying physical cluster. Entitlements, governance policies, and quota limits are enforced at the namespace level, while the physical infrastructure is shared.
A centralised, data-driven control plane sits above the clusters and provides a self-service API layer for topic provisioning, schema registration, Kafka Streams deployment, and quota management. All Kafka artifacts created by an application are maintained within that application's namespace, making ownership explicit. The control plane also exposes a centralised data lineage view derived from the namespace metadata.
Kafka Connect architecture
JPMorgan Chase built a managed Kafka Connect service to handle near real-time data integration between Kafka and external systems. The design centres on a non-obvious deployment decision: instead of hosting Kafka Connect instances on shared provider infrastructure, the firm deploys Connect instances within each customer's own Kubernetes namespace.
This placement solves the credential problem. When a Connect instance sits on shared infrastructure, source and sink systems must expose their credentials to the service provider. Deploying Connect into the customer namespace means the connectors authenticate to source and sink systems using the customer's own credentials, with no cross-boundary credential sharing required. Each deployment gets a unique HTTPS URL (for example, connect1.jpmchase.net), along with dedicated configuration, offset, and status topics.
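The isolation between deployments comes from standard Kafka Connect distributed-worker settings. A minimal sketch of how a control plane might derive a per-namespace worker configuration — the topic-naming convention and helper name are assumptions, but the property keys are real Connect worker settings:

```python
# Hypothetical helper that derives a distributed Kafka Connect worker
# configuration for a tenant namespace, mirroring the pattern described
# above: each deployment gets its own group id and dedicated config,
# offset, and status topics. Topic names are illustrative.

def connect_worker_config(namespace: str, bootstrap: str) -> dict:
    return {
        "bootstrap.servers": bootstrap,
        "group.id": f"connect-{namespace}",
        # Dedicated internal topics keep one tenant's offset management
        # isolated from every other deployment.
        "config.storage.topic": f"{namespace}.connect-configs",
        "offset.storage.topic": f"{namespace}.connect-offsets",
        "status.storage.topic": f"{namespace}.connect-status",
    }
```

Because the `group.id` and internal topics differ per namespace, two tenants' Connect clusters cannot interfere with each other's rebalances or offset state even though they share the same Kafka brokers.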
A centralised and federated control plane manages all Connect deployments across namespaces, keeping operational visibility consistent even as the deployment boundary sits inside each team's Kubernetes environment.
Multi-datacentre replication
JPMorgan Chase uses MirrorMaker 2.0 for cross-datacentre topic replication. Both active-active and active-passive configurations are in use, with configurable topic patterns determining which topics are replicated and in which direction. Active-active replication supports geographically distributed consumers; active-passive configurations serve disaster recovery scenarios where a standby cluster needs to stay current.
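In MirrorMaker 2.0's dedicated mode, both patterns reduce to a small properties file. A minimal active-passive sketch, with illustrative cluster aliases and topic pattern (these are not JPMorgan Chase's actual settings):

```properties
# Minimal MirrorMaker 2.0 sketch for an active-passive pairing.
clusters = primary, dr
primary.bootstrap.servers = primary-kafka:9092
dr.bootstrap.servers = dr-kafka:9092

# Replicate matching topics one way, primary -> dr.
primary->dr.enabled = true
primary->dr.topics = payments\..*

# For active-active, enable the reverse flow as well:
# dr->primary.enabled = true
```

The per-direction `topics` pattern is what makes the replication direction and scope configurable per topic, as described above.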
Special techniques and engineering innovations
Connection Profiles abstract cluster addressing from application configuration. Rather than configuring a Kafka client with a list of broker addresses, applications reference a profile name (such as "NAD1700"). The control plane resolves that profile to the current broker addresses at runtime. When infrastructure changes — broker replacements, cluster migrations, IP reassignments — applications require no reconfiguration. The profile name is the stable handle.
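The indirection itself is simple; the value is in where the mapping lives. A sketch of the resolution step, with the control-plane lookup stubbed by a dict (in production this mapping is served by the control plane, not embedded in code):

```python
# Sketch of connection-profile indirection: applications reference a
# stable profile name, and broker addresses are resolved at runtime.
# The directory contents here are invented for illustration.

PROFILE_DIRECTORY = {
    "NAD1700": ["broker-a:9092", "broker-b:9092", "broker-c:9092"],
}

def resolve_bootstrap_servers(profile: str) -> str:
    brokers = PROFILE_DIRECTORY.get(profile)
    if brokers is None:
        raise KeyError(f"unknown connection profile: {profile}")
    # Kafka clients accept a comma-separated bootstrap server list.
    return ",".join(brokers)
```

When brokers are replaced or a cluster is migrated, only the directory entry changes; every application keeps the same profile name in its configuration.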
Health Index is a metrics-based cluster availability score computed from broker and partition health signals and fed into Prometheus. When a cluster's Health Index falls below a threshold, it is excluded from routing, and application-level resiliency controls react accordingly. This creates an automated circuit-breaker layer that sits between raw Kafka metrics and consumer behaviour without requiring each application team to implement their own health logic.
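JPMorgan Chase has not published the Health Index formula, but the shape of such a score is straightforward. An illustrative sketch — the signals, equal weights, and 0.8 threshold are all assumptions:

```python
# Illustrative cluster Health Index: a weighted score over broker and
# partition signals, with a routing cut-off. Weights and threshold are
# invented for illustration; the real formula was not published.

ROUTING_THRESHOLD = 0.8

def health_index(brokers_online: int, brokers_total: int,
                 under_replicated: int, partitions_total: int) -> float:
    broker_health = brokers_online / brokers_total
    partition_health = 1 - under_replicated / partitions_total
    return 0.5 * broker_health + 0.5 * partition_health

def routable(index: float) -> bool:
    # Clusters scoring below the threshold are excluded from routing.
    return index >= ROUTING_THRESHOLD
```

The design point is that routing decisions consume one number with a defined threshold, rather than each application interpreting raw broker metrics independently.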
Custom Schema Registry authorisation extension adds operation-level access control to the Schema Registry REST API. A REST extension intercepts incoming requests and separates read operations (GET) from write operations (POST, PUT, DELETE), enforcing role-appropriate permissions before requests reach the schema storage layer. This prevents teams from unintentionally overwriting schemas owned by other namespaces while keeping read access open for discovery.
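The real extension is a Java REST extension running inside the Schema Registry, but the authorisation logic it describes is easy to illustrate. A Python sketch, with an invented role model:

```python
# Sketch of method-based authorisation for a Schema Registry REST API:
# reads stay open for discovery, writes require an explicit role.
# The "schema-writer" role name is invented for illustration.

READ_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"POST", "PUT", "DELETE"}

def authorise(method: str, caller_roles: set) -> bool:
    if method in READ_METHODS:
        return True  # reads stay open for schema discovery
    if method in WRITE_METHODS:
        return "schema-writer" in caller_roles
    return False  # reject anything unexpected
```

Intercepting on HTTP method rather than on individual endpoints keeps the policy small and makes it hard for a new write endpoint to slip past the check.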
Federated ADFS OAuth integrates JPMorgan Chase's existing Active Directory Federation Services identity provider with Kafka's pluggable SASL/OAUTHBEARER authentication (KIP-255). Rather than managing a separate identity system for Kafka, the firm routes authentication through its existing corporate identity infrastructure, which simplifies credential lifecycle management and satisfies audit requirements around identity traceability.
Orchestrated broker patching replaces manual rolling restarts with an automated lifecycle management system. The process sequences broker updates, validates cluster health between each step, and monitors for under-replicated partitions before proceeding. If a step introduces replication lag or partition leadership instability, patching pauses until the cluster recovers, reducing the risk of data unavailability during routine maintenance.
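The control loop behind orchestrated patching can be sketched in a few lines. The hooks (`patch_broker`, `urp_count`) stand in for real automation; the polling structure is an assumption about how such a pipeline is typically built, not JPMorgan Chase's actual implementation:

```python
import time

# Sketch of an orchestrated rolling patch: update one broker at a time,
# then wait for under-replicated partitions to clear before moving on.

def rolling_patch(brokers, patch_broker, urp_count,
                  poll_seconds=0, max_polls=100):
    for broker in brokers:
        patch_broker(broker)
        # Pause until replication has caught up; proceeding while
        # partitions are under-replicated risks data unavailability.
        polls = 0
        while urp_count() > 0:
            polls += 1
            if polls > max_polls:
                raise RuntimeError(f"cluster did not recover after patching {broker}")
            time.sleep(poll_seconds)
```

A realistic version would also check partition leadership stability and alert a human when the recovery timeout trips, rather than failing outright.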
Photon Framework Kafka client resiliency addresses gaps in the open-source Kafka client libraries for consumer failover scenarios. Photon, the internal Spring Boot microservices framework, wraps the Kafka client with automatic failover logic and a "phone-home" capability that surfaces runtime configuration values — including batch size settings that commonly affect consumer throughput — for use in production diagnosis without requiring a service restart.
Operating Kafka at scale
JPMorgan Chase runs Confluent Platform on-premises, as part of a hybrid estate that also moves data to and from cloud environments. The operational model is built around the centralised control plane rather than per-cluster management.
Self-service provisioning is the default path for teams. Topic creation, schema registration, quota adjustments, and Kafka Streams deployment all go through the control plane's self-service API, keeping infrastructure teams out of routine provisioning workflows. Governance constraints — topic size limits, throughput quotas, namespace isolation — are encoded in the control plane and enforced automatically, so teams work within boundaries without needing to understand the physical cluster topology.
Metrics and observability rely on Prometheus for data collection. The Health Index pipeline feeds cluster-level signals into Prometheus, and the Photon Framework produces standard metrics and distributed traces across all Kafka producers and consumers. This gives platform operators a consistent observability interface across the 8,000 microservices that use Kafka, rather than requiring each team to instrument their Kafka clients independently.
Tenant quota enforcement is per-namespace: topic retention limits and producer/consumer throughput quotas are set at the namespace level and enforced by the control plane. All Kafka artifacts are owned by the namespace that created them, making quota attribution and governance auditing straightforward.
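Kafka's native quota mechanism works per user or client id, so a namespace-level policy has to be fanned out to the clients registered under that namespace. A sketch of that translation — the namespace-to-client convention is an assumption, though `producer_byte_rate` and `consumer_byte_rate` are Kafka's real quota property names:

```python
# Sketch of namespace-level quota policy translated into Kafka quota
# entries. The control plane would apply these values to every client
# registered under the namespace; the entity convention is illustrative.

def namespace_quotas(namespace: str, produce_mbps: float, consume_mbps: float) -> dict:
    mb = 1024 * 1024
    return {
        "namespace": namespace,
        # Kafka's quota configuration keys, in bytes per second.
        "producer_byte_rate": int(produce_mbps * mb),
        "consumer_byte_rate": int(consume_mbps * mb),
    }
```

Keeping the policy at namespace granularity is what makes quota attribution auditable: every byte of throughput traces back to exactly one owning team.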
Managed Kafka Connect operations are handled through a federated control plane that spans all Connect deployments across customer namespaces. Each deployment has its own scoped configuration, offset, and status topics, so an issue in one deployment's offset management does not affect others. The federated control plane provides operators with a consistent view of all Connect deployments without requiring direct access to each customer namespace.
Challenges and how they solved them
The Kafka Connect credential problem. When Kafka Connect runs on shared infrastructure, connectors that need to read from or write to source and sink systems must present credentials to a service provider outside the team's trust boundary. For a regulated financial institution, this is unacceptable: credentials for core banking systems cannot be handed to a shared service operator. JPMorgan Chase's resolution was to shift the deployment boundary: Connect instances run inside the customer's Kubernetes namespace, so all authentication to source and sink systems uses credentials that remain within the customer's control perimeter throughout.
Multi-tenant Kafka Streams application ID collisions. In a shared Kafka environment, Kafka Streams applications identify themselves using an application ID, which is also used to namespace internal topics (repartition and changelog topics). If two teams in different namespaces choose the same application ID, their internal topics collide. JPMorgan Chase identified this as an operational risk in a multi-tenant deployment at their scale. The control plane's namespace model and self-service API provide the framework for enforcing application ID uniqueness within and across namespaces, though the specific enforcement mechanism was not detailed in public presentations.
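One plausible enforcement mechanism — not confirmed by the talks — is a provisioning-time rule that application IDs must be prefixed with the owning namespace, which makes cross-namespace collisions impossible by construction:

```python
# Sketch of provisioning-time application.id validation for Kafka
# Streams in a multi-tenant cluster. The prefix rule is one plausible
# mechanism; the actual enforcement was not described publicly.

_registered_ids = set()

def register_streams_app(namespace: str, application_id: str) -> str:
    if not application_id.startswith(f"{namespace}."):
        raise ValueError(f"application.id must start with '{namespace}.'")
    if application_id in _registered_ids:
        raise ValueError(f"application.id already in use: {application_id}")
    _registered_ids.add(application_id)
    return application_id
```

Because Kafka Streams derives repartition and changelog topic names from the application ID, a unique, namespace-prefixed ID also keeps those internal topics inside the owning namespace's topic prefix.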
Offset management across replicated clusters. Active-passive replication with MirrorMaker 2.0 introduces a consumer group offset synchronisation challenge: offsets in the primary cluster do not translate directly to offsets in the replica cluster, because topic partition assignments and log segment layouts may differ. JPMorgan Chase identified this as an operational complexity at their scale, particularly for clusters used in disaster recovery scenarios where consumers may need to resume from a specific position in the replica.
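MirrorMaker 2.0 does provide a mechanism for this: it periodically emits checkpoints that map a consumer group's upstream offsets to equivalent downstream offsets. A minimal sketch of checkpoint-based translation, with an in-memory dict standing in for MM2's internal checkpoints topic (the data values are invented):

```python
# Sketch of checkpoint-based offset translation. MirrorMaker 2.0 writes
# periodic checkpoints mapping (group, topic, partition) upstream offsets
# to downstream offsets; consumers failing over resume from the last
# checkpointed downstream position.

# (group, source_topic, partition) -> (upstream_offset, downstream_offset)
checkpoints = {
    ("orders-app", "payments", 0): (1_500_000, 1_499_950),
}

def translate_offset(group: str, topic: str, partition: int) -> int:
    _, downstream = checkpoints[(group, topic, partition)]
    # Anything consumed upstream after the checkpoint is re-read in the
    # replica: translation trades a bounded amount of reprocessing for
    # correctness, so consumers must be idempotent.
    return downstream
```

The practical consequence is the comment in the code: failover consumers should expect at-least-once delivery across the replication boundary.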
Operational burden at 100+ clusters. Managing individual clusters becomes untenable at the scale JPMorgan Chase operates. Monitoring, patching, quota enforcement, and health assessment across 100 clusters requires automation rather than manual workflows. The Health Index, centralised control plane, and orchestrated patching pipeline were all built in response to this operational pressure rather than as upfront design decisions.
Full tech stack
- Apache Kafka (Confluent Platform), run on-premises
- Apache ZooKeeper, a dedicated ensemble per cluster
- Kafka Connect for near real-time data integration
- MirrorMaker 2.0 for cross-datacentre replication
- Schema Registry, with a custom REST authorisation extension
- Kafka Streams, deployed through the control plane
- Kubernetes, hosting per-namespace Kafka Connect deployments
- Prometheus, for metrics and the Health Index pipeline
- Photon, the internal Spring Boot microservices framework
- Active Directory Federation Services (ADFS) with SASL/OAUTHBEARER authentication
Key contributors
- Vishnu Balusu and Ashok Kadambala — presented the multi-tenant Kafka platform at Kafka Summit San Francisco 2019
- Ashok Kadambala and Shreesha Hebbar — presented the Kafka Connect architecture at Kafka Summit Americas 2021
- Haiying Guo and Vidya Meyyappan — presented the Photon microservices framework in 2021
- Andrew J. Lang — CTO, who framed the platform goal as "a digital nervous system"
Key takeaways for your own Kafka implementation
- Separate your deployment boundary from your management boundary. JPMorgan Chase deploys Kafka Connect inside customer namespaces but manages all deployments from a centralised control plane. This keeps credentials within the team's trust perimeter while preserving operational visibility for the platform team — a pattern worth considering any time a shared service needs access to systems that require privileged credentials.
- Build indirection into client configuration early. Connection Profiles — stable name references that resolve to broker addresses at runtime — mean that infrastructure changes (IP changes, broker migrations, cluster splits) have no impact on application configuration. If you hard-code broker addresses into application configs today, you will pay the cost of that decision at every infrastructure change.
- Encode governance as code in the control plane, not in process. Topic size limits, throughput quotas, and namespace isolation enforced automatically by a control plane are more reliable than guidelines enforced by review. At scale, the gap between what teams are supposed to do and what they actually do widens; a control plane that makes the wrong thing impossible is more durable than one that makes the right thing easy.
- Design your cluster health model before you need it. JPMorgan Chase's Health Index was built in response to the operational burden of managing 100+ clusters. A computed availability score that drives routing decisions is more useful than raw metrics alone, but it requires you to define what "healthy enough to route to" means for your workloads before you are in a position where that definition matters urgently.
- Multi-tenant Kafka Streams requires namespace-scoped application ID enforcement. Application ID collisions in a shared cluster cause internal topic conflicts that are difficult to diagnose and resolve without downtime. If you are offering Kafka Streams as a shared platform capability, enforce application ID uniqueness at the provisioning layer rather than relying on teams to coordinate independently.
Sources and further reading
- Vishnu Balusu and Ashok Kadambala (JP Morgan Chase): "Secure Kafka at scale in true multi-tenant environment" — Kafka Summit San Francisco 2019
- Ashok Kadambala and Shreesha Hebbar (JP Morgan Chase): "Changing landscapes in data integration — Kafka Connect for near real-time data moving pipelines" — Kafka Summit Americas 2021
- Confluent: "Confluent inducted into JPMorgan Chase Hall of Innovation" — January 2019
- Haiying Guo and Vidya Meyyappan (JP Morgan Chase): "Driving native cloud adoption at scale through a microservice framework" — Next at Chase, 2021
If you are managing a Kafka deployment at a similar scale and want visibility into consumer lag, broker health, and topic configuration across your clusters, Kpow gives you a single interface for monitoring and managing Apache Kafka. You can connect it to any Kafka cluster in minutes and try it free for 30 days.