
How Goldman Sachs uses Apache Kafka in production


Factor House
May 16th, 2026

Goldman Sachs runs Apache Kafka across at least three separate business divisions, each with a different deployment model and a different set of problems to solve. One team operates a resilient on-premises cluster designed around the assumption that failures will always occur. Another migrated its cluster to Amazon MSK to reduce operational overhead. A third uses Kafka as the messaging bus for a payments platform that must process instant transactions with exactly-once guarantees and strict ISO20022 schema compliance.

The result is less a single Kafka story and more three distinct engineering contexts that happen to share the same underlying technology. What connects them is a consistent focus on data integrity and availability in an industry where those properties are non-negotiable.

Company overview

Goldman Sachs is a global investment banking and financial services firm operating across investment banking, asset management, consumer banking, and transaction banking. The scale and latency requirements of financial markets mean the firm has long invested in real-time data infrastructure.

Goldman Sachs's public Kafka work spans at least eight years of documented activity. Anton Gorshkov, Managing Director at Goldman Sachs Asset Management (GSAM), described the firm's on-premises Kafka architecture at QCon New York as early as 2017. In the same year, the Global Investment Research (GIR) division began planning to refactor its on-premises stack, a process that ultimately culminated in a migration to Amazon MSK. The Transaction Banking (TxB) division separately built a payments platform on Amazon MSK and presented its monitoring and resiliency testing approach at Kafka Summit Europe 2021.

Date Event
2017 Anton Gorshkov presents GSAM Core Platform Kafka architecture at QCon New York and Kafka Summit
2018 GIR begins refactoring its on-premises technology stack, starting the path toward AWS migration
2021 Sheikh Araf and Ameya Panse present TxB Kafka monitoring and resiliency approach at Kafka Summit Europe 2021
2023 AWS blog post published detailing GIR's completed migration from on-premises Kafka to Amazon MSK using MirrorMaker 2.0

Goldman Sachs's Kafka use cases

GSAM Core Front Office Platform: financial transaction streaming

The Core Platform team within Goldman Sachs Asset Management uses Kafka as a pub-sub messaging platform for financial transaction events. Downstream consumers receive messages via three pathways: a direct sink to an in-memory RDBMS used for operational troubleshooting, a Spark Streaming job that processes events and feeds a second in-memory RDBMS queried via REST and Vert.x APIs, and a batch ETL job that persists the full event stream to a data lake for audit and governance purposes. Every message carries a globally unique identifier assigned by the upstream service, which makes idempotent replay practical in outage recovery scenarios.

Global Investment Research: publication workflow orchestration

The GIR division uses Kafka as the backbone for orchestrating research publication workflows. When a research document is ready for distribution, Kafka coordinates the handoff between approximately 12 microservices running across 30 instances, routing publications to the correct downstream channels: the client portal, the API, and mobile notifications.

Transaction Banking: instant payments processing

TxB uses Kafka as the inter-process messaging layer between the Payment Gateway and other TxB microservices. The division chose Kafka for this use case in part because Kafka's native producer idempotency and schema registry integration map well to the requirements of instant payments: exactly-once processing guarantees and strict message format validation. In internal lab testing, Kafka outperformed traditional queuing technologies by approximately 175% for this workload.

Scale and throughput

Public scale figures are available for the GSAM Core Front Office Platform. As of 2017, the cluster handled approximately 1.5 TB per week of traffic, with peak production rates reaching around 1,500 messages per second.

For GIR, the cluster served approximately 12 microservices across 30 instances, though no message rate or throughput figures have been published.

No public scale figures are available for the TxB cluster.

Goldman Sachs's Kafka architecture

GSAM Core Front Office Platform

The GSAM platform runs on an on-premises virtualised cluster spanning multiple data centres within the New York City metro area. The team's design principle is to treat multiple co-located data centres as a single redundant logical data centre rather than as primary and backup sites. With inter-site network latency of approximately 4ms between New York City and New Jersey facilities, synchronous replication between sites is feasible and is the approach taken.

A minimum of three in-sync replicas is required at all times. There is no primary/backup distinction and no failover event to manage: all replicas are considered equivalent.
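This "always three in-sync replicas, no failover" posture maps onto a small set of broker and topic settings. The values below are an illustrative sketch, not the published GSAM configuration:

```properties
# Illustrative settings for a symmetric multi-site deployment with no
# primary/backup distinction (actual GSAM values are not public).
default.replication.factor=4          # replicas spread across metro data centres
min.insync.replicas=3                 # writes fail rather than run under-replicated
unclean.leader.election.enable=false  # never promote an out-of-sync replica

# Producer side:
acks=all                              # a write succeeds only once all ISRs have it
```

With these settings, losing a broker or even a site shrinks the ISR set but requires no reconfiguration, matching the no-failover model described above.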

The platform uses the Kafka Connect API to attach downstream consumers to the cluster. Those consumers fall into three categories: a direct RDBMS sink for operational visibility, a Spark Streaming job that feeds a real-time query layer, and a batch ETL sink to a data lake.

Data protection is layered across four mechanisms. Starting from the most granular: synchronous disk-level replication via EMC Symmetrix Remote Data Facility (SRDF), asynchronous replication between sites, nightly batch replication, and tape backup. This arrangement is designed to cover failure scenarios ranging from a single VM loss (estimated to occur 1-5 times per year) to a full data centre outage (modelled as a once-in-20-years event).

Global Investment Research

GIR initially ran an on-premises Kafka cluster. The refactoring that enabled the move to Amazon MSK began in 2018, and the cutover to MSK was complete by the time the AWS blog post describing it was published in 2023. Connectivity between the on-premises network and the MSK cluster in AWS uses AWS PrivateLink with Network Load Balancers, routed through a GS Transit Account VPC. Services use the Spring Kafka client library.

Avro is used for message serialisation, with a Schema Registry managing schema evolution. The Schema Registry itself was migrated to AWS as part of the same project, with the _schemas topic replicated via MirrorMaker 2 before the cluster cutover.

Transaction Banking

TxB runs its Payment Gateway on Amazon ECS with AWS Fargate as the compute layer. Amazon MSK handles all inter-process communication between the Payment Gateway and downstream TxB microservices. Schema compliance is enforced via a Schema Registry configured to validate all messages against ISO20022 specifications.

Stream processing

GSAM uses Apache Spark Streaming for stateful processing of events downstream of Kafka, feeding results to in-memory RDBMS instances that serve end-user-facing APIs.

Kafka Connect ecosystem

GSAM uses the Kafka Connect API to connect message sinks to its on-premises cluster. Documented sinks include the in-memory RDBMS and the batch ETL data lake sink. Specific connector implementations have not been published.

Special techniques and engineering innovations

Symmetric multi-datacenter replication without failover

The GSAM platform avoids the operational complexity of primary/backup data centre topologies by treating the NYC metro area as a single logical deployment zone. With synchronous replication and a minimum of three in-sync replicas spread across facilities, there is no failover scenario: if a broker or data centre becomes unavailable, the remaining in-sync replicas continue serving reads and writes without any reconfiguration. The team documented a range of actual failure frequencies: single VM failures happen 1-5 times per year with no impact; simultaneous loss of two VM hosts occurs approximately once a year with processing halted for some topics; a three-host or data centre-level failure is rare enough to be modelled as a once-in-20-years event.

Belt-and-suspenders data protection

Beyond Kafka's built-in replication, the GSAM platform layers additional data protection mechanisms, from most to least granular: synchronous disk-level replication via EMC SRDF, which mirrors data between storage arrays in different facilities; asynchronous replication between sites; nightly batch replication; and tape backup. The team's stated principle is that failure will always occur, and each protection layer is sized against a specific failure probability.

Globally unique identifiers for idempotent replay

Every message produced to the GSAM cluster is tagged with a globally unique identifier by the upstream service before it enters Kafka. If an outage requires messages to be resent, the identifier allows consumers to detect and discard duplicates without additional coordination or changes to consumer logic.
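The dedup mechanics are simple enough to sketch. The message shape and class below are illustrative, not Goldman Sachs code; the point is that a stable upstream-assigned GUID makes replays safe without any consumer-side coordination:

```python
# Sketch of consumer-side duplicate detection using upstream-assigned GUIDs.
import uuid

class IdempotentConsumer:
    def __init__(self):
        # In production this would be a bounded or persistent store,
        # not an unbounded in-memory set.
        self.seen = set()
        self.processed = []

    def handle(self, message):
        guid = message["id"]
        if guid in self.seen:
            return False  # replayed duplicate; discard silently
        self.seen.add(guid)
        self.processed.append(message["payload"])
        return True

consumer = IdempotentConsumer()
msg = {"id": str(uuid.uuid4()), "payload": "trade-booked"}
assert consumer.handle(msg) is True    # first delivery is processed
assert consumer.handle(msg) is False   # replay of the same GUID is discarded
assert consumer.processed == ["trade-booked"]
```

Because the GUID is assigned before the message enters Kafka, an outage recovery can simply re-send everything from a known-good point and let consumers drop what they have already seen.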

Atomic cutover migration with MirrorMaker 2

GIR evaluated two migration approaches for its move to Amazon MSK: an atomic cutover with a planned downtime window, and an incremental hybrid migration that would keep both clusters running in parallel. The team chose the atomic approach to avoid rewriting the Spring Kafka library configuration across all 12 services. The migration sequence was: replicate the Avro Schema Registry using a unidirectional MirrorMaker 2 stream, replicate all topics using a custom MM2 replication policy that stripped the default topic-name prefixes, stop all services, reconfigure DNS endpoints from on-premises brokers to MSK brokers, restart services, and validate end-to-end. Total downtime was approximately 2 hours within a 7-hour migration window.
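A MirrorMaker 2 setup of this shape might look like the fragment below. This is an illustrative sketch: the GIR team wrote a custom replication policy whose class name has not been published, though Kafka has shipped an equivalent `IdentityReplicationPolicy` since 3.0:

```properties
# Illustrative MM2 configuration for a one-way, prefix-free replication
# ahead of an atomic cutover (cluster aliases here are hypothetical).
clusters = onprem, msk
onprem->msk.enabled = true
onprem->msk.topics = .*

# Preserve source topic names instead of the default "onprem." prefix,
# so consumers need no topic renames after the DNS cutover.
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
```

Stripping the prefix is what makes the cutover atomic in practice: after DNS repoints to the MSK brokers, every service finds its topics under the same names it used on-premises.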

ISO20022 schema validation with fail-fast enforcement

TxB enforces ISO20022 message format compliance at the producer and consumer level via Schema Registry integration. If a producer or consumer presents a message that does not conform to the registered schema, it is rejected before any further processing occurs. The team chose this fail-fast behaviour deliberately: in a payments context, processing a malformed payment message would have a worse outcome than rejecting it at the boundary.
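The fail-fast shape is easy to illustrate. The required-field list below is a toy stand-in for real ISO20022 validation, which in TxB's case is handled by the Schema Registry rather than hand-rolled code:

```python
# Minimal sketch of fail-fast schema enforcement at the boundary.
# REQUIRED_FIELDS is a toy stand-in for ISO20022 validation.
REQUIRED_FIELDS = {"MsgId", "CreDtTm", "Amt", "Ccy"}

class SchemaViolation(Exception):
    pass

def validate_payment(message: dict) -> dict:
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        # Reject before any processing occurs: a malformed payment
        # must never reach downstream services.
        raise SchemaViolation(f"missing fields: {sorted(missing)}")
    return message

valid = {"MsgId": "M1", "CreDtTm": "2021-05-01T10:00:00Z",
         "Amt": "10.00", "Ccy": "EUR"}
assert validate_payment(valid) is valid

rejected = False
try:
    validate_payment({"MsgId": "M2"})
except SchemaViolation:
    rejected = True
assert rejected  # malformed message stopped at the boundary
```

The key property is that rejection happens before any side effect, so a rejected message can be corrected and re-submitted, whereas a processed malformed payment cannot be unwound as cheaply.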

Kafka producer idempotency for exactly-once payments

TxB uses Kafka's native producer idempotency feature to achieve exactly-once delivery semantics for payment messages. The team notes that traditional queuing technologies do not provide native producer idempotency, making Kafka a better fit for this specific requirement without additional application-level deduplication logic.
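On the client side, enabling this is a small amount of producer configuration. The values below are the standard Kafka producer settings; TxB's actual client configuration has not been published:

```properties
# Standard producer settings for Kafka's native idempotency
# (illustrative; TxB's exact configuration is not public).
enable.idempotence=true   # broker de-duplicates retries via producer ID + sequence number
acks=all                  # required when idempotence is enabled
max.in.flight.requests.per.connection=5   # must be <= 5 to preserve ordering with retries
```

With these settings the broker discards duplicate writes caused by producer retries, which is the "without additional application-level deduplication logic" point made above.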

Operating Kafka at scale

GSAM monitoring

The GSAM platform exposes a REST service that provides cluster insights: topic metadata, consumer lag, and in-sync replica counts. Metrics are collected at three levels: application, JVM, and infrastructure. All metrics feed into a time-series database with centralised alerting.
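The lag figure such a REST service reports is simple arithmetic: per partition, the log-end offset minus the group's committed offset. A minimal sketch, with hard-coded offsets standing in for values a real service would fetch from the cluster:

```python
# Sketch of the arithmetic behind a "consumer lag" endpoint.
# Keys are (topic, partition) tuples; offsets are hard-coded for illustration.
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    return {
        tp: end_offsets[tp] - committed.get(tp, 0)
        for tp in end_offsets
    }

end = {("trades", 0): 1500, ("trades", 1): 900}
acked = {("trades", 0): 1480, ("trades", 1): 900}
lags = consumer_lag(end, acked)
assert lags == {("trades", 0): 20, ("trades", 1): 0}
```

Feeding this per-partition figure into a time-series database is what makes centralised alerting on lag growth possible.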

TxB monitoring and resiliency testing

TxB monitors its Kafka clusters using DataDog dashboards. A JMX agent sidecar collects metrics from producers and consumers, covering error rates, connection rates, latencies, and consumer lag, giving the team a live view across the entire service footprint.

Cluster health is additionally tracked by a custom heartbeat application that generates alerts in DataDog when it detects degraded cluster state.

Beyond monitoring, TxB runs regular game days where the team simulates various failure scenarios across the full client infrastructure to validate recovery behaviour and improve availability characteristics before those failures occur in production.

Deployment models

GSAM Core Platform runs self-managed Kafka on virtualised on-premises infrastructure. GIR migrated to Amazon MSK after 2018. TxB runs on Amazon MSK with compute on Amazon ECS and AWS Fargate.

Challenges and how they solved them

MirrorMaker 2 flush timeout too short during GIR migration

During topic replication in the GIR migration, the default MirrorMaker 2 flush timeout of 5 seconds caused failures. The team increased it to 30 seconds to accommodate the volume of data being replicated.
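Since MirrorMaker 2 runs on the Kafka Connect framework, this timeout most likely corresponds to the standard Connect worker setting below (the property name is an assumption; the 5 s default and 30 s fix are from the source):

```properties
# Kafka Connect worker setting raised for the GIR migration.
# Default is 5000 ms, which failed under the replication volume.
offset.flush.timeout.ms=30000
```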

Message size limits exceeded in transit

Default settings for max.request.size and max.message.bytes were too small for some GIR messages during replication. Both parameters were increased in the MSK and MirrorMaker 2 configuration before the full cutover.
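Both limits have to move in step, on the producer/MirrorMaker side and on the broker side. The property names are the standard Kafka ones named in the text; the sizes below are illustrative, since the final values were not published:

```properties
# Producer / MirrorMaker 2 side (default max.request.size is 1 MB):
max.request.size=10485760      # illustrative 10 MB limit

# Broker / topic side on MSK (default max.message.bytes is ~1 MB):
max.message.bytes=10485760     # keep the broker limit in step with producers
```

If only one side is raised, oversized messages still fail, just at a different point in the pipeline.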

Network Load Balancer idle timeout shorter than Kafka client timeout

AWS Network Load Balancers have a default idle connection timeout of 350 seconds. The Kafka client default for connections.max.idle.ms is 540 seconds, meaning connections would be silently dropped by the NLB before the Kafka client detected and re-established them. GIR resolved this by setting connections.max.idle.ms to a value below 350 seconds on all client services.
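The fix is a one-line client setting. Any value below 350 seconds works; 5 minutes is a common choice:

```properties
# Close idle connections before the NLB does (350 s fixed timeout),
# instead of at the Kafka client default of 540 s.
connections.max.idle.ms=300000
```

With this in place the client proactively recycles idle connections, so it never tries to write to a socket the NLB has already silently dropped.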

Full tech stack

Category Technology Notes
Message broker Apache Kafka (on-premises, virtualised) GSAM Core Front Office Platform
Message broker Amazon MSK GIR (post-migration) and TxB instant payments platform
Schema registry Avro + Schema Registry Message serialisation and schema governance in GIR and TxB; ISO20022 validation in TxB
Stream processing Apache Spark Streaming Stateful processing downstream of Kafka in GSAM; feeds in-memory RDBMS query layer
Connectors Kafka Connect Message sink connectors in GSAM Core Platform
Migration tooling Apache Kafka MirrorMaker 2.0 Topic and schema registry replication during GIR on-prem to MSK migration
Compute Amazon ECS + AWS Fargate Payment Gateway container compute in TxB
Networking AWS PrivateLink + Network Load Balancers Secure connectivity between on-premises network and MSK during GIR migration
Kafka client library Spring Kafka Used across GIR microservices
Monitoring DataDog Cluster monitoring dashboards and alerting in TxB
Metrics collection JMX agent sidecar Producer and consumer metrics collection in TxB
Storage replication EMC Symmetrix Remote Data Facility (SRDF) Synchronous disk-level replication between GSAM on-premises sites
Storage sinks In-memory RDBMS Real-time and troubleshooting query layer downstream of Kafka in GSAM
API layer Vert.x / REST APIs Application API layer serving data from in-memory RDBMS in GSAM

Key contributors

Name Title / team Contribution
Anton Gorshkov Managing Director, GSAM Core Platform Presented GSAM Kafka architecture and resilience design at QCon New York 2017 and Kafka Summit
Sheikh Araf Associate, Transaction Banking Presented TxB Kafka monitoring and resiliency testing at Kafka Summit Europe 2021
Ameya Panse Associate, Transaction Banking Presented TxB Kafka monitoring and resiliency testing at Kafka Summit Europe 2021
Zachary Whitford Associate, Global Investment Research Co-authored AWS blog post on GIR on-premises to MSK migration (2023)
Richa Prajapati Vice President, Global Investment Research Co-authored AWS blog post on GIR on-premises to MSK migration (2023)
Aldo Piddiu Vice President, Global Investment Research Co-authored AWS blog post on GIR on-premises to MSK migration (2023)

Key takeaways for your own Kafka implementation

  • Model your failure probability explicitly before designing replication. The GSAM team published specific failure rates for single-VM, dual-VM, and data centre-level failures. That grounding in actual observed frequencies is what justified the layered redundancy approach rather than over- or under-engineering it.
  • Treat multi-datacenter latency as a design input, not a constraint. The GSAM decision to use synchronous replication across NYC metro sites is only viable because the inter-site latency is approximately 4ms. If you are evaluating multi-site synchronous replication, measure your actual latency before committing to the topology.
  • Choose your migration strategy based on what you are willing to rewrite. GIR picked atomic cutover specifically to avoid modifying Spring Kafka configuration across 12 services. If your codebase can accommodate a gradual migration, the incremental approach gives you more rollback options. If it cannot, atomic cutover with thorough pre-migration replication may be simpler.
  • Network Load Balancer idle timeouts will silently drop Kafka connections. If you are running Kafka clients behind AWS NLBs, set connections.max.idle.ms below the NLB idle timeout rather than relying on the Kafka default of 540 seconds.
  • Schema enforcement at the boundary is a valid trade-off in high-stakes pipelines. TxB's fail-fast approach rejects non-compliant payment messages before any processing occurs. For most event streaming workloads this would be too strict, but in payments, a rejected message is recoverable; a processed malformed payment is not.

Sources and further reading

If you are managing Kafka clusters across multiple teams or deployment environments, Kpow gives you a single view of broker health, consumer lag, schema registry state, and topic activity. You can connect it to any Kafka cluster in minutes and try it free for 30 days.