
How Goldman Sachs uses Apache Kafka in production
Goldman Sachs runs Apache Kafka across at least three separate business divisions, each with a different deployment model and a different set of problems to solve. One team operates a resilient on-premises cluster designed around the assumption that failures will always occur. Another migrated its cluster to Amazon MSK to reduce operational overhead. A third uses Kafka as the messaging bus for a payments platform that must process instant transactions with exactly-once guarantees and strict ISO20022 schema compliance.
The result is less a single Kafka story and more three distinct engineering contexts that happen to share the same underlying technology. What connects them is a consistent focus on data integrity and availability in an industry where those properties are non-negotiable.
Company overview
Goldman Sachs is a global investment banking and financial services firm operating across investment banking, asset management, consumer banking, and transaction banking. The scale and latency requirements of financial markets mean the firm has long invested in real-time data infrastructure.
Goldman Sachs's public Kafka work spans at least eight years of documented activity. Anton Gorshkov, Managing Director at Goldman Sachs Asset Management (GSAM), described the firm's on-premises Kafka architecture at QCon New York as early as 2017. The following year, the Global Investment Research (GIR) division began refactoring its on-premises stack, a process that ultimately culminated in a migration to Amazon MSK. The Transaction Banking (TxB) division separately built a payments platform on Amazon MSK and presented its monitoring and resiliency testing approach at Kafka Summit Europe 2021.
| Date | Event |
| --- | --- |
| 2017 | Anton Gorshkov presents GSAM Core Platform Kafka architecture at QCon New York and Kafka Summit |
| 2018 | GIR begins refactoring its on-premises technology stack, starting the path toward AWS migration |
| 2021 | Sheikh Araf and Ameya Panse present TxB Kafka monitoring and resiliency approach at Kafka Summit Europe |
| 2023 | AWS blog post published detailing GIR's completed migration from on-premises Kafka to Amazon MSK using MirrorMaker 2.0 |
Goldman Sachs's Kafka use cases
GSAM Core Front Office Platform: financial transaction streaming
The Core Platform team within Goldman Sachs Asset Management uses Kafka as a pub-sub messaging platform for financial transaction events. Downstream consumers receive messages via three pathways: a direct sink to an in-memory RDBMS used for operational troubleshooting, a Spark Streaming job that processes events and feeds a second in-memory RDBMS queried via REST and Vert.x APIs, and a batch ETL job that persists the full event stream to a data lake for audit and governance purposes. Every message carries a globally unique identifier assigned by the upstream service, which makes idempotent replay practical in outage recovery scenarios.
Global Investment Research: publication workflow orchestration
The GIR division uses Kafka as the backbone for orchestrating research publication workflows. When a research document is ready for distribution, Kafka coordinates the handoff between approximately 12 microservices running across 30 instances, routing publications to the correct downstream channels: the client portal, the API, and mobile notifications.
Transaction Banking: instant payments processing
TxB uses Kafka as the inter-process messaging layer between the Payment Gateway and other TxB microservices. The division chose Kafka for this use case in part because Kafka's native producer idempotency and schema registry integration map well to the requirements of instant payments: exactly-once processing guarantees and strict message format validation. In internal lab testing, Kafka outperformed traditional queuing technologies by approximately 175% for this workload.
Scale and throughput
Public scale figures are available for the GSAM Core Front Office Platform. As of 2017, the cluster handled approximately 1.5 TB per week of traffic, with peak production rates reaching around 1,500 messages per second.
For GIR, the cluster served approximately 12 microservices across 30 instances, though no message rate or throughput figures have been published.
No public scale figures are available for the TxB cluster.
Goldman Sachs's Kafka architecture
GSAM Core Front Office Platform
The GSAM platform runs on an on-premises virtualised cluster spanning multiple data centres within the New York City metro area. The team's design principle is to treat multiple co-located data centres as a single redundant logical data centre rather than as primary and backup sites. Because inter-site network latency between the New York City and New Jersey facilities is approximately 4ms, synchronous replication between sites is practical, and that is the approach the team takes.
A minimum of three in-sync replicas is required at all times. There is no primary/backup distinction and no failover event to manage: all replicas are considered equivalent.
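The talks do not include configuration listings, but the invariant is straightforward to express with the Kafka Admin API. The sketch below is a minimal illustration in Java; the topic name, partition count, and replication factor of four are illustrative assumptions, not published figures:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class ResilientTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker-nyc:9092,broker-nj:9092"); // placeholder endpoints

        try (Admin admin = Admin.create(props)) {
            // Four replicas spread across facilities; a write is only considered
            // committed once at least three replicas are in sync, so losing a
            // single broker requires no failover or reconfiguration.
            // Producers must also set acks=all for this to be enforced.
            NewTopic topic = new NewTopic("transactions", 12, (short) 4)
                    .configs(Map.of(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "3"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```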
The platform uses the Kafka Connect API to attach downstream consumers to the cluster. Those consumers fall into three categories: a direct RDBMS sink for operational visibility, a Spark Streaming job that feeds a real-time query layer, and a batch ETL sink to a data lake.
Data protection is layered across four mechanisms. Starting from the most granular: synchronous disk-level replication via EMC Symmetrix Remote Data Facility (SRDF), asynchronous replication between sites, nightly batch replication, and tape backup. This arrangement is designed to cover failure scenarios ranging from a single VM loss (estimated to occur 1-5 times per year) to a full data centre outage (modelled as a once-in-20-years event).
Global Investment Research
GIR originally ran its Kafka cluster on premises and migrated it to Amazon MSK, beginning with refactoring work in 2018 and completing the cutover by 2023. Connectivity between the on-premises network and the MSK cluster in AWS uses AWS PrivateLink with Network Load Balancers, routed through a GS Transit Account VPC. Services use the Spring Kafka client library.
Avro is used for message serialisation, with a Schema Registry managing schema evolution. The Schema Registry itself was migrated to AWS as part of the same project, with the _schemas topic replicated via MirrorMaker 2 before the cluster cutover.
Transaction Banking
TxB runs its Payment Gateway on Amazon ECS with AWS Fargate as the compute layer. Amazon MSK handles all inter-process communication between the Payment Gateway and downstream TxB microservices. Schema compliance is enforced via a Schema Registry configured to validate all messages against ISO20022 specifications.
Stream processing
GSAM uses Apache Spark Streaming for stateful processing of events downstream of Kafka, feeding results to in-memory RDBMS instances that serve end-user-facing APIs.
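The 2017 talk used the original DStream-based Spark Streaming API; the sketch below shows the equivalent shape of such a job in Java using today's Structured Streaming API. Topic and endpoint names are placeholders, and the RDBMS sink is only indicated in comments:

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransactionStreamJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("txn-stream").getOrCreate();

        // Subscribe to the transaction event topic; names are illustrative.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker-nyc:9092")
                .option("subscribe", "transactions")
                .load()
                .selectExpr("CAST(value AS STRING) AS payload");

        events.writeStream()
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, epochId) -> {
                    // Each micro-batch would be written to the in-memory RDBMS
                    // that backs the REST/Vert.x query APIs (JDBC wiring omitted).
                    batch.show();
                })
                .start()
                .awaitTermination();
    }
}
```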
Kafka Connect ecosystem
GSAM uses the Kafka Connect API to connect message sinks to its on-premises cluster. Documented sinks include the in-memory RDBMS and the batch ETL data lake sink. Specific connector implementations have not been published.
Special techniques and engineering innovations
Symmetric multi-datacenter replication without failover
The GSAM platform avoids the operational complexity of primary/backup data centre topologies by treating the NYC metro area as a single logical deployment zone. With synchronous replication and a minimum of three in-sync replicas spread across facilities, there is no failover scenario: if a broker or data centre becomes unavailable, the remaining in-sync replicas continue serving reads and writes without any reconfiguration. The team documented actual failure frequencies: single VM failures happen 1-5 times per year with no impact; simultaneous loss of two VM hosts occurs approximately once a year, halting processing for some topics; a three-host or data centre-level failure is rare enough to be modelled as a once-in-20-years event.
Belt-and-suspenders data protection
Beyond Kafka's built-in replication, the GSAM platform layers four additional data protection mechanisms: synchronous disk-level replication via EMC SRDF (the most granular layer, mirroring data between storage arrays in different facilities), asynchronous replication between sites, nightly batch replication, and tape backup. The team's stated principle is that failure will always occur, and each protection layer is sized against a specific failure probability.
Globally unique identifiers for idempotent replay
Every message produced to the GSAM cluster is tagged with a globally unique identifier by the upstream service before it enters Kafka. If an outage requires messages to be resent, the identifier allows consumers to detect and discard duplicates without additional coordination or changes to consumer logic.
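The talk does not show consumer-side code. A minimal dedup sketch is below, assuming the identifier travels in a record header named event-id (a hypothetical name; the talk does not specify where the id lives) and using an in-memory set standing in for a persistent store:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

public class IdempotentReplayConsumer {

    // A persistent store in production; a HashSet keeps this sketch small.
    private final Set<String> processedIds = new HashSet<>();

    public void drain(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            Header idHeader = record.headers().lastHeader("event-id");
            if (idHeader == null) continue;
            String eventId = new String(idHeader.value(), StandardCharsets.UTF_8);
            if (!processedIds.add(eventId)) {
                continue; // already seen: a replayed duplicate, discard silently
            }
            process(record);
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing %s%n", record.value());
    }
}
```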
Atomic cutover migration with MirrorMaker 2
GIR evaluated two migration approaches for its move to Amazon MSK: an atomic cutover with a planned downtime window, and an incremental hybrid migration that would keep both clusters running in parallel. The team chose the atomic approach to avoid rewriting the Spring Kafka library configuration across all 12 services. The migration sequence was:
- replicate the Avro Schema Registry using a unidirectional MirrorMaker 2 stream
- replicate all topics using a custom MM2 replication policy that stripped the default topic-name prefixes (sketched below)
- stop all services
- reconfigure DNS endpoints from the on-premises brokers to the MSK brokers
- restart services
- validate end-to-end
Total downtime was approximately 2 hours within a 7-hour migration window.
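The AWS post describes the custom replication policy but does not publish its source. A minimal sketch of such a policy follows; by default MM2 renames a replicated topic orders to onprem.orders, and overriding the naming methods keeps the names identical:

```java
import org.apache.kafka.connect.mirror.DefaultReplicationPolicy;

/**
 * Keeps replicated topic names identical to their source names instead of
 * applying MirrorMaker 2's default "<sourceClusterAlias>." prefix.
 */
public class PrefixStrippingReplicationPolicy extends DefaultReplicationPolicy {

    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        return topic; // no "<alias>." prefix on the target cluster
    }

    @Override
    public String topicSource(String topic) {
        return null; // topic names carry no source-cluster information
    }

    @Override
    public String upstreamTopic(String topic) {
        return topic; // identity mapping back to the source name
    }
}
```

Such a class would be wired in via the replication.policy.class setting in the MM2 configuration. Since Kafka 3.0, the built-in org.apache.kafka.connect.mirror.IdentityReplicationPolicy achieves the same effect without custom code.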
ISO20022 schema validation with fail-fast enforcement
TxB enforces ISO20022 message format compliance at the producer and consumer level via Schema Registry integration. If a producer or consumer presents a message that does not conform to the registered schema, it is rejected before any further processing occurs. The team chose this fail-fast behaviour deliberately: in a payments context, processing a malformed payment message would have a worse outcome than rejecting it at the boundary.
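The posts do not name the registry implementation. The sketch below assumes Confluent's Avro serde, where disabling client-side schema auto-registration means a payload that cannot be serialised against the registered schema throws before anything is produced; endpoints are placeholders:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class FailFastProducerConfig {
    public static KafkaProducer<String, Object> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "msk-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "https://schema-registry.example.internal");
        // Fail fast: never auto-register a new schema from the producer. A
        // payload that does not serialise against the registered
        // ISO20022-derived schema throws before the message leaves the service.
        props.put("auto.register.schemas", false);
        props.put("use.latest.version", true);
        return new KafkaProducer<>(props);
    }
}
```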
Kafka producer idempotency for exactly-once payments
TxB uses Kafka's native producer idempotency feature to achieve exactly-once delivery semantics for payment messages. The team notes that traditional queuing technologies do not provide native producer idempotency, making Kafka a better fit for this specific requirement without additional application-level deduplication logic.
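A minimal sketch of what enabling that feature looks like on a Java producer; the broker endpoint, topic, and payload are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IdempotentPaymentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "msk-broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // The broker de-duplicates retries using a producer id and per-partition
        // sequence numbers, so a payment is written at most once per partition
        // even if the client retries after a transient failure.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "payment-123", "{...}"));
        }
    }
}
```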
Operating Kafka at scale
GSAM monitoring
The GSAM platform exposes a REST service that provides cluster insights: topic metadata, consumer lag, and in-sync replica counts. Metrics are collected at three levels: application, JVM, and infrastructure. All metrics feed into a time-series database with centralised alerting.
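The implementation behind that REST service is not public; the sketch below shows how the same insights, consumer lag and in-sync replica counts, can be derived from the standard Kafka Admin API:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ClusterInsight {

    /** Lag per partition for one consumer group: log end offset minus committed offset. */
    public static Map<TopicPartition, Long> consumerLag(Admin admin, String groupId)
            throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

        Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

        return committed.entrySet().stream().collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> latest.get(e.getKey()).offset() - e.getValue().offset()));
    }

    /** In-sync replica count per partition of a topic. */
    public static Map<Integer, Integer> isrCounts(Admin admin, String topic)
            throws Exception {
        return admin.describeTopics(List.of(topic)).allTopicNames().get()
                .get(topic).partitions().stream()
                .collect(Collectors.toMap(p -> p.partition(), p -> p.isr().size()));
    }
}
```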
TxB monitoring and resiliency testing
TxB monitors its Kafka clusters using DataDog dashboards. A JMX agent sidecar collects metrics from producers and consumers, covering error rates, connection rates, latencies, and consumer lag, giving the team a live view across the entire service footprint.
Cluster health is additionally tracked by a custom heartbeat application that generates alerts in DataDog when it detects degraded cluster state.
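The presenters describe the heartbeat application's behaviour but not its internals. One minimal shape for such a probe is a produce-consume round trip, assuming a dedicated ops.heartbeat topic (a hypothetical name) and with the DataDog alert wiring omitted:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;

public class HeartbeatProbe {

    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> consumer;
    private final Duration threshold = Duration.ofSeconds(5); // hypothetical alert threshold

    HeartbeatProbe(KafkaProducer<String, String> producer,
                   KafkaConsumer<String, String> consumer) {
        this.producer = producer;
        this.consumer = consumer;
        consumer.subscribe(List.of("ops.heartbeat"));
    }

    /** Writes a timestamped heartbeat and checks that it comes back within the threshold. */
    void beat() {
        producer.send(new ProducerRecord<>("ops.heartbeat",
                Long.toString(System.currentTimeMillis())));
        producer.flush();

        ConsumerRecords<String, String> records = consumer.poll(threshold);
        for (ConsumerRecord<String, String> record : records) {
            long roundTripMs = System.currentTimeMillis() - Long.parseLong(record.value());
            if (roundTripMs > threshold.toMillis()) {
                emitAlert(roundTripMs); // e.g. a DataDog event; wiring not shown
            }
        }
        if (records.isEmpty()) {
            emitAlert(-1); // nothing came back at all: cluster likely degraded
        }
    }

    private void emitAlert(long roundTripMs) {
        System.err.printf("heartbeat degraded: round-trip %d ms%n", roundTripMs);
    }
}
```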
Beyond monitoring, TxB runs regular game days where the team simulates various failure scenarios across the full client infrastructure to validate recovery behaviour and improve availability characteristics before those failures occur in production.
Deployment models
GSAM Core Platform runs self-managed Kafka on virtualised on-premises infrastructure. GIR migrated to Amazon MSK after 2018. TxB runs on Amazon MSK with compute on Amazon ECS and AWS Fargate.
Challenges and how they solved them
MirrorMaker 2 flush timeout too short during GIR migration
During topic replication in the GIR migration, the default MirrorMaker 2 flush timeout of 5 seconds caused failures. The team increased it to 30 seconds to accommodate the volume of data being replicated.
Message size limits exceeded in transit
Default settings for max.request.size and max.message.bytes were too small for some GIR messages during replication. Both parameters were increased in the MSK and MirrorMaker 2 configuration before the full cutover.
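Expressed as standard Kafka Connect worker settings for the MirrorMaker 2 deployment, the two fixes look roughly as follows. The 30-second flush timeout reflects the published fix; the size values are placeholders, since GIR's actual figures were not published:

```properties
# MirrorMaker 2 worker configuration (standard Kafka Connect settings)

# Default of 5000 ms caused offset flush failures under the replication volume
offset.flush.timeout.ms=30000

# Raise the embedded producer's request size limit (placeholder value)
producer.max.request.size=8388608

# The matching topic-level setting on the MSK side would be raised to suit:
# max.message.bytes=8388608
```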
Network Load Balancer idle timeout shorter than Kafka client timeout
AWS Network Load Balancers have a default idle connection timeout of 350 seconds. The Kafka client default for connections.max.idle.ms is 540 seconds, meaning connections would be silently dropped by the NLB before the Kafka client detected and re-established them. GIR resolved this by setting connections.max.idle.ms to a value below 350 seconds on all client services.
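Because GIR's services use Spring Kafka, the fix can be expressed as a one-line client property override. The 240-second value below is a placeholder; any value under the 350-second NLB timeout works:

```properties
# Spring Boot application.properties: pass connections.max.idle.ms through to the Kafka client
spring.kafka.properties.connections.max.idle.ms=240000
```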
Full tech stack
As documented across the sources above:
- Apache Kafka: self-managed on premises (GSAM), Amazon MSK (GIR, TxB)
- Kafka Connect and MirrorMaker 2.0
- Apache Spark Streaming
- Avro serialisation with a Schema Registry
- Spring Kafka client library (GIR)
- Vert.x and REST query APIs backed by in-memory RDBMS instances (GSAM)
- Amazon ECS with AWS Fargate (TxB compute)
- AWS PrivateLink and Network Load Balancers (GIR connectivity)
- DataDog with JMX agent sidecars (TxB monitoring)
- EMC SRDF, nightly batch replication, and tape backup (GSAM data protection)
Key contributors
- Anton Gorshkov, Managing Director, GSAM Core Platform (QCon New York and Kafka Summit, 2017)
- Sheikh Araf and Ameya Panse, Transaction Banking (Kafka Summit Europe 2021)
- Zachary Whitford, Richa Prajapati, and Aldo Piddiu (AWS Big Data Blog post on the GIR migration)
Key takeaways for your own Kafka implementation
- Model your failure probability explicitly before designing replication. The GSAM team published specific failure rates for single-VM, dual-VM, and data centre-level failures. That grounding in actual observed frequencies is what justified the layered redundancy approach rather than over- or under-engineering it.
- Treat multi-datacenter latency as a design input, not a constraint. The GSAM decision to use synchronous replication across NYC metro sites is only viable because the inter-site latency is approximately 4ms. If you are evaluating multi-site synchronous replication, measure your actual latency before committing to the topology.
- Choose your migration strategy based on what you are willing to rewrite. GIR picked atomic cutover specifically to avoid modifying Spring Kafka configuration across 12 services. If your codebase can accommodate a gradual migration, the incremental approach gives you more rollback options. If it cannot, atomic cutover with thorough pre-migration replication may be simpler.
- Network Load Balancer idle timeouts will silently drop Kafka connections. If you are running Kafka clients behind AWS NLBs, set connections.max.idle.ms below the NLB idle timeout rather than relying on the Kafka default of 540 seconds.
- Schema enforcement at the boundary is a valid trade-off in high-stakes pipelines. TxB's fail-fast approach rejects non-compliant payment messages before any processing occurs. For most event streaming workloads this would be too strict, but in payments, a rejected message is recoverable; a processed malformed payment is not.
Sources and further reading
- When Streams Fail: Implementing a Resilient Apache Kafka Cluster at Goldman Sachs — Anton Gorshkov, InfoQ / QCon New York 2017
- When Streams Fail: Kafka Off the Shore — Anton Gorshkov, InfoQ / QCon New York 2017 (video presentation)
- How Goldman Sachs migrated from their on-premises Apache Kafka cluster to Amazon MSK — Zachary Whitford, Richa Prajapati, Aldo Piddiu, AWS Big Data Blog, March 2023
- Monitoring and Resiliency Testing our Apache Kafka Clusters at Goldman Sachs — Sheikh Araf and Ameya Panse, Kafka Summit Europe 2021
- How Transaction Banking Implemented an Instant Payment Architecture for High-Throughput, Low-Latency and 24x7 Availability — Goldman Sachs Developer Blog
If you are managing Kafka clusters across multiple teams or deployment environments, Kpow gives you a single view of broker health, consumer lag, schema registry state, and topic activity. You can connect it to any Kafka cluster in minutes and try it free for 30 days.