
How Salesforce uses Apache Kafka in production
Salesforce has published more detail about its Kafka operations than most companies its size, which makes its engineering blog an unusually complete record of what it takes to run Apache Kafka at genuine enterprise scale. By 2021, the company's fleet had grown to 100+ clusters, 2,500+ brokers, 50,000+ topics, 300,000+ partitions, 40+ PB of storage, and throughput exceeding 3 trillion events per day — a figure that had been doubling annually.
The engineering problem Kafka solves at Salesforce is not a single use case but a platform-level one: how do you provide a unified, reliable event transport across a multi-tenant SaaS estate that spans hundreds of global data centres, hundreds of thousands of customer orgs, and applications ranging from operational metrics to real-time AI?
Company overview
Salesforce is a cloud-based CRM and enterprise software company founded in 1999 and headquartered in San Francisco. It serves more than 150,000 businesses globally across its Sales Cloud, Service Cloud, Marketing Cloud, and Agentforce AI product lines.
Salesforce adopted Kafka in production no later than May 2014, when it hosted an Apache Kafka edition of the SF Logging Meetup at its headquarters. By June 2014, Rajasekar Elango, a lead developer on the Monitoring and Management Team, was presenting the company's early Kafka setup at a public Kafka Meetup — a five-broker cluster coordinated by a three-node ZooKeeper ensemble, with SSL/TLS mutual authentication and Avro serialisation.
The trigger for adoption was operational observability: the DVA (Diagnostics, Visibility, and Analytics) team needed a scalable transport layer for metrics and logs across a growing global data centre footprint.
Salesforce's Kafka use cases
Salesforce's use cases cover the full breadth of what Kafka is typically used for in a large enterprise, from infrastructure telemetry to customer-facing products to AI pipelines.
Operational metrics and system health (project Ajna)
The DVA team built an internal Kafka-as-a-Service platform code-named Ajna to transport CPU metrics, machine reachability data, system logs, network flow data, application logs, JMX metrics, and custom database metrics across all global data centres to near-real-time dashboards. Ajna is the backbone from which most other Kafka use cases at Salesforce have grown.
Source: Nishant Gupta, Salesforce Engineering Blog
Log shipping pipeline
The Infrastructure group built a next-generation log pipeline on top of Ajna, routing logs from local per-data-centre Kafka clusters through cross-WAN replication to an aggregate cluster in the Secure Zone, and from there into DeepSea (an internal Hadoop/HDFS store) via a MapReduce consumer job called Kafka Camus. The target was five-nines completeness; the pipeline eventually reached seven nines.
Source: Sanjeev Sahu, Salesforce Engineering Blog
Platform Events and Change Data Capture
Salesforce's Platform Events and Change Data Capture products expose a time-ordered, immutable event stream to hundreds of thousands of multi-tenant customer orgs. Kafka's log-based storage model directly inspired the design: events are offset-based, durable, and replayable. Serialisation uses Apache Avro, with event definitions driven by Salesforce's metadata layer.
Source: Alexey Syomichev, Salesforce Engineering Blog
Real-time ML insights for Einstein Activity Capture
The Activity Platform ingests 100+ million customer interactions per day through Kafka. A chain of Kafka Streams jobs processes the stream: a Gatekeeper spam filter passes clean events to a series of extractors (TensorFlow-based, regex/keyword, Spark ML, and Einstein Conversation Insights), which feed a transformer and then a persistor that writes sales activity insights to Cassandra.
Source: Rohit Deshpande, Salesforce Engineering Blog
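A minimal Kafka Streams sketch of that shape — a filtering gatekeeper, a mapping stage standing in for the extractors, a transformer, and a terminal sink — might look like the following. Topic names, the spam heuristic, and the extraction logic are placeholders for illustration, not the Activity Platform's actual code.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ActivityInsightsTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw customer interactions (emails, meetings, calls) arrive on one topic.
        KStream<String, String> interactions = builder.stream("raw-interactions");

        interactions
                // "Gatekeeper" stage: drop spam before any expensive processing.
                .filter((key, value) -> !looksLikeSpam(value))
                // Extractor stage: pull out candidate insights (stand-in for the
                // TensorFlow / regex / Spark ML extractors in the real pipeline).
                .mapValues(ActivityInsightsTopology::extractInsight)
                // Transformer stage: normalise into the downstream insight format.
                .mapValues(insight -> insight.trim().toLowerCase())
                // Persistor stage: the real pipeline writes to Cassandra;
                // this sketch just publishes to an output topic.
                .to("sales-activity-insights");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-insights-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean looksLikeSpam(String value) {
        return value != null && value.contains("unsubscribe");   // toy heuristic
    }

    private static String extractInsight(String value) {
        return "insight:" + value;                                // toy extraction
    }
}
```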
Service Cloud chatbot audit and debug
Apache Kafka on Heroku backs the chatbot event pipeline for Service Cloud, targeting tens of millions of debug events daily with up to 7-day retention. The pipeline applies five fault-tolerance patterns on top of Kafka, described further in the special techniques section below.
Source: Mark Holton, Salesforce Engineering Blog
Hyperforce perimeter telemetry
Salesforce's Hyperforce infrastructure generates 60 TB of raw perimeter telemetry daily from commercial CDNs. Spark Streaming normalises this data into protobuf format; Kafka then carries it to an Imply/Druid hypercube (18 dimensions, 13 measurements) serving real-time debugging dashboards for live-site incidents, latency analysis, and DDoS detection with sub-second query latency and 2-minute data freshness.
Source: Srinivas Ranganathan, Salesforce Engineering Blog
Conversational AI context storage for Agentforce
The Conversation Storage Service (CSS) uses Kafka for buffering, batching, and ordering conversational context that powers real-time AI systems — sentiment analysis, agent assist, and supervisor insights — across 50,000+ concurrent conversations at a peak ingestion rate of 30,000+ events per minute. The scaling target is 100,000 concurrent conversations.
Source: Ashima Kochar and Deepak Mali, Salesforce Engineering Blog
Agentforce AI agent audit trail
Kafka pub-sub ingestion handles the variable, spike-prone traffic from 20 million model interactions per month across 500+ enterprise customers, feeding a 30-day audit data retention pipeline backed by Salesforce Data Cloud and S3.
Source: Madhavi Kavathekar, Salesforce Engineering Blog
Cross-cluster replication
Salesforce replaced Apache MirrorMaker with its own open-source tool, Mirus, for cross-data-centre replication. Mirus is built on the Kafka Connect source connector API and has been in production since April 2018.
Source: Paul Davidson, Salesforce Engineering Blog
Scale and throughput
The fleet-level figures quoted in the introduction — 100+ clusters, 2,500+ brokers, 50,000+ topics, 300,000+ partitions, 40+ PB of storage, and 3+ trillion events per day — come from the Kafka Summit Americas 2021 talk by Lei Ye and Paul Davidson. Per-system figures are drawn from the individual engineering blog posts cited throughout this article.
Salesforce's Kafka architecture
Ajna: Kafka-as-a-Service
Ajna is Salesforce's internal Kafka platform. Each production data centre runs an "Ajna Local" cluster for ingest. Data replicates over WAN to an "Ajna Aggregate" cluster in a Secure Zone (DMZ), from which consumers — Argus (time series), DeepSea (Hadoop/HDFS), and stream-processing applications — pull downstream. Before April 2018, replication used Apache MirrorMaker; since then it has used Mirus.
The platform runs on a hybrid deployment model: on-premises clusters orchestrated with Choria (self-healing automation) and public cloud clusters deployed on Kubernetes. Continuous automated patching is applied across the fleet.
Cross-site connectivity
Rather than using Kafka's advertised listeners over a VPN, Salesforce implemented a "Native Kafka Endpoint" using a standard load balancer fronting Envoy running in Layer 4 SNI-routing mode. This gives clients a stable single endpoint for cross-data-centre Kafka connections without VPN tunnels.
Source: Lei Ye and Paul Davidson, Kafka Summit Americas 2021
Topic design and schema management
Platform Events and Change Data Capture events use Apache Avro serialisation, with event definitions driven by Salesforce's metadata layer. For the Conversation Storage Service, messages are keyed by conversation ID, ensuring all events for a given conversation land in the same partition and preserve ordering for downstream AI systems.
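As a rough illustration of that combination — Avro-serialised values keyed by conversation ID — the sketch below shows a plain Java producer. The schema, topic name, and field names are invented for the example; in the real platform, event definitions come from Salesforce's metadata layer rather than a hand-written schema.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.ByteArrayOutputStream;
import java.util.Properties;

public class ConversationEventProducer {

    // Hypothetical Avro schema for a conversation event; fields are illustrative only.
    private static final Schema SCHEMA = new Schema.Parser().parse("""
        {"type":"record","name":"ConversationEvent","fields":[
          {"name":"conversationId","type":"string"},
          {"name":"utterance","type":"string"},
          {"name":"timestamp","type":"long"}]}""");

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            GenericRecord event = new GenericData.Record(SCHEMA);
            event.put("conversationId", "conv-42");
            event.put("utterance", "Hello, I need help with my order.");
            event.put("timestamp", System.currentTimeMillis());

            // Encode the record to Avro binary (no schema registry in this sketch).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
            encoder.flush();

            // Keying by conversation ID sends every event for this conversation
            // to the same partition, so consumers see them in order.
            producer.send(new ProducerRecord<>("conversation-events", "conv-42", out.toByteArray()));
        }
    }
}
```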
Producer architecture
The Agentforce audit pipeline uses Kafka's pub-sub model to absorb bursty, business-hour traffic spikes before handing off to Data Cloud. Dynamic flow control mechanisms handle real-time traffic adjustment in that pipeline.
Consumer architecture
For the Conversation Storage Service, Salesforce applies curated consumer lag limits to maintain ordering guarantees for real-time AI workflows. Where consumer lag creates read-after-write consistency problems (at 50K concurrent conversations), an in-memory cache (VegaCache) bridges the gap rather than forcing changes to the Kafka consumer configuration.
Stream processing
The Einstein Activity Capture pipeline relies entirely on Kafka Streams for its ML processing chain. The application team chose Kafka Streams specifically because upgrades can be managed without coordinating with the infrastructure team — unlike the prior Storm-based pipeline.
Source: Rohit Deshpande, Salesforce Engineering Blog
Special techniques and engineering innovations
Mirus: custom cross-cluster replication
Mirus is Salesforce's open-source replacement for Apache MirrorMaker, built on the Kafka Connect source connector API. Each MirusSourceTask runs an independent KafkaConsumer/KafkaProducer pair for a subset of partitions, enabling higher throughput over internet links than MirrorMaker's single-producer model. A KafkaMonitor thread tracks partition changes dynamically, and configuration changes can be applied through the Connect REST API without a process restart. The tool also exposes custom JMX metrics for replication lag.
Source: Paul Davidson, engineering.salesforce.com and github.com/salesforce/mirus
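To make the Connect-based design concrete, here is a heavily simplified, hypothetical source task built on the same Kafka Connect source connector API — not Mirus's actual code. It wraps a KafkaConsumer on the source cluster and returns records for the Connect worker to produce to the destination cluster; the configuration keys are invented for the sketch.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SimpleMirrorSourceTask extends SourceTask {

    private KafkaConsumer<byte[], byte[]> consumer;

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // "source.bootstrap.servers", "source.group.id" and "source.topic" are
        // invented configuration keys for this sketch, not Mirus's real config.
        Properties cfg = new Properties();
        cfg.put("bootstrap.servers", props.get("source.bootstrap.servers"));
        cfg.put("group.id", props.get("source.group.id"));
        cfg.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cfg.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumer = new KafkaConsumer<>(cfg);
        consumer.subscribe(List.of(props.get("source.topic")));
    }

    @Override
    public List<SourceRecord> poll() {
        // Consume from the source cluster; in this simplified sketch the Connect
        // worker's producer writes the returned records to the destination cluster.
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        List<SourceRecord> out = new ArrayList<>();
        for (ConsumerRecord<byte[], byte[]> r : records) {
            out.add(new SourceRecord(
                    Map.of("topic", r.topic(), "partition", r.partition()), // source partition
                    Map.of("offset", r.offset()),                           // source offset
                    r.topic(),                                              // destination topic (same name)
                    null,                                                   // let the destination pick a partition
                    Schema.OPTIONAL_BYTES_SCHEMA, r.key(),
                    Schema.OPTIONAL_BYTES_SCHEMA, r.value()));
        }
        return out;
    }

    @Override
    public void stop() {
        if (consumer != null) {
            consumer.close();
        }
    }
}
```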
Envoy as Layer 4 SNI-routing proxy
For cross-site Kafka connectivity, Salesforce runs Envoy in SNI-routing mode behind a standard load balancer, giving clients a stable single endpoint for cross-data-centre connections. This avoids the need to publish per-broker advertised listeners across WAN links.
Conversation-level Kafka partitioning
For the Conversation Storage Service, messages are keyed by conversation ID, ensuring all events for a given conversation land in the same partition and preserve ordering for downstream AI systems that need to reason over conversational context in sequence.
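To see why keying gives that guarantee, the standalone sketch below reproduces the key-to-partition mapping Kafka's default partitioner applies to keyed records: the same conversation ID always hashes to the same partition. The partition count and IDs here are arbitrary.

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class ConversationPartitioning {
    public static void main(String[] args) {
        int numPartitions = 12;   // illustrative partition count
        String[] conversationIds = {"conv-1", "conv-2", "conv-1", "conv-2"};

        for (String id : conversationIds) {
            byte[] keyBytes = id.getBytes(StandardCharsets.UTF_8);
            // The same formula Kafka's default partitioner applies to keyed records:
            // a murmur2 hash of the serialised key, modulo the partition count.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println(id + " -> partition " + partition);
        }
    }
}
```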
Fault-tolerant chatbot pipeline patterns
The Service Cloud chatbot pipeline layers five fault-tolerance patterns on top of Kafka: a per-endpoint circuit breaker (OPEN/HALF_OPEN/CLOSED states), a bulkhead (concurrent request limits per endpoint), timeout-plus-retry (maximum 6 attempts over 16 hours), UUID-per-event idempotency with upsert semantics, and Kafka partition replication across availability zones.
Source: Mark Holton, Salesforce Engineering Blog
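As a rough sketch of the first of those patterns — not the chatbot pipeline's actual code — a per-endpoint circuit breaker can be expressed in a few dozen lines of Java; the thresholds and timings here are arbitrary.

```java
import java.time.Duration;
import java.time.Instant;

/** Minimal per-endpoint circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
public class CircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // consecutive failures before opening
    private final Duration openDuration;    // how long to stay OPEN before probing

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.EPOCH;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Returns true if a call to the downstream endpoint should be attempted. */
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openDuration) >= 0) {
                state = State.HALF_OPEN;   // allow a single probe request through
                return true;
            }
            return false;                  // still open: fail fast, let events wait for retry
        }
        return true;                       // CLOSED or HALF_OPEN
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}
```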
SSL contribution to open-source Kafka
Salesforce's DVA team contributed SSL support for the Kafka 0.8.2 series before the feature was part of the mainline project, and later advocated for per-topic throttling at the project level.
Source: Nishant Gupta, Salesforce Engineering Blog
4-byte embedded hash ID for deduplication
Because the log pipeline uses "at least once" delivery semantics, a 4-byte hashed unique ID is embedded in each log message so downstream consumers can detect and drop duplicates on replay.
Source: Sanjeev Sahu, Salesforce Engineering Blog
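A minimal sketch of the idea, assuming the ID is derived by hashing the message contents (the production scheme may differ): producers attach a compact 4-byte ID, and consumers keep a seen-set — or rely on upsert semantics keyed by the ID — so replays are dropped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class LogDeduplicator {

    // Remembered IDs; a real consumer would bound this (a TTL cache, or an upsert
    // keyed on the ID in the sink store) rather than let it grow forever.
    private final Set<Integer> seen = new HashSet<>();

    /** Derive a 4-byte ID from the log line (truncated SHA-256 here; any stable hash works). */
    static int hashId(String logLine) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(logLine.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(hash, 0, 4).getInt();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true the first time an ID is seen; false for replayed duplicates. */
    boolean accept(String logLine) {
        return seen.add(hashId(logLine));
    }

    public static void main(String[] args) {
        LogDeduplicator dedup = new LogDeduplicator();
        String line = "2017-09-01T00:00:00Z app01 request completed in 12ms";
        System.out.println(dedup.accept(line)); // true  (first delivery)
        System.out.println(dedup.accept(line)); // false (at-least-once replay dropped)
    }
}
```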
Operating Kafka at scale
Deployment model: Hybrid. On-premises clusters use Choria for automated patching and self-healing. Public cloud clusters run on Kubernetes. Continuous automated patching applies across the entire fleet.
Monitoring and observability stack:
Source: Lei Ye and Paul Davidson, Kafka Summit Americas 2021
Automated recovery: Salesforce built a Kafka Auto-Fix Operations Tool that detects offline or under-replicated partitions, frozen brokers, and disk failures. It identifies the most advanced replica, then runs leader election and partition reassignment automatically; a simplified sketch of the detection and election steps appears at the end of this section. The target mean time to recovery is approximately 2 minutes with a zero data loss guarantee.
Fault injection testing: The infrastructure team uses a bespoke testing framework comprising a Test Coordinator (control plane web service), Load Generator (simulates producer/consumer traffic), Ops Tools (cluster configuration), a Fault Injection Framework built on Apache Trogdor and Kibosh, and a Result Analysis component (log and metric tracking). Test scenarios cover functional, performance, load, scale, upgrade/downgrade, and patching runs on deterministic infrastructure.
Migration monitoring: During the 2025 Marketing Cloud migration, the team ran 15 custom monitoring dashboards (cluster-level to node-level) with 24/7 automated alerting throughout.
Source: Dheeraj Bansal and Ankit Jain, Salesforce Engineering Blog
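The sketch below is not Salesforce's Auto-Fix tool, only an illustration of its first two steps using the standard Kafka AdminClient API: find partitions whose ISR has shrunk or whose leader is missing, then trigger a preferred leader election for them. A real remediation tool would also weigh replica recency, broker health, and data-loss guarantees before acting.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class PartitionHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> topics =
                    admin.describeTopics(topicNames).allTopicNames().get();

            Set<TopicPartition> unhealthy = new HashSet<>();
            for (TopicDescription topic : topics.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    boolean offline = p.leader() == null;
                    boolean underReplicated = p.isr().size() < p.replicas().size();
                    if (offline || underReplicated) {
                        unhealthy.add(new TopicPartition(topic.name(), p.partition()));
                    }
                }
            }

            if (!unhealthy.isEmpty()) {
                // Ask the controller to move leadership back to the preferred replica
                // for the affected partitions. A real remediation tool would also
                // consider reassignment and the most advanced replica, as described above.
                admin.electLeaders(ElectionType.PREFERRED, unhealthy).all().get();
            }
        }
    }
}
```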
Challenges and how they solved them
MirrorMaker instability on WAN replication
Problem: Apache MirrorMaker 0.8.x was unstable and required a full process restart for any configuration change. Its single-producer model created throughput bottlenecks on internet-based WAN links between data centres.
Root cause: MirrorMaker's architecture — one process per cluster pair, one producer per process — could not scale to Salesforce's cross-data-centre replication volumes.
Solution: Salesforce built and open-sourced Mirus, a Kafka Connect-based tool with multiple independent consumer-producer pairs per worker process and dynamic REST API configuration.
Outcome: Mirus fully replaced MirrorMaker across all production data centres from April 2018.
Source: Paul Davidson, Salesforce Engineering Blog
Log pipeline completeness: from 25% to seven nines
Problem: Between December 2016 and September 2017, the log shipping pipeline could not reliably deliver logs to DeepSea. Completeness was as low as 25% at the start of the programme.
Root causes: Six distinct issues were identified: multi-threaded consumer logic couldn't handle dynamic volume changes; MapReduce fault tolerance was set too low; Logstash skipped files; Logstash processes entered zombie states; the MirrorMaker batch size (~1 MB) caused batch rejections; and "at least once" semantics produced duplicates.
Solutions: Multi-threaded consumers with dynamic volume adjustment; MapReduce fault tolerance raised to 50%; Logstash reconfigured to read from the beginning; a novel buffer replay for zombie states; MirrorMaker batch size cut from ~1 MB to 250 KB; a 4-byte embedded hash ID per log record for deduplication.
Outcome: Completeness reached 99.99999% (7 nines) by September 2017 and has been sustained consistently since.
Source: Sanjeev Sahu, Salesforce Engineering Blog
Marketing Cloud migration: 760 nodes, zero downtime
Problem: Marketing Cloud ran its own Kafka fleet — 760+ nodes, 12 clusters, 1 million messages/second — on CentOS 7 with a different Kafka version and authentication mechanism from the central Ajna platform (RHEL 9). Unifying it into Ajna required a zero-downtime migration under live production traffic.
Root cause: Accumulated divergence between Marketing Cloud's self-managed Kafka and the centralised Ajna platform.
Solution: An automated orchestration pipeline with pre-, in-flight, and post-validation checks; disk-level checksum validation before and after each step; rack-aware replica placement across separate physical racks; phased node rollout with health checks; 15 custom monitoring dashboards; and synthetic data validation post-migration with baseline signature matching.
Outcome: Zero-downtime migration completed; Marketing Cloud Kafka unified into Ajna.
Source: Dheeraj Bansal and Ankit Jain, Salesforce Engineering Blog
Read-after-write consistency at 50K concurrent conversations
Problem: At 50,000 concurrent conversations, Kafka consumers lagged behind producers, causing AI workflows to read stale conversational context immediately after a write.
Root cause: Kafka's durability-vs-latency trade-off: freshly written messages were not yet visible to consumers by the time the AI system queried for the latest state.
Solution: VegaCache, an in-memory cache, was introduced to serve recent writes directly, bypassing the consumer lag for latency-sensitive reads while keeping Kafka as the durable ordered source of truth.
Outcome: Read-after-write consistency maintained at scale.
Source: Ashima Kochar and Deepak Mali, Salesforce Engineering Blog
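A simplified sketch of that pattern — names and structure are illustrative, not CSS internals: the write path updates an in-memory cache and produces to Kafka; the read path consults the cache first and falls back to whatever the (possibly lagging) consumer has materialised.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Write-through cache bridging consumer lag for read-after-write consistency. */
public class ConversationContextStore {

    // Stand-in for VegaCache: the most recent context per conversation, held in
    // memory. A real cache would evict entries (TTL or size bound).
    private final Map<String, String> recentWrites = new ConcurrentHashMap<>();

    // Stand-in for the store populated by the (possibly lagging) Kafka consumer.
    private final Map<String, String> materialisedStore = new ConcurrentHashMap<>();

    private final KafkaProducer<String, String> producer;

    public ConversationContextStore(Properties producerProps) {
        this.producer = new KafkaProducer<>(producerProps);
    }

    /** Write path: cache first for immediate reads, then Kafka as the durable, ordered log. */
    public void appendContext(String conversationId, String context) {
        recentWrites.put(conversationId, context);
        producer.send(new ProducerRecord<>("conversation-context", conversationId, context));
    }

    /** Read path: serve fresh writes from the cache, fall back to the materialised store. */
    public Optional<String> latestContext(String conversationId) {
        String cached = recentWrites.get(conversationId);
        if (cached != null) {
            return Optional.of(cached);
        }
        return Optional.ofNullable(materialisedStore.get(conversationId));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        ConversationContextStore store = new ConversationContextStore(props);
        store.appendContext("conv-42", "customer asked about a refund");
        // The consumer may not have caught up yet, but the read is still fresh.
        System.out.println(store.latestContext("conv-42").orElse("(none)"));
    }
}
```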
Agentforce audit: bursty traffic and Data Cloud ingestion mismatch
Problem: AI agent traffic is highly bursty (business-hour spikes); Data Cloud's ingestion API expected customer-owned S3 buckets, but Agentforce needed Salesforce-controlled S3 buckets — a file-size and ownership mismatch that blocked the integration.
Root cause: Architectural mismatch between Data Cloud's design assumptions and Agentforce's multi-tenant security model.
Solution: Kafka's pub-sub model absorbs traffic spikes before handoff to Data Cloud; dynamic flow control mechanisms handle real-time traffic adjustment; iterative proof-of-concept work resolved the S3 ownership problem.
Outcome: The system supports 500+ enterprise customers with 20 million model interactions per month under a 30-day contractual audit retention window.
Source: Madhavi Kavathekar, Salesforce Engineering Blog
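The consumer-side half of that flow control can be sketched with Kafka's standard pause/resume API — this is an illustration, not the Agentforce pipeline's code: pause the assigned partitions while the downstream handoff is saturated, and resume once it drains.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AuditEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "audit-pipeline-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Bounded buffer standing in for the downstream ingestion step.
        BlockingQueue<String> downstream = new ArrayBlockingQueue<>(10_000);

        // Simulated handoff worker draining the buffer (e.g. batching to the sink).
        Thread drainer = new Thread(() -> {
            while (true) {
                try {
                    downstream.take();   // hand off to the downstream system here
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        drainer.setDaemon(true);
        drainer.start();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("agent-audit-events"));
            boolean paused = false;

            while (true) {
                // Polling while paused keeps the consumer in the group without fetching.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    downstream.offer(record.value());   // a real pipeline would block or spill if full
                }

                // Back-pressure: stop fetching while the buffer is nearly full,
                // resume once the drainer has caught up.
                if (!paused && downstream.remainingCapacity() < 1_000) {
                    consumer.pause(consumer.assignment());
                    paused = true;
                } else if (paused && downstream.remainingCapacity() > 5_000) {
                    consumer.resume(consumer.paused());
                    paused = false;
                }
            }
        }
    }
}
```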
Full tech stack
Key contributors
Key takeaways for your own Kafka implementation
- Build for replication from the start. Salesforce outgrew Apache MirrorMaker's single-producer model once WAN replication volumes increased. If cross-datacenter or cross-region replication is in your roadmap, evaluate whether your replication tool can handle dynamic configuration changes without a process restart and whether its throughput model scales with your partition count.
- Log pipeline completeness requires end-to-end measurement. Salesforce's log pipeline sat at 25% completeness for months before a dedicated completeness-tracking service (the Fidelity Assessment Service) made the problem visible end-to-end. Instrumenting each stage independently is insufficient; you need a view that correlates events from ingest to sink.
- "At least once" delivery requires an idempotency strategy. At Salesforce's scale, duplicates from at-least-once delivery are not hypothetical edge cases. Embedding a hashed unique ID per message and using upsert semantics at the consumer is a concrete, low-overhead approach that avoids the complexity of exactly-once transaction semantics in many pipeline architectures.
- Consumer lag is a first-class consistency concern for stateful workloads. The Conversation Storage Service's read-after-write consistency problem is common when Kafka is used as a state transport for AI or event-sourcing workloads. An in-memory cache for recent writes, rather than tightening consumer lag SLOs on the Kafka cluster itself, can resolve the issue with less operational risk.
- Invest in automated recovery tooling before you need it. Salesforce built its Kafka Auto-Fix Operations Tool to achieve a ~2-minute MTTR for partition and broker failures. At 2,500+ brokers, manual recovery runbooks cannot achieve this. Automating leader election and partition reassignment for predictable failure modes is worth the investment before the fleet reaches a size where manual intervention becomes the bottleneck.
Sources and further reading
If you are running Kafka at scale and want visibility into consumer lag, partition health, and broker performance across your clusters, Kpow is a Kafka management and monitoring tool built for platform and data engineering teams. You can try it free for 30 days and connect it to any Kafka cluster in minutes via Docker, Helm, or JAR.