
How Salesforce uses Apache Kafka in production
Salesforce has published more detail about its Kafka operations than most companies its size, which makes its engineering blog an unusually complete record of what it takes to run Apache Kafka at genuine enterprise scale. By 2021, the company's fleet had grown to 100+ clusters, 2,500+ brokers, 50,000+ topics, 300,000+ partitions, 40+ PB of storage, and throughput exceeding 3 trillion events per day — a figure that had been doubling annually.
The engineering problem Kafka solves at Salesforce is not a single use case but a platform-level one: how do you provide a unified, reliable event transport across a multi-tenant SaaS estate that spans hundreds of global data centres, hundreds of thousands of customer orgs, and applications ranging from operational metrics to real-time AI?
Company overview
Salesforce is a cloud-based CRM and enterprise software company founded in 1999 and headquartered in San Francisco. It serves more than 150,000 businesses globally across its Sales Cloud, Service Cloud, Marketing Cloud, and Agentforce AI product lines.
Salesforce adopted Kafka in production no later than May 2014, when it hosted an Apache Kafka edition of the SF Logging Meetup at its headquarters. By June 2014, Rajasekar Elango, a lead developer on the Monitoring and Management Team, was presenting the company's early Kafka setup at a public Kafka Meetup — a five-broker cluster coordinated by a three-node ZooKeeper ensemble, with SSL/TLS mutual authentication and Avro serialisation.
The trigger for adoption was operational observability: the DVA (Diagnostics, Visibility, and Analytics) team needed a scalable transport layer for metrics and logs across a growing global data centre footprint.
Salesforce's Kafka use cases
Salesforce's use cases cover the full breadth of what Kafka is typically used for in a large enterprise, from infrastructure telemetry to customer-facing products to AI pipelines.
Operational metrics and system health (project Ajna)
The DVA team built an internal Kafka-as-a-Service platform code-named Ajna to transport CPU metrics, machine reachability data, system logs, network flow data, application logs, JMX metrics, and custom database metrics across all global data centres to near-real-time dashboards. Ajna is the backbone from which most other Kafka use cases at Salesforce have grown.
Source: Nishant Gupta, Salesforce Engineering Blog
Log shipping pipeline
The Infrastructure group built a next-generation log pipeline on top of Ajna, routing logs from local per-data-centre Kafka clusters through cross-WAN replication to an aggregate cluster in the Secure Zone, and from there into DeepSea (an internal Hadoop/HDFS store) via a MapReduce consumer job called Kafka Camus. The target was five-nines completeness; the pipeline eventually reached seven nines.
Source: Sanjeev Sahu, Salesforce Engineering Blog
Platform Events and Change Data Capture
Salesforce's Platform Events and Change Data Capture products expose a time-ordered, immutable event stream to hundreds of thousands of multi-tenant customer orgs. Kafka's log-based storage model directly inspired the design: events are offset-based, durable, and replayable. Serialisation uses Apache Avro, with event definitions driven by Salesforce's metadata layer.
Source: Alexey Syomichev, Salesforce Engineering Blog
Real-time ML insights for Einstein Activity Capture
The Activity Platform ingests 100+ million customer interactions per day through Kafka. A chain of Kafka Streams jobs processes the stream: a Gatekeeper spam filter passes clean events to a series of extractors (TensorFlow-based, regex/keyword, Spark ML, and Einstein Conversation Insights), which feed a transformer and then a persistor that writes sales activity insights to Cassandra.
Source: Rohit Deshpande, Salesforce Engineering Blog
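A minimal Kafka Streams sketch of that shape — a filtering gatekeeper, a mapping stage standing in for the extractors, a transformer, and a terminal sink — might look like the following. Topic names, the spam heuristic, and the extraction logic are placeholders for illustration, not the Activity Platform's actual code.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ActivityInsightsTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw customer interactions (emails, meetings, calls) arrive on one topic.
        KStream<String, String> interactions = builder.stream("raw-interactions");

        interactions
                // "Gatekeeper" stage: drop spam before any expensive processing.
                .filter((key, value) -> !looksLikeSpam(value))
                // Extractor stage: pull out candidate insights (stand-in for the
                // TensorFlow / regex / Spark ML extractors in the real pipeline).
                .mapValues(ActivityInsightsTopology::extractInsight)
                // Transformer stage: normalise into the downstream insight format.
                .mapValues(insight -> insight.trim().toLowerCase())
                // Persistor stage: the real pipeline writes to Cassandra;
                // this sketch just publishes to an output topic.
                .to("sales-activity-insights");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-insights-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean looksLikeSpam(String value) {
        return value != null && value.contains("unsubscribe");   // toy heuristic
    }

    private static String extractInsight(String value) {
        return "insight:" + value;                                // toy extraction
    }
}
```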
Service Cloud chatbot audit and debug
Apache Kafka on Heroku backs the chatbot event pipeline for Service Cloud, targeting tens of millions of debug events daily with up to 7-day retention. The pipeline applies five fault-tolerance patterns on top of Kafka, described further in the special techniques section below.
Source: Mark Holton, Salesforce Engineering Blog
Hyperforce perimeter telemetry
Salesforce's Hyperforce infrastructure generates 60 TB of raw perimeter telemetry daily from commercial CDNs. Spark Streaming normalises this data into protobuf format; Kafka then carries it to an Imply/Druid hypercube (18 dimensions, 13 measurements) serving real-time debugging dashboards for live-site incidents, latency analysis, and DDoS detection with sub-second query latency and 2-minute data freshness.
Source: Srinivas Ranganathan, Salesforce Engineering Blog
Conversational AI context storage for Agentforce
The Conversation Storage Service (CSS) uses Kafka for buffering, batching, and ordering conversational context that powers real-time AI systems — sentiment analysis, agent assist, and supervisor insights — across 50,000+ concurrent conversations at a peak ingestion rate of 30,000+ events per minute. The scaling target is 100,000 concurrent conversations.
Source: Ashima Kochar and Deepak Mali, Salesforce Engineering Blog
Agentforce AI agent audit trail
Kafka pub-sub ingestion handles the variable, spike-prone traffic from 20 million model interactions per month across 500+ enterprise customers, feeding a 30-day audit data retention pipeline backed by Salesforce Data Cloud and S3.
Source: Madhavi Kavathekar, Salesforce Engineering Blog
Cross-cluster replication
Salesforce replaced Apache MirrorMaker with its own open-source tool, Mirus, for cross-data-centre replication. Mirus is built on the Kafka Connect source connector API and has been in production since April 2018.
Source: Paul Davidson, Salesforce Engineering Blog
Scale and throughput
The fleet-level figures quoted in the introduction — 100+ clusters, 2,500+ brokers, 50,000+ topics, 300,000+ partitions, 40+ PB of storage, and 3+ trillion events per day — come from the Kafka Summit Americas 2021 talk by Lei Ye and Paul Davidson. Per-system figures are drawn from the individual engineering blog posts cited throughout this article.
Salesforce's Kafka architecture
Ajna: Kafka-as-a-Service
Ajna is Salesforce's internal Kafka platform. Each production data centre runs an "Ajna Local" cluster for ingest. Data replicates over WAN to an "Ajna Aggregate" cluster in a Secure Zone (DMZ), from which consumers — Argus (time series), DeepSea (Hadoop/HDFS), and stream-processing applications — pull downstream. Before April 2018, replication used Apache MirrorMaker; since then it has used Mirus.
The platform runs on a hybrid deployment model: on-premises clusters orchestrated with Choria (self-healing automation) and public cloud clusters deployed on Kubernetes. Continuous automated patching is applied across the fleet.
Cross-site connectivity
Rather than using Kafka's advertised listeners over a VPN, Salesforce implemented a "Native Kafka Endpoint" using a standard load balancer fronting Envoy running in Layer 4 SNI-routing mode. This gives clients a stable single endpoint for cross-data-centre Kafka connections without VPN tunnels.
Source: Lei Ye and Paul Davidson, Kafka Summit Americas 2021
Topic design and schema management
Platform Events and Change Data Capture events use Apache Avro serialisation, with event definitions driven by Salesforce's metadata layer. For the Conversation Storage Service, messages are keyed by conversation ID, ensuring all events for a given conversation land in the same partition and preserve ordering for downstream AI systems.
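As a rough illustration of that combination — Avro-serialised values keyed by conversation ID — the sketch below shows a plain Java producer. The schema, topic name, and field names are invented for the example; in the real platform, event definitions come from Salesforce's metadata layer rather than a hand-written schema.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.ByteArrayOutputStream;
import java.util.Properties;

public class ConversationEventProducer {

    // Hypothetical Avro schema for a conversation event; fields are illustrative only.
    private static final Schema SCHEMA = new Schema.Parser().parse("""
        {"type":"record","name":"ConversationEvent","fields":[
          {"name":"conversationId","type":"string"},
          {"name":"utterance","type":"string"},
          {"name":"timestamp","type":"long"}]}""");

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            GenericRecord event = new GenericData.Record(SCHEMA);
            event.put("conversationId", "conv-42");
            event.put("utterance", "Hello, I need help with my order.");
            event.put("timestamp", System.currentTimeMillis());

            // Encode the record to Avro binary (no schema registry in this sketch).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
            encoder.flush();

            // Keying by conversation ID sends every event for this conversation
            // to the same partition, so consumers see them in order.
            producer.send(new ProducerRecord<>("conversation-events", "conv-42", out.toByteArray()));
        }
    }
}
```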
Producer architecture
The Agentforce audit pipeline uses Kafka's pub-sub model to absorb bursty, business-hour traffic spikes before handing off to Data Cloud. Dynamic flow control mechanisms handle real-time traffic adjustment in that pipeline.
Consumer architecture
For the Conversation Storage Service, Salesforce applies curated consumer lag limits to maintain ordering guarantees for real-time AI workflows. Where consumer lag creates read-after-write consistency problems (at 50K concurrent conversations), an in-memory cache (VegaCache) bridges the gap rather than forcing changes to the Kafka consumer configuration.
Stream processing
The Einstein Activity Capture pipeline relies entirely on Kafka Streams for its ML processing chain. The application team chose Kafka Streams specifically because upgrades can be managed without coordinating with the infrastructure team — unlike the prior Storm-based pipeline.
Source: Rohit Deshpande, Salesforce Engineering Blog
Special techniques and engineering innovations
Mirus: custom cross-cluster replication
Mirus is Salesforce's open-source replacement for Apache MirrorMaker, built on the Kafka Connect source connector API. Each MirusSourceTask runs an independent KafkaConsumer/KafkaProducer pair for a subset of partitions, enabling higher throughput over internet links than MirrorMaker's single-producer model. A KafkaMonitor thread tracks partition changes dynamically, and configuration changes can be applied through the Connect REST API without a process restart. The tool also exposes custom JMX metrics for replication lag.
Source: Paul Davidson, engineering.salesforce.com and github.com/salesforce/mirus
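To make the Connect-based design concrete, here is a heavily simplified, hypothetical source task built on the same Kafka Connect source connector API — not Mirus's actual code. It wraps a KafkaConsumer on the source cluster and returns records for the Connect worker to produce to the destination cluster; the configuration keys are invented for the sketch.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SimpleMirrorSourceTask extends SourceTask {

    private KafkaConsumer<byte[], byte[]> consumer;

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // "source.bootstrap.servers", "source.group.id" and "source.topic" are
        // invented configuration keys for this sketch, not Mirus's real config.
        Properties cfg = new Properties();
        cfg.put("bootstrap.servers", props.get("source.bootstrap.servers"));
        cfg.put("group.id", props.get("source.group.id"));
        cfg.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cfg.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumer = new KafkaConsumer<>(cfg);
        consumer.subscribe(List.of(props.get("source.topic")));
    }

    @Override
    public List<SourceRecord> poll() {
        // Consume from the source cluster; in this simplified sketch the Connect
        // worker's producer writes the returned records to the destination cluster.
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        List<SourceRecord> out = new ArrayList<>();
        for (ConsumerRecord<byte[], byte[]> r : records) {
            out.add(new SourceRecord(
                    Map.of("topic", r.topic(), "partition", r.partition()), // source partition
                    Map.of("offset", r.offset()),                           // source offset
                    r.topic(),                                              // destination topic (same name)
                    null,                                                   // let the destination pick a partition
                    Schema.OPTIONAL_BYTES_SCHEMA, r.key(),
                    Schema.OPTIONAL_BYTES_SCHEMA, r.value()));
        }
        return out;
    }

    @Override
    public void stop() {
        if (consumer != null) {
            consumer.close();
        }
    }
}
```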
Envoy as Layer 4 SNI-routing proxy
For cross-site Kafka connectivity, Salesforce runs Envoy in SNI-routing mode behind a standard load balancer, giving clients a stable single endpoint for cross-data-centre connections. This avoids the need to publish per-broker advertised listeners across WAN links.
Conversation-level Kafka partitioning
For the Conversation Storage Service, messages are keyed by conversation ID, ensuring all events for a given conversation land in the same partition and preserve ordering for downstream AI systems that need to reason over conversational context in sequence.
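To see why keying gives that guarantee, the standalone sketch below reproduces the key-to-partition mapping Kafka's default partitioner applies to keyed records: the same conversation ID always hashes to the same partition. The partition count and IDs here are arbitrary.

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class ConversationPartitioning {
    public static void main(String[] args) {
        int numPartitions = 12;   // illustrative partition count
        String[] conversationIds = {"conv-1", "conv-2", "conv-1", "conv-2"};

        for (String id : conversationIds) {
            byte[] keyBytes = id.getBytes(StandardCharsets.UTF_8);
            // The same formula Kafka's default partitioner applies to keyed records:
            // a murmur2 hash of the serialised key, modulo the partition count.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println(id + " -> partition " + partition);
        }
    }
}
```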
Fault-tolerant chatbot pipeline patterns
The Service Cloud chatbot pipeline layers five fault-tolerance patterns on top of Kafka: a per-endpoint circuit breaker (OPEN/HALF_OPEN/CLOSED states), a bulkhead (concurrent request limits per endpoint), timeout-plus-retry (maximum 6 attempts over 16 hours), UUID-per-event idempotency with upsert semantics, and Kafka partition replication across availability zones.
Source: Mark Holton, Salesforce Engineering Blog
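As a rough sketch of the first of those patterns — not the chatbot pipeline's actual code — a per-endpoint circuit breaker can be expressed in a few dozen lines of Java; the thresholds and timings here are arbitrary.

```java
import java.time.Duration;
import java.time.Instant;

/** Minimal per-endpoint circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
public class CircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // consecutive failures before opening
    private final Duration openDuration;    // how long to stay OPEN before probing

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.EPOCH;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Returns true if a call to the downstream endpoint should be attempted. */
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openDuration) >= 0) {
                state = State.HALF_OPEN;   // allow a single probe request through
                return true;
            }
            return false;                  // still open: fail fast, let events wait for retry
        }
        return true;                       // CLOSED or HALF_OPEN
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}
```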
SSL contribution to open-source Kafka
Salesforce's DVA team contributed SSL support for the Kafka 0.8.2 series before the feature was part of the mainline project, and later advocated for per-topic throttling at the project level.
Source: Nishant Gupta, Salesforce Engineering Blog
4-byte embedded hash ID for deduplication
Because the log pipeline uses "at least once" delivery semantics, a 4-byte hashed unique ID is embedded in each log message so downstream consumers can detect and drop duplicates on replay.
Source: Sanjeev Sahu, Salesforce Engineering Blog
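A minimal sketch of the idea, assuming the ID is derived by hashing the message contents (the production scheme may differ): producers attach a compact 4-byte ID, and consumers keep a seen-set — or rely on upsert semantics keyed by the ID — so replays are dropped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class LogDeduplicator {

    // Remembered IDs; a real consumer would bound this (a TTL cache, or an upsert
    // keyed on the ID in the sink store) rather than let it grow forever.
    private final Set<Integer> seen = new HashSet<>();

    /** Derive a 4-byte ID from the log line (truncated SHA-256 here; any stable hash works). */
    static int hashId(String logLine) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(logLine.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(hash, 0, 4).getInt();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true the first time an ID is seen; false for replayed duplicates. */
    boolean accept(String logLine) {
        return seen.add(hashId(logLine));
    }

    public static void main(String[] args) {
        LogDeduplicator dedup = new LogDeduplicator();
        String line = "2017-09-01T00:00:00Z app01 request completed in 12ms";
        System.out.println(dedup.accept(line)); // true  (first delivery)
        System.out.println(dedup.accept(line)); // false (at-least-once replay dropped)
    }
}
```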
Operating Kafka at scale
Deployment model: Hybrid. On-premises clusters use Choria for automated patching and self-healing. Public cloud clusters run on Kubernetes. Continuous automated patching applies across the entire fleet.
Monitoring and observability stack:
Source: Lei Ye and Paul Davidson, Kafka Summit Americas 2021
Automated recovery: Salesforce built a Kafka Auto-Fix Operations Tool that detects offline or under-replicated partitions, frozen brokers, and disk failures. It identifies the most advanced replica, then runs leader election and partition reassignment automatically; a simplified sketch of the detection and election steps appears at the end of this section. The target mean time to recovery is approximately 2 minutes with a zero data loss guarantee.
Fault injection testing: The infrastructure team uses a bespoke testing framework comprising a Test Coordinator (control plane web service), Load Generator (simulates producer/consumer traffic), Ops Tools (cluster configuration), a Fault Injection Framework built on Apache Trogdor and Kibosh, and a Result Analysis component (log and metric tracking). Test scenarios cover functional, performance, load, scale, upgrade/downgrade, and patching runs on deterministic infrastructure.
Migration monitoring: During the 2025 Marketing Cloud migration, the team ran 15 custom monitoring dashboards (cluster-level to node-level) with 24/7 automated alerting throughout.
Source: Dheeraj Bansal and Ankit Jain, Salesforce Engineering Blog
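The sketch below is not Salesforce's Auto-Fix tool, only an illustration of its first two steps using the standard Kafka AdminClient API: find partitions whose ISR has shrunk or whose leader is missing, then trigger a preferred leader election for them. A real remediation tool would also weigh replica recency, broker health, and data-loss guarantees before acting.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class PartitionHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> topics =
                    admin.describeTopics(topicNames).allTopicNames().get();

            Set<TopicPartition> unhealthy = new HashSet<>();
            for (TopicDescription topic : topics.values()) {
                for (TopicPartitionInfo p : topic.partitions()) {
                    boolean offline = p.leader() == null;
                    boolean underReplicated = p.isr().size() < p.replicas().size();
                    if (offline || underReplicated) {
                        unhealthy.add(new TopicPartition(topic.name(), p.partition()));
                    }
                }
            }

            if (!unhealthy.isEmpty()) {
                // Ask the controller to move leadership back to the preferred replica
                // for the affected partitions. A real remediation tool would also
                // consider reassignment and the most advanced replica, as described above.
                admin.electLeaders(ElectionType.PREFERRED, unhealthy).all().get();
            }
        }
    }
}
```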
Challenges and how they solved them
MirrorMaker instability on WAN replication
Problem: Apache MirrorMaker 0.8.x was unstable and required a full process restart for any configuration change. Its single-producer model created throughput bottlenecks on internet-based WAN links between data centres.
Root cause: MirrorMaker's architecture — one process per cluster pair, one producer per process — could not scale to Salesforce's cross-data-centre replication volumes.
Solution: Salesforce built and open-sourced Mirus, a Kafka Connect-based tool with multiple independent consumer-producer pairs per worker process and dynamic REST API configuration.
Outcome: Mirus fully replaced MirrorMaker across all production data centres from April 2018.
Source: Paul Davidson, Salesforce Engineering Blog
Log pipeline completeness: from 25% to seven nines
Problem: Between December 2016 and September 2017, the log shipping pipeline could not reliably deliver logs to DeepSea. Completeness was as low as 25% at the start of the programme.
Root causes: Six distinct issues were identified: multi-threaded consumer logic couldn't handle dynamic volume changes; MapReduce fault tolerance was set too low; Logstash skipped files; Logstash processes entered zombie states; the MirrorMaker batch size (~1 MB) caused batch rejections; and "at least once" semantics produced duplicates.
Solutions: Multi-threaded consumers with dynamic volume adjustment; MapReduce fault tolerance raised to 50%; Logstash reconfigured to read from the beginning; a novel buffer replay for zombie states; MirrorMaker batch size cut from ~1 MB to 250 KB; a 4-byte embedded hash ID per log record for deduplication.
Outcome: Completeness reached 99.99999% (7 nines) by September 2017 and has been sustained consistently since.
Source: Sanjeev Sahu, Salesforce Engineering Blog
Marketing Cloud migration: 760 nodes, zero downtime
Problem: Marketing Cloud ran its own Kafka fleet — 760+ nodes, 12 clusters, 1 million messages/second — on CentOS 7 with a different Kafka version and authentication mechanism from the central Ajna platform (RHEL 9). Unifying it into Ajna required a zero-downtime migration under live production traffic.
Root cause: Accumulated divergence between Marketing Cloud's self-managed Kafka and the centralised Ajna platform.
Solution: An automated orchestration pipeline with pre-, in-flight, and post-validation checks; disk-level checksum validation before and after each step; rack-aware replica placement across separate physical racks; phased node rollout with health checks; 15 custom monitoring dashboards; and synthetic data validation post-migration with baseline signature matching.
Outcome: Zero-downtime migration completed; Marketing Cloud Kafka unified into Ajna.
Source: Dheeraj Bansal and Ankit Jain, Salesforce Engineering Blog
Read-after-write consistency at 50K concurrent conversations
Problem: At 50,000 concurrent conversations, Kafka consumers lagged behind producers, causing AI workflows to read stale conversational context immediately after a write.
Root cause: Kafka's durability-vs-latency trade-off: freshly written messages were not yet visible to consumers by the time the AI system queried for the latest state.
Solution: VegaCache, an in-memory cache, was introduced to serve recent writes directly, bypassing the consumer lag for latency-sensitive reads while keeping Kafka as the durable ordered source of truth.
Outcome: Read-after-write consistency maintained at scale.
Source: Ashima Kochar and Deepak Mali, Salesforce Engineering Blog
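A simplified sketch of that pattern — names and structure are illustrative, not CSS internals: the write path updates an in-memory cache and produces to Kafka; the read path consults the cache first and falls back to whatever the (possibly lagging) consumer has materialised.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Write-through cache bridging consumer lag for read-after-write consistency. */
public class ConversationContextStore {

    // Stand-in for VegaCache: the most recent context per conversation, held in
    // memory. A real cache would evict entries (TTL or size bound).
    private final Map<String, String> recentWrites = new ConcurrentHashMap<>();

    // Stand-in for the store populated by the (possibly lagging) Kafka consumer.
    private final Map<String, String> materialisedStore = new ConcurrentHashMap<>();

    private final KafkaProducer<String, String> producer;

    public ConversationContextStore(Properties producerProps) {
        this.producer = new KafkaProducer<>(producerProps);
    }

    /** Write path: cache first for immediate reads, then Kafka as the durable, ordered log. */
    public void appendContext(String conversationId, String context) {
        recentWrites.put(conversationId, context);
        producer.send(new ProducerRecord<>("conversation-context", conversationId, context));
    }

    /** Read path: serve fresh writes from the cache, fall back to the materialised store. */
    public Optional<String> latestContext(String conversationId) {
        String cached = recentWrites.get(conversationId);
        if (cached != null) {
            return Optional.of(cached);
        }
        return Optional.ofNullable(materialisedStore.get(conversationId));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        ConversationContextStore store = new ConversationContextStore(props);
        store.appendContext("conv-42", "customer asked about a refund");
        // The consumer may not have caught up yet, but the read is still fresh.
        System.out.println(store.latestContext("conv-42").orElse("(none)"));
    }
}
```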
Agentforce audit: bursty traffic and Data Cloud ingestion mismatch
Problem: AI agent traffic is highly bursty (business-hour spikes); Data Cloud's ingestion API expected customer-owned S3 buckets, but Agentforce needed Salesforce-controlled S3 buckets — a file-size and ownership mismatch that blocked the integration.
Root cause: Architectural mismatch between Data Cloud's design assumptions and Agentforce's multi-tenant security model.
Solution: Kafka's pub-sub model absorbs traffic spikes before handoff to Data Cloud; dynamic flow control mechanisms handle real-time traffic adjustment; iterative proof-of-concept work resolved the S3 ownership problem.
Outcome: The system supports 500+ enterprise customers with 20 million model interactions per month under a 30-day contractual audit retention window.
Source: Madhavi Kavathekar, Salesforce Engineering Blog
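The consumer-side half of that flow control can be sketched with Kafka's standard pause/resume API — this is an illustration, not the Agentforce pipeline's code: pause the assigned partitions while the downstream handoff is saturated, and resume once it drains.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AuditEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "audit-pipeline-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Bounded buffer standing in for the downstream ingestion step.
        BlockingQueue<String> downstream = new ArrayBlockingQueue<>(10_000);

        // Simulated handoff worker draining the buffer (e.g. batching to the sink).
        Thread drainer = new Thread(() -> {
            while (true) {
                try {
                    downstream.take();   // hand off to the downstream system here
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        drainer.setDaemon(true);
        drainer.start();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("agent-audit-events"));
            boolean paused = false;

            while (true) {
                // Polling while paused keeps the consumer in the group without fetching.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    downstream.offer(record.value());   // a real pipeline would block or spill if full
                }

                // Back-pressure: stop fetching while the buffer is nearly full,
                // resume once the drainer has caught up.
                if (!paused && downstream.remainingCapacity() < 1_000) {
                    consumer.pause(consumer.assignment());
                    paused = true;
                } else if (paused && downstream.remainingCapacity() > 5_000) {
                    consumer.resume(consumer.paused());
                    paused = false;
                }
            }
        }
    }
}
```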
Full tech stack
Key contributors
Key takeaways for your own Kafka implementation
- Build for replication from the start. Salesforce outgrew Apache MirrorMaker's single-producer model once WAN replication volumes increased. If cross-datacenter or cross-region replication is in your roadmap, evaluate whether your replication tool can handle dynamic configuration changes without a process restart and whether its throughput model scales with your partition count.
- Log pipeline completeness requires end-to-end measurement. Salesforce's log pipeline sat at 25% completeness for months before a dedicated completeness-tracking service (the Fidelity Assessment Service) made the problem visible end-to-end. Instrumenting each stage independently is insufficient; you need a view that correlates events from ingest to sink.
- "At least once" delivery requires an idempotency strategy. At Salesforce's scale, duplicates from at-least-once delivery are not hypothetical edge cases. Embedding a hashed unique ID per message and using upsert semantics at the consumer is a concrete, low-overhead approach that avoids the complexity of exactly-once transaction semantics in many pipeline architectures.
- Consumer lag is a first-class consistency concern for stateful workloads. The Conversation Storage Service's read-after-write consistency problem is common when Kafka is used as a state transport for AI or event-sourcing workloads. An in-memory cache for recent writes, rather than tightening consumer lag SLOs on the Kafka cluster itself, can resolve the issue with less operational risk.
- Invest in automated recovery tooling before you need it. Salesforce built its Kafka Auto-Fix Operations Tool to achieve a ~2-minute MTTR for partition and broker failures. At 2,500+ brokers, manual recovery runbooks cannot achieve this. Automating leader election and partition reassignment for predictable failure modes is worth the investment before the fleet reaches a size where manual intervention becomes the bottleneck.
Sources and further reading
If you are running Kafka at scale and want visibility into consumer lag, partition health, and broker performance across your clusters, Kpow is a Kafka management and monitoring tool built for platform and data engineering teams. You can try it free for 30 days and connect it to any Kafka cluster in minutes via Docker, Helm, or JAR.