
How Walmart uses Apache Kafka in production

May 23rd, 2026


Walmart has been running Apache Kafka in production since at least 2016, and the scale of the deployment today reflects a decade of architectural iteration. As of June 2024, Walmart's Kafka infrastructure processes trillions of messages per day, serves 25,000+ consumers across public and private clouds, and maintains 99.99% availability. The engineering problems Kafka solves at Walmart range from near-real-time search indexing and replenishment planning across 5,000+ stores to fraud detection on every online transaction and a Customer Data Platform ingesting around 40 billion events per day.

Company overview

Walmart is the world's largest retailer by revenue, operating roughly 10,500 stores across 19 countries and a significant e-commerce business. Its technology organisation, Walmart Global Tech, manages the platform infrastructure that supports in-store systems, e-commerce, supply chain, and customer-facing applications simultaneously.

Kafka arrived at Walmart before 2016, initially on shared bare-metal clusters that replaced Apache Flume for log collection and tracking-pixel feeds. By late 2016, engineers Ning Zhang and Anil Kumar had documented the first architectural shift: moving from shared clusters to a self-service model in which teams deployed dedicated clusters via OneOps, Walmart's internal OpenStack-based platform. That shift established the pattern Walmart has built on ever since: many purpose-built Kafka clusters rather than a single shared monolith.

Key milestones

Pre-2016: Kafka deployed on shared bare-metal clusters, replacing Flume for log collection and tracking pixel ingestion
Oct 2016: Kafka becomes the backbone for the near-real-time search index and hundreds of microservices in the item-setup pipeline
Nov 2016: Migration to self-service dedicated clusters via OneOps; 7 datacentres, 5 local clusters, 2 aggregation clusters
Nov 2017: Kafka + Apache Druid analytics cluster reaches roughly 1 billion events per day
Jun 2019: Inventory system ingesting 500 million events per day across 5,000+ stores; Deepak Goyal presents Kafka Streams extensions at Kafka Summit NYC
2020–2021: Navinder Pal Singh Brar presents KIP-535 and KIP-562 at Kafka Summit; both merged into Apache Kafka
May 2022: Replenishment system reaches 85 GB/min throughput across 18 brokers processing close to 100 million SKUs in under 3 hours
Jul 2022: Cassandra CDC pipeline with Debezium and Redis Bloom filters deployed at 60,000 changes per second
Jun 2024: Messaging Proxy Service launched; trillions of messages per day, 25,000+ consumers, 99.99% availability

Walmart's Kafka use cases

Walmart uses Kafka across multiple systems, each solving a distinct operational problem. The use cases described here are drawn from published engineering posts by named Walmart engineers between 2016 and 2024.

Near-real-time search index

Anil Kumar, a Global eCommerce Engineer at Walmart Labs, described Kafka as "the backbone for our New Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds" in a 2016 Confluent post. The system processes billions of pricing and inventory updates per day. Before Kafka, search index latency was measured in hours; the Kafka-backed pipeline reduced that to seconds.

Item-setup pipeline and microservices backbone

Kafka connects hundreds of microservices in Walmart's item-setup pipeline, covering product normalisation, offers validation, pricing, inventory, and logistics data. Teams in different geographies operate autonomously because Kafka decouples producers from consumers: each team publishes to its own topics without coordinating with downstream consumers. The architecture uses many small clusters with hundreds of topics rather than one shared cluster, limiting the blast radius of any individual failure.

Real-time inventory management

Suman Pattnaik, Director of Engineering at Walmart, described a real-time inventory system in a 2020 Confluent post that replaced batch-based inventory processes. The system ingests from more than 10 sources of event streaming data and maintains a denormalised canonical view of inventory positions across stores, e-commerce, and distribution centres.

Real-time replenishment

The replenishment platform, described by Pattnaik in a 2022 Confluent post, spans 5,000+ stores, 150+ distribution centres, and 1,000+ vendors across 24+ countries. Change data capture feeds a denormalised view that passes through a planning engine holding inventory positions, forecasts, and safety stocks. The system processed close to 100 million SKUs in under three hours at 85 GB/min throughput as of 2022.

Customer Data Platform (CDP)

The Customer Backbone team, led by Navinder Pal Singh Brar (Staff Engineer) and Deepak Goyal, built a multi-tenant platform on Kafka Streams that ingests customer activity events (search, add-to-cart, transactions), builds customer identity and state using RocksDB, and triggers ML models (bid models, fraud detection, omnichannel reorder) per-event or in batch. The CDP processes around 40 billion events per day and serves processed knowledge at under 10 ms (95th-percentile) latency.

Fraud detection

Walmart runs fraud detection on every online transaction. The fraud detection application uses Kafka Streams and was the primary driver behind Walmart's open-source contributions KIP-535 and KIP-562 — described in detail in the challenges section below.

Event stream analytics

A Druid-based analytics cluster fed from Kafka via Apache Storm ingests roughly 1 billion events per day (2 TB of raw data) and serves OLAP queries with sub-second latency, replacing Hadoop/Hive/Presto workflows that previously took hours. Amaresh Nayak documented this pipeline on the Walmart Global Tech Blog in November 2017.

Cassandra CDC

Scott Harvester and Nitin Chhabra documented a pipeline in July 2022 that uses Debezium to capture change data from Cassandra 4.x and publish it to Kafka, supporting high-volume e-commerce databases that generate 60,000 data changes per second.

Scale and throughput

Messages per day (platform-wide, 2024): Trillions (Ravinder Matte et al., Walmart Global Tech Blog, Jun 2024)
Kafka consumers (2024): 25,000+ (Ravinder Matte et al., Walmart Global Tech Blog, Jun 2024)
Platform availability (2024): 99.99% (Ravinder Matte et al., Walmart Global Tech Blog, Jun 2024)
Customer Data Platform events/day: ~40 billion (Navinder Pal Singh Brar, Strata Data Conference NY, Sep 2019)
CDP p95 serving latency: <10 ms (Navinder Pal Singh Brar, Strata Data Conference NY, Sep 2019)
Replenishment throughput (2022): 85 GB/min (Suman Pattnaik, Confluent Blog, May 2022)
Replenishment brokers: 18 brokers; 500+ partitions per topic (Suman Pattnaik, Confluent Blog, May 2022)
Replenishment SKUs processed per run: Close to 100 million in <3 hours (Suman Pattnaik, Confluent Blog, May 2022)
Inventory events/day (2019): 500 million (projected to 650 million) (Pattnaik and Subburaj, Kafka Summit SF 2019)
Druid analytics events/day (2017): Roughly 1 billion (2 TB raw) (Amaresh Nayak, Walmart Global Tech Blog, Nov 2017)
Cassandra CDC changes/second: 60,000 (Harvester and Chhabra, Walmart Global Tech Blog, Jul 2022)
Early topology (2016): 7 geographic datacentres; 5 local + 2 aggregation clusters (Ning Zhang, Walmart Global Tech Blog, Nov 2016)
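
A quick back-of-envelope conversion puts the replenishment figures above into per-second terms. This is only a sketch: it assumes the 85 GB/min figure is an aggregate across all 18 brokers, that load is spread evenly, and that the run takes the full three hours. None of these assumptions is stated in the source.

```python
# Back-of-envelope conversions for the replenishment figures quoted above.
# Assumptions (not from the source): 85 GB/min is the aggregate across all
# 18 brokers, load is evenly spread, and the run takes the full 3 hours.

GB_PER_MIN = 85
BROKERS = 18
SKUS = 100_000_000
RUN_SECONDS = 3 * 60 * 60  # "under 3 hours", so the SKU rate is a lower bound

gb_per_sec = GB_PER_MIN / 60
gb_per_min_per_broker = GB_PER_MIN / BROKERS
skus_per_sec = SKUS / RUN_SECONDS

print(f"aggregate throughput: {gb_per_sec:.2f} GB/s")
print(f"per broker: {gb_per_min_per_broker:.2f} GB/min")
print(f"SKUs processed: at least {skus_per_sec:,.0f} per second")
```

Under those assumptions the platform moves about 1.42 GB/s in aggregate, or roughly 4.72 GB/min per broker, while working through at least 9,259 SKUs per second.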

Walmart's Kafka architecture

Multi-cluster topology

Walmart has consistently favoured many purpose-built clusters over a shared monolith. The early 2016 architecture ran 5 local clusters and 2 aggregation clusters across 7 geographic datacentres. Producers wrote to local clusters only; consumers listened to both local and mirrored topics simultaneously. Data was replicated between sites by a patched MirrorMaker that added complete topic renaming. The replication was designed to degrade gracefully on datacentre failure and to resume from the point of failure on recovery.

The item-setup pipeline used a separate "many small clusters with hundreds of topics" model (Anil Kumar, 2016). Each team maintained its own cluster, limiting cross-team coordination and blast radius.

WCNP and hybrid cloud (current)

Consumer applications now run as container images on WCNP (Walmart Cloud Native Platform), an enterprise-grade Kubernetes-based multi-cloud container orchestration framework spanning private cloud and public cloud (Azure and Google Cloud). The MPS architecture introduced in 2024 targets stateless consumer services that auto-scale on WCNP without touching Kafka partition counts.

Kafka Streams as a distributed NoSQL database

The Customer Backbone team uses Kafka Streams not just for stream processing but as the stateful data store for the Customer Data Platform. Each customer profile is maintained in RocksDB. Enriched profiles feed downstream Kafka Streams applications for recommendations and fraud detection, creating a graph of event-driven microservices. This pattern requires custom extensions to make Kafka Streams operationally stable at scale (see "Special techniques and engineering innovations" below).

Replenishment architecture

The replenishment system uses 18 Kafka brokers with 500+ partitions per topic. CDC feeds a denormalised view that passes through a planning engine. The system operates in an active-passive replication model between datacentres and includes a fallback REST service to a database if Kafka becomes unavailable.

Cassandra CDC pipeline

Debezium 1.9 reads Cassandra 4.x commit logs continuously and publishes changes to Kafka. A Redis + RedisBloom (Bloom filter) deduplication layer, partitioned hourly, handles the 9x record fan-out caused by 3-region, replication-factor-3 deployments. Walmart's internal Data Acquisition Tool (DAQ) orchestrates the pipeline on Azure.

Druid analytics pipeline

Kafka delivers events to Apache Storm (using Trident), which enriches them via a custom log scraper and writes to Druid. Druid pre-aggregates (rollup) at ingestion and stores data in inverted-index, columnar format for sub-second OLAP queries.
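
The rollup step is the part of this pipeline that most directly explains the query speed-up, so a minimal sketch may help. It is not Druid code: the dimension names (`page`, `country`), the metric names, and the one-minute granularity are all hypothetical, and the function only illustrates the idea of collapsing raw events into one pre-aggregated row per time bucket and dimension combination.

```python
from collections import defaultdict

def rollup(events, granularity_seconds=60):
    """Pre-aggregate raw events by (time bucket, dimensions), summing metrics."""
    rows = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        # Truncate the timestamp to the start of its bucket.
        bucket_start = event["ts"] - (event["ts"] % granularity_seconds)
        key = (bucket_start, event["page"], event["country"])
        rows[key]["count"] += 1
        rows[key]["bytes"] += event["bytes"]
    return dict(rows)

BASE = 1_700_000_040  # an epoch second that falls exactly on a minute boundary

raw_events = [
    {"ts": BASE + 5, "page": "/cart", "country": "US", "bytes": 120},
    {"ts": BASE + 30, "page": "/cart", "country": "US", "bytes": 80},
    {"ts": BASE + 50, "page": "/search", "country": "US", "bytes": 40},
    {"ts": BASE + 65, "page": "/cart", "country": "US", "bytes": 10},
]

rolled_up = rollup(raw_events)
print(f"{len(raw_events)} raw events -> {len(rolled_up)} stored rows")
```

Queries then scan the smaller set of pre-aggregated rows rather than every raw event, at the cost of losing the ability to inspect individual events below the chosen granularity.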

Producer architecture

For the replenishment system, Suman Pattnaik documented the following producer configuration in 2022:

Custom partitioner using a murmur hash function to ensure each item-store combination lands on a single partition
linger.ms tuning for batching
batch.size set to 16,000 bytes
acks=all
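
The settings listed above can be written out as a plain configuration mapping. Only the `batch.size` and `acks` values come from the source. The `linger.ms` value, the partitioner class name, and the 500-byte record size used in the helper are placeholders chosen for illustration.

```python
# Producer settings from the list above, as a plain mapping of Kafka
# producer property names to values.
producer_config = {
    "acks": "all",         # from the source
    "batch.size": 16000,   # bytes, from the source
    "linger.ms": 20,       # hypothetical: the source gives no value
    "partitioner.class": "com.example.ItemStorePartitioner",  # hypothetical name
}

def records_per_full_batch(config, avg_record_bytes):
    """Rough count of records that fit in one batch before it is sent."""
    return config["batch.size"] // avg_record_bytes

# With a hypothetical 500-byte average record, a batch fills at 32 records;
# otherwise it is sent once linger.ms has elapsed.
print(records_per_full_batch(producer_config, avg_record_bytes=500))
```

The trade-off being tuned is latency against throughput: a longer linger gives batches more time to fill, which reduces request count, while `acks=all` waits for the in-sync replicas before a write is acknowledged.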

Consumer architecture

Consumer configuration for the replenishment system includes: max.poll.records and max.poll.interval.ms explicitly tuned, auto-commit disabled, and session timeout and heartbeat interval explicitly set.
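
These consumer settings interact, so a small sanity check can be useful when tuning them. The sketch below uses standard Kafka consumer property names, but every numeric value is hypothetical because the source does not publish the ones Walmart uses. The `check_consumer_config` helper is likewise an illustration, not part of any Kafka client.

```python
consumer_config = {
    "enable.auto.commit": False,       # auto-commit disabled, as in the source
    "max.poll.records": 200,           # hypothetical value
    "max.poll.interval.ms": 300_000,   # hypothetical value
    "session.timeout.ms": 30_000,      # hypothetical value
    "heartbeat.interval.ms": 10_000,   # hypothetical value
}

def check_consumer_config(config, worst_case_ms_per_record):
    """Return a list of likely problems with the given settings."""
    problems = []
    if config["enable.auto.commit"]:
        problems.append("auto-commit is on: offsets may be committed before processing")
    # Kafka's guidance is a heartbeat interval no higher than 1/3 of the session timeout.
    if config["heartbeat.interval.ms"] * 3 > config["session.timeout.ms"]:
        problems.append("heartbeat interval is more than a third of the session timeout")
    # If one poll's worth of records takes longer than max.poll.interval.ms,
    # the consumer is removed from the group and a rebalance follows.
    batch_ms = config["max.poll.records"] * worst_case_ms_per_record
    if batch_ms >= config["max.poll.interval.ms"]:
        problems.append("a full batch can exceed max.poll.interval.ms")
    return problems

print(check_consumer_config(consumer_config, worst_case_ms_per_record=1000))
print(check_consumer_config(consumer_config, worst_case_ms_per_record=2000))
```

With these placeholder values, a worst case of one second per record passes all three checks, while two seconds per record pushes a full batch past the poll interval.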

As of 2024, many consumer workloads are decoupled from direct Kafka consumption via the Messaging Proxy Service (described below).

Kafka Connect ecosystem

Kafka Connect is used to sink topics to HDFS for long-term storage, with max.poll.interval.ms, max.poll.records, and session.timeout.ms tuned for the HDFS connector. Debezium (acting as a Kafka Connect source) handles Cassandra CDC ingestion.

Special techniques and engineering innovations

Messaging Proxy Service (MPS)

The most significant architectural innovation in Walmart's recent Kafka history is the Messaging Proxy Service, published by Ravinder Matte, Vilas Athavale, Sid Anand, and colleagues in June 2024.

The standard approach to scaling Kafka consumers is to increase partition count to match parallelism. The problem with this at Walmart's scale is that consumer group rebalances become frequent and disruptive: pod scaling events, rolling deployments, or processing delays exceeding max.poll.interval.ms all trigger rebalances that cause consumer lag and SLA violations.

MPS inserts an HTTP proxy layer between Kafka and consumer services:

A single reader thread polls Kafka and writes to a bounded in-memory PendingQueue
A writer thread pool exposes stateless REST endpoints that consumer pods call
An Order Iterator preserves per-key ordering across the proxy boundary
A Dead Letter Queue handles poison-pill messages
A separate offset commit thread manages Kafka offsets, decoupling commit timing from processing

Consumer pods become stateless services that can auto-scale on Kubernetes without triggering Kafka rebalances. Consumer count and partition count are fully decoupled. According to the Walmart engineering team, this eliminated most rebalances in production, with the exception of rare cluster restarts or network events.
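
The components listed above can be modelled in a few dozen lines. The sketch below is a toy, single-partition, single-threaded model and not Walmart's implementation: the class and method names are invented, the REST layer and thread pools are replaced by direct method calls, and the retry limit is an assumption. It shows how a bounded pending queue, per-key ordering, a dead letter queue, and a separate commit step fit together.

```python
from collections import deque

class MessagingProxy:
    """Toy model of a consumer-side proxy; all names here are hypothetical."""

    def __init__(self, log, queue_size=4, max_attempts=3):
        self.log = log              # list of (key, value); list index = offset
        self.next_offset = 0        # next offset the reader will poll
        self.pending = deque()      # bounded queue of offsets awaiting work
        self.queue_size = queue_size
        self.in_flight = {}         # key -> offset currently with a worker
        self.attempts = {}          # offset -> number of failed attempts
        self.done = set()           # offsets processed or dead-lettered
        self.dead_letters = []      # offsets parked as poison pills
        self.committed = 0          # offset to resume from after a restart
        self.max_attempts = max_attempts

    def poll(self):
        """Reader: top up the bounded queue from the partition."""
        while len(self.pending) < self.queue_size and self.next_offset < len(self.log):
            self.pending.append(self.next_offset)
            self.next_offset += 1

    def fetch(self):
        """Hand a worker the oldest message whose key has nothing in flight."""
        for offset in self.pending:
            key, value = self.log[offset]
            if key not in self.in_flight:
                self.pending.remove(offset)
                self.in_flight[key] = offset
                return offset, key, value
        return None

    def ack(self, offset, ok=True):
        """Worker reports the outcome; failures are retried, then dead-lettered."""
        del self.in_flight[self.log[offset][0]]
        if not ok:
            self.attempts[offset] = self.attempts.get(offset, 0) + 1
            if self.attempts[offset] < self.max_attempts:
                self.pending.appendleft(offset)  # retry ahead of later same-key messages
                return
            self.dead_letters.append(offset)
        self.done.add(offset)

    def commit(self):
        """Committer: advance past the contiguous prefix of finished offsets."""
        while self.committed in self.done:
            self.done.remove(self.committed)
            self.committed += 1
        return self.committed

log = [("store1", "a"), ("store2", "b"), ("store1", "c"), ("store3", "POISON")]
proxy = MessagingProxy(log)
proxy.poll()
first = proxy.fetch()    # offset 0, key store1
second = proxy.fetch()   # offset 1, key store2
third = proxy.fetch()    # offset 3: offset 2 is held back while store1 is in flight
proxy.ack(second[0])
early_commit = proxy.commit()   # offset 0 is unfinished, so nothing is committed yet
proxy.ack(first[0])
proxy.ack(third[0], ok=False)   # first failure of the poison pill
while (message := proxy.fetch()) is not None:
    offset, _key, value = message
    proxy.ack(offset, ok=(value != "POISON"))
final_commit = proxy.commit()
print(early_commit, final_commit, proxy.dead_letters)
```

In this model, workers never join a consumer group, so adding or removing them cannot trigger a rebalance. The cost is that the proxy itself now has to track in-flight work and only commit offsets whose predecessors have all finished.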

Cold Bootstrap for Kafka Streams recovery

Deepak Goyal presented this technique at Kafka Summit NYC 2019. Kafka Streams recovers stateful tasks by replaying changelog topics, but for the Customer Data Platform's large RocksDB stores, those changelogs held gigabytes of data, making standby recovery very slow.

Cold Bootstrap replaces changelog replay with a direct RocksDB snapshot copy: when a node fails, the standby copies the active node's RocksDB snapshot directly via JSch, then resumes from the saved offset. This achieves zero event loss with significantly faster recovery, and eliminates the need for indefinite changelog topic retention.
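
The difference between the two recovery paths can be shown with a small model. In this sketch a Python dict stands in for the RocksDB store, a list stands in for the event log, and copying the dict stands in for the file transfer. The function names and the counter-style state are invented for illustration.

```python
def apply_event(state, event):
    """Fold one (key, delta) event into the state store."""
    key, delta = event
    state[key] = state.get(key, 0) + delta

def recover_by_replay(log):
    """Default-style recovery: rebuild state by replaying the whole log."""
    state = {}
    for event in log:
        apply_event(state, event)
    return state, len(log)

def recover_from_snapshot(snapshot_state, snapshot_offset, log):
    """Cold-bootstrap-style recovery: copy a snapshot, then replay only the tail."""
    state = dict(snapshot_state)        # stands in for copying the store's files
    tail = log[snapshot_offset:]        # events after the offset saved with the snapshot
    for event in tail:
        apply_event(state, event)
    return state, len(tail)

log = [("c1", 1), ("c2", 5), ("c1", 2), ("c2", -1), ("c1", 4)]

# Snapshot taken by the active node after the first three events.
snapshot_state, snapshot_offset = recover_by_replay(log[:3])

replayed_state, replayed_events = recover_by_replay(log)
bootstrapped_state, tail_events = recover_from_snapshot(snapshot_state, snapshot_offset, log)

print(replayed_state == bootstrapped_state, replayed_events, tail_events)
```

Both paths end in the same state, but the snapshot path replays only the events after the saved offset. This is why the snapshot must be stored together with the offset it corresponds to.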

Dynamic repartitioning in Kafka Streams

Goyal's team added support for elastic repartitioning of Kafka Streams state across new partitions at runtime, enabling scaling to any number of partitions and nodes without a full restart. This was not available in stock Kafka Streams at the time.

Rack/AZ-aware task assignment

Active and standby Kafka Streams tasks for the same partition are explicitly assigned to different racks or availability zones, improving resilience without additional hardware.
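
A minimal sketch of the constraint described above: for each partition, pick an active node, then pick a standby from a different rack. The round-robin placement and the node and rack names are assumptions; the source does not describe the assignment algorithm Walmart uses.

```python
def assign_tasks(partitions, node_racks):
    """Map each partition to (active_node, standby_node) on different racks.

    node_racks is a dict of node name -> rack or availability zone.
    """
    nodes = sorted(node_racks)
    assignment = {}
    for index, partition in enumerate(partitions):
        active = nodes[index % len(nodes)]
        # Standby candidates are nodes outside the active node's rack.
        candidates = [n for n in nodes if node_racks[n] != node_racks[active]]
        if not candidates:
            raise ValueError("rack-aware placement needs at least two racks")
        standby = candidates[index % len(candidates)]
        assignment[partition] = (active, standby)
    return assignment

node_racks = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b", "n4": "rack-b"}
assignment = assign_tasks(range(4), node_racks)
for partition, (active, standby) in assignment.items():
    print(partition, active, node_racks[active], standby, node_racks[standby])
```

With this placement, losing a whole rack still leaves one copy of every partition's state on the surviving rack.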

Availability-first Kafka Streams (KIP-535 and KIP-562)

Navinder Pal Singh Brar presented this work at Kafka Summit 2020. The fraud detection application requires Kafka Streams reads on every transaction, but Kafka Streams' default behaviour blocks reads during consumer rebalances. At Walmart's transaction volume, this was incompatible with latency requirements.

Walmart contributed two KIPs to the Apache Kafka project:

KIP-535: Read from replica standbys and read while rebalancing — explicitly trading some consistency for higher availability
KIP-562: Read from the store of a specific partition

Both were merged into Apache Kafka. Brar holds four US patents related to Kafka Streams and was named a Confluent Community Catalyst.
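
The availability trade-off behind KIP-535 can be sketched as a routing decision. This is a simplified model and not the Kafka Streams API: the replica records, the `choose_replica` function, and the idea of a caller-supplied lag bound are illustrative assumptions about how an application might decide when a stale read is acceptable.

```python
def choose_replica(replicas, max_lag):
    """Pick a host to serve a read for one partition's state store.

    Prefers the active copy. If it is unavailable (for example, mid-rebalance),
    falls back to the freshest available standby whose lag, in offsets, is
    within max_lag. Returns None if nothing acceptable can serve the read.
    """
    for replica in replicas:
        if replica["role"] == "active" and replica["available"]:
            return replica["host"]
    standbys = [
        r for r in replicas
        if r["role"] == "standby" and r["available"] and r["lag"] <= max_lag
    ]
    if standbys:
        return min(standbys, key=lambda r: r["lag"])["host"]
    return None

replicas = [
    {"host": "node-a", "role": "active", "available": False, "lag": 0},
    {"host": "node-b", "role": "standby", "available": True, "lag": 120},
    {"host": "node-c", "role": "standby", "available": True, "lag": 15},
]

print(choose_replica(replicas, max_lag=50))   # active is down, node-c is fresh enough
print(choose_replica(replicas, max_lag=10))   # no standby is within the bound
```

A strict-consistency reader would behave as if `max_lag` were zero and fail whenever the active copy is unavailable. Raising the bound is the explicit trade of consistency for availability.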

Custom partitioner for dirty-write prevention

The replenishment system uses a custom partitioner based on a murmur hash function, ensuring each item-store combination always lands on a single partition and is consumed by a single consumer. This prevents database deadlocks from concurrent writes that would occur if the same item-store pair were processed by multiple consumers simultaneously.
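
The idea can be illustrated with a hash-based partition function. The sketch below implements one common 32-bit MurmurHash2 formulation in pure Python. The source does not say which Murmur variant, seed, or key encoding Walmart uses, so the seed, the `item|store` key format, and the function names are all assumptions.

```python
def murmur2_32(data: bytes, seed: int = 0x9747B28C) -> int:
    """32-bit MurmurHash2 of a byte string, returned as an unsigned integer."""
    m, r, mask = 0x5BD1E995, 24, 0xFFFFFFFF
    length = len(data)
    h = (seed ^ length) & mask
    tail_start = length - (length % 4)
    # Mix the input four bytes at a time (little-endian words).
    for i in range(0, tail_start, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = ((h * m) & mask) ^ k
    # Mix in the remaining 1 to 3 bytes, if any.
    tail = data[tail_start:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h

def partition_for(item_id: str, store_id: str, num_partitions: int) -> int:
    """Route an item-store pair to a stable partition number."""
    key = f"{item_id}|{store_id}".encode("utf-8")   # hypothetical key format
    return (murmur2_32(key) & 0x7FFFFFFF) % num_partitions

partition = partition_for("item-1001", "store-42", 500)
same_again = partition_for("item-1001", "store-42", 500)
print(partition == same_again)
```

Because the partition depends only on the item-store key, and each partition is read by a single consumer in the group, all updates for a given pair are applied by one consumer in order. Note that the mapping changes if the partition count changes.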

Bloom-filter deduplication for Cassandra CDC

Each Cassandra write in a 3-region, RF=3 cluster generates 9 CDC records. Out-of-order delivery and partial column updates compound the problem. Walmart uses Redis-backed RedisBloom Bloom filters partitioned hourly, with error rates tested at 1 in 1 million, to deduplicate records before they reach downstream consumers. Production configuration achieved 623,000 deduplicated records per minute across 3 nodes.

Walmart also enhanced Debezium to process Cassandra commit logs without waiting for log file completion, reducing CDC latency further.
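
The Bloom-filter deduplication described above can be sketched without Redis. The code below is a pure-Python stand-in and not RedisBloom: the filter size, the hash count, the record-identifier format, and the class names are all assumptions. It shows the two ideas from the text: one filter per hour, and dropping a record when the filter reports it has probably been seen.

```python
import hashlib

class BloomFilter:
    """Small in-memory Bloom filter using double hashing over SHA-256."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add_if_new(self, item: bytes) -> bool:
        """Record the item; return True only if it was definitely not seen before."""
        is_new = False
        for position in self._positions(item):
            byte_index, bit_index = divmod(position, 8)
            if not self.bits[byte_index] & (1 << bit_index):
                is_new = True
                self.bits[byte_index] |= 1 << bit_index
        return is_new

class HourlyDeduplicator:
    """Keeps one Bloom filter per hour so old filters can be dropped whole."""

    def __init__(self):
        self.filters = {}  # hour bucket -> BloomFilter

    def is_first_copy(self, record_id: str, event_ts: int) -> bool:
        hour = event_ts // 3600
        if hour not in self.filters:
            self.filters[hour] = BloomFilter()
        return self.filters[hour].add_if_new(record_id.encode("utf-8"))

    def expire_before(self, ts: int):
        cutoff = ts // 3600
        self.filters = {h: f for h, f in self.filters.items() if h >= cutoff}

REGIONS, REPLICATION_FACTOR = 3, 3
copies_per_write = REGIONS * REPLICATION_FACTOR  # the 9x fan-out from the text

dedup = HourlyDeduplicator()
ts = 1_700_000_000
# Hypothetical record id: primary key plus write timestamp.
copies = [("order-1:1700000000123", ts)] * copies_per_write
emitted = [record for record in copies if dedup.is_first_copy(*record)]
print(copies_per_write, len(emitted))
```

A Bloom filter can return false positives, which here would mean occasionally dropping a record that was in fact new. That is why the tested error rate matters. Partitioning by hour bounds each filter's size and lets expired filters be discarded in one step.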

Operating Kafka at scale

Deployment model

Walmart's Kafka infrastructure spans self-managed clusters on private cloud (via OneOps historically, WCNP currently) and public cloud (Azure and Google Cloud). The deployment is hybrid: Kafka brokers remain within Walmart's infrastructure while consumer workloads run on WCNP across clouds.

Monitoring stack (2016)

Ning Zhang documented the monitoring stack in December 2016. Walmart forked Yahoo's Kafka Manager and added custom monitoring graphs. jmxtrans bridges Kafka JMX metrics to Graphite and Ganglia backends; Grafana is layered on top of Graphite for dashboards. Alerts covered: Kafka process down, under-replicated partitions, leader loss, low disk space, and high CPU utilisation.

No sourced material describes how the monitoring stack evolved post-2016 when Walmart moved to WCNP and Kubernetes.

Producer and consumer tuning

For the replenishment system, Pattnaik documented explicit tuning of linger.ms, batch.size (16,000 bytes), and acks=all on the producer side, with max.poll.records, max.poll.interval.ms, and session timeout configured on the consumer side.

Broker hardware

For the inventory system, Suman Pattnaik specified that brokers use multiple disks with RAID configurations for log.dirs storage, that consumer counts are aligned to partition counts, and that CPU and memory are kept at around 50% utilisation to leave headroom.

Developer experience

The self-service OneOps model (documented by Ning Zhang in 2016) gave teams a GUI for deploying dedicated Kafka clusters, covering operations, monitoring, auto-repair, auto-replacement, and auto-scaling. Anil Kumar's team built internal tooling for tracking pipeline flows, SLA metrics, message send/receive latencies per producer and consumer, and alerting on backlogs and throughput degradation. A Kafka REST proxy was deployed so that non-JVM services could produce and consume without native Kafka client libraries.

Challenges and how they solved them

Consumer rebalancing causing lag at scale

Ravinder Matte's team described consumer group rebalancing as "the most common challenge in operationalising Kafka consumers at scale" in the June 2024 post. At Walmart's scale — 25,000+ consumers — rebalances triggered by pod scaling, rolling deployments, or processing delays exceeding max.poll.interval.ms caused high consumer lag that disrupted SLAs.

Solution: The Messaging Proxy Service decouples Kafka partition consumption from consumer services. Most rebalances are now eliminated except for rare cluster restarts or network events.

Slow Kafka Streams standby recovery

Changelog topics holding gigabytes of state for the Customer Data Platform caused very slow recovery when a Kafka Streams node failed, as the standby had to replay the entire log.

Solution: Cold Bootstrap — the standby copies the active node's RocksDB snapshot directly via JSch, then resumes from the saved offset. This avoids replaying the changelog, achieves zero event loss, and eliminates the need for indefinite changelog retention. (Deepak Goyal, Kafka Summit NYC 2019.)

Inability to elastically scale Kafka Streams

Kafka Streams lacked native repartitioning support, which prevented the Customer Data Platform from scaling its stateful applications without full restarts.

Solution: Walmart added dynamic repartitioning to Kafka Streams, redistributing state across new partitions at runtime. This was later contributed to the community via the KIP process. (Deepak Goyal, Kafka Summit NYC 2019.)

Availability vs. consistency in fraud detection

Kafka Streams' default behaviour blocks reads during consumer rebalances. For a fraud detection application running on every transaction, this was incompatible with latency SLAs.

Solution: Walmart contributed KIP-535 and KIP-562 to Apache Kafka, explicitly choosing availability over strict consistency for this workload. "You're basically trading consistency for availability," Brar stated in a TechTarget interview. (Navinder Pal Singh Brar, Kafka Summit 2020.)

Shared cluster resource contention

Early shared bare-metal clusters suffered from capacity competition between tenants, no authentication or isolation, and buggy clients exhausting resources.

Solution: Migrated to dedicated clusters via OneOps, giving each team its own Kafka cluster with GUI-driven operations, auto-repair, and auto-scaling. (Ning Zhang, November 2016.)

Cassandra CDC record fan-out

A single Cassandra write in a 3-region, RF=3 deployment generates 9 CDC records, arriving out of order and containing only changed columns.

Solution: Debezium 1.9 in continuous log processing mode, combined with Redis-backed RedisBloom Bloom filters partitioned hourly, deduplicates records at 60,000 changes per second in production. Walmart also enhanced Debezium to process commit logs without waiting for file completion. (Scott Harvester and Nitin Chhabra, July 2022.)

Poison-pill messages causing head-of-line blocking

Unprocessable messages blocked partition consumption in direct consumer deployments.

Solution: The Messaging Proxy Service routes failed messages to a Dead Letter Queue with retry logic, isolating them from the main processing path. (Ravinder Matte et al., June 2024.)

Full tech stack

Category	Tools	Notes
Message broker	Apache Kafka	Kafka 0.10.1.0 documented in 2016; current version not specified in sourced material
Stream processing	Kafka Streams, Apache Storm (Trident), Apache Spark Streaming	Kafka Streams: Customer Data Platform, fraud detection, inventory; Storm: Druid ingestion; Spark: inventory and A/B testing
State store	RocksDB	Local state store for Kafka Streams applications in the CDP
Connectors	Kafka Connect (HDFS sink), Debezium 1.9 (Cassandra CDC source)	HDFS connector for long-term storage; Debezium enhanced for continuous log processing
Cross-datacentre replication	Kafka MirrorMaker (patched)	Internal patches added complete topic renaming to prevent name collisions across sites (2016)
HTTP interface	Kafka REST Proxy	Enables non-JVM producers and consumers (2016)
Analytics	Apache Druid	Sub-second OLAP queries on roughly 1 billion events/day; pre-aggregation at ingestion
Operational database	Apache Cassandra	Primary operational store for inventory and customer data; CDC source for Debezium
Deduplication	Redis, RedisBloom	Bloom-filter deduplication for Cassandra CDC fan-out at 60,000 changes/second
Storage sinks	Apache HDFS, Apache Hudi	HDFS via Kafka Connect for long-term retention; Hudi for lakehouse table format (as of 2023)
Monitoring (2016)	Kafka Manager (Yahoo fork, enhanced), jmxtrans, Graphite, Ganglia, Grafana	Custom monitoring graphs added to Kafka Manager; Grafana layered on Graphite
Cluster coordination	Apache ZooKeeper	Referenced in 2019 material; current coordination mechanism not specified
Deployment / infra (legacy)	OneOps (OpenStack-based)	Self-serve Kafka cluster deployment with GUI, auto-repair, and auto-scaling (pre-WCNP)
Deployment / infra (current)	WCNP (Walmart Cloud Native Platform)	Kubernetes-based multi-cloud orchestration spanning Azure and Google Cloud
Internal proxy	Messaging Proxy Service (MPS)	Internal HTTP proxy decoupling Kafka consumption from consumer pod scaling (2024)
Client libraries	Spring Kafka, Akka Streams	Spring Kafka in replenishment system; Akka Streams for early reactive microservice consumption (2016)
Public cloud	Azure, Google Cloud	Both used for WCNP workloads; Azure hosts Cassandra CDC pipeline

Key contributors

Ning Zhang (Software Engineer, Walmart Global Tech): Authored two foundational 2016 posts on Walmart's Kafka ecosystem and multi-datacentre architecture
Anil Kumar (Global eCommerce Engineer, Walmart Labs): Authored the 2016 Confluent post on Kafka for item setup and near-real-time search indexing
Suman Pattnaik (Director of Engineering, Walmart): Architected real-time inventory and replenishment platforms; co-presented at Kafka Summit SF 2019; authored Confluent blog posts (2020, 2022)
Prasanna Subburaj (Engineer, Walmart): Co-presented "When Kafka meets the scaling and reliability needs of world's largest retailer" at Kafka Summit SF 2019
Deepak Goyal (Engineer, Walmart Labs, Customer Backbone team): Developed Cold Bootstrap, dynamic repartitioning, and rack-aware task assignment for Kafka Streams; presented at Kafka Summit NYC 2019
Navinder Pal Singh Brar (Staff Engineer, Walmart Global Tech, Customer Data Platform): Founding member of Walmart's CDP; contributed KIP-535 and KIP-562 to Apache Kafka; holds 4 US patents; Confluent Community Catalyst
Amaresh Nayak (Engineer, Walmart Global Tech): Authored the 2017 post on Druid + Kafka event stream analytics
Scott Harvester (Engineer, Walmart Global Tech): Co-authored the 2022 Cassandra CDC solution post
Nitin Chhabra (Engineer, Walmart Global Tech): Co-authored the 2022 Cassandra CDC post
Ravinder Matte (Engineer, Walmart Global Tech): Lead author of the 2024 Messaging Proxy Service post
Chris Kasten (VP of Walmart Cloud): Discussed Walmart's real-time platform in Kafka Summit SF 2019 keynote Q&A

Key takeaways for your own Kafka implementation

Decouple consumers from partitions before you hit rebalance problems. Walmart's Messaging Proxy Service (2024) addressed a challenge that many large-scale Kafka deployments encounter: consumer group rebalances triggered by pod scaling and rolling deployments. If you're running containerised consumers on Kubernetes, consider whether a proxy layer that absorbs Kafka consumption and exposes stateless REST endpoints would reduce operational complexity before the problem becomes acute.
Consider Kafka Streams changelog recovery carefully when your state is large. Walmart's Cold Bootstrap approach (replacing changelog replay with direct RocksDB snapshot copy via JSch) addressed a real recovery bottleneck. If your Kafka Streams state stores are measured in gigabytes, the default changelog replay strategy may result in slow standby recovery. Understanding this trade-off early lets you design around it.
Many small clusters often work better than one large shared cluster. From 2016, Walmart operated purpose-built clusters per team and per pipeline rather than a shared monolith. This limits blast radius, allows per-cluster tuning, and avoids cross-team coordination overhead in operations like broker upgrades and partition reassignments.
Custom partitioners are worth the complexity when ordering or write-isolation matters. Walmart's murmur-hash partitioner, which routes each item-store combination to a single partition, prevented database deadlocks that would otherwise occur from concurrent writes. If your consumers write to a keyed data store, a custom partitioner aligned to the store's key can eliminate a class of concurrency bugs.
Open-source contributions can close gaps faster than internal workarounds. Walmart's KIP-535 and KIP-562 contributions (availability-first Kafka Streams for fraud detection) were eventually merged into Apache Kafka. Contributing upstream meant the solution became part of the core, reducing the maintenance burden of carrying a private fork.

Sources and further reading

Ning Zhang, "Tech Transformation: Real-time Messaging at Walmart," Walmart Global Tech Blog, November 2016
Ning Zhang, "Kafka Ecosystem on Walmart's Cloud," Walmart Global Tech Blog, December 2016
Anil Kumar, "Apache Kafka for Item Setup at Walmart," Confluent Blog (Walmart Labs), October 2016
Amaresh Nayak, "Event Stream Analytics at Walmart with Druid," Walmart Global Tech Blog, November 2017
Suman Pattnaik and Prasanna Subburaj, "When Kafka Meets the Scaling and Reliability Needs of World's Largest Retailer," Kafka Summit SF 2019
Deepak Goyal, "Kafka Streams at Scale: Walmart's Approach," Kafka Summit NYC 2019
Navinder Pal Singh Brar, "Multi-tenant Kafka Streams CDP at Walmart," Strata Data Conference New York, September 2019
Suman Pattnaik, "Real-time Inventory Management at Walmart Using Kafka," Confluent Blog, May 2020
Navinder Pal Singh Brar, "Availability-first Kafka Streams at Walmart Scale," Kafka Summit 2020
Sean Michael Kerner (citing Navinder Pal Singh Brar), "Kafka Users Northrop Grumman, Walmart Highlight Event Streaming," TechTarget, August 2020
Suman Pattnaik, "How Walmart Uses Kafka for Real-time Omnichannel Replenishment," Confluent Blog, May 2022
Scott Harvester and Nitin Chhabra, "Walmart's Cassandra CDC Solution," Walmart Global Tech Blog, July 2022
Samuel Guleff, "Lakehouse at Fortune 1 Scale," Walmart Global Tech Blog, May 2023
Ravinder Matte, Vilas Athavale, Sid Anand et al., "Reliably Processing Trillions of Kafka Messages Per Day," Walmart Global Tech Blog, June 2024
