
How Grab uses Apache Kafka in production
Grab's engineering team built one of Southeast Asia's most sophisticated real-time data platforms on top of Apache Kafka, and then spent five years publishing exactly how they did it. The Coban team's body of work is unusually candid: cross-AZ traffic costs that consumed half their Kafka budget, broker crashes that required manual intervention, partition rebalancing that took six hours and caused prolonged latency spikes. What makes Grab's story worth studying is not just the scale, but the systematic way they worked through each operational constraint and wrote it up.
The platform processes more than 300 billion events per week at terabytes of ingress per hour. As of late 2023, the Coban control plane managed over 5,000 data streaming resources across every Grab vertical.
Company Overview
Grab is Southeast Asia's leading superapp, operating across mobility, food delivery, package delivery, payments, and financial services in eight countries. The company serves hundreds of millions of consumers across more than 400 cities.
Kafka sits at the centre of Grab's event-driven architecture. Every time a user books a ride, a series of events ripples through booking state machines, driver notification systems, rewards computation, personalisation pipelines, and analytics dashboards. The Coban team, whose name comes from a waterfall in Indonesia, was formed to build and operate the streaming infrastructure underneath all of it. Their mandate: provide a NoOps, managed platform for seamless, secure access to event streams in real time, for every team at Grab.
Key Kafka milestones:
- 2021: a company-wide chaos engineering campaign validates MirrorMaker2-based cross-region failover.
- Late 2023: the Coban control plane manages more than 5,000 data streaming resources across every Grab vertical.
- 2025: AutoMQ adopted for high-elasticity clusters, cutting partition migration from six hours to seconds.
- 2025: LLM-assisted data quality monitoring goes live, covering more than 100 critical Kafka topics.
Grab's Kafka Use Cases
Kafka at Grab is not a single-purpose bus. The Coban platform serves transactional and analytical use cases simultaneously, across every vertical in the company.
Real-time event sourcing is the foundational use case. Every booking, payment, food order, and driver status change produces events that fan out to multiple downstream consumers. Booking state machines, reward point computation, feed generation, and personalisation all run asynchronously from a centralised Kafka event log. Services apply changes from this log to their own state independently, without blocking each other.
Change Data Capture eliminated a dual-write problem that was widespread across Grab's MySQL-heavy backend. Services previously had to write to both MySQL and Kafka atomically, which required two-phase commit and created consistency risks. With Debezium connectors running on Kafka Connect, services write only to MySQL, and Coban's CDC pipeline captures binlog changes and publishes them to Kafka automatically, including before-and-after row snapshots. GrabFood's Elasticsearch index synchronisation was one of the most significant early adopters.
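A sketch of such a connector registration on Kafka Connect, using standard Debezium 2.x MySQL connector properties; the hostnames, credentials, topics, and table names below are purely illustrative, not Grab's actual configuration:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.orders.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<redacted>",
    "database.server.id": "184054",
    "topic.prefix": "orders",
    "table.include.list": "orders.bookings",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.orders"
  }
}
```

Each captured row change lands in Kafka as an envelope with `before` and `after` fields, which is exactly what consumers like index synchronisers need.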
Disaster recovery and cross-cluster migration are handled through MirrorMaker2 deployed on Kafka Connect. Critical services can fail over across AWS regions with zero message loss and seamless offset resumption. Coban validated this during a company-wide chaos engineering campaign in 2021. MM2 has also been used to migrate streams from self-managed Kafka clusters into the Coban platform with zero producer or consumer downtime.
Real-time analytics pipelines connect Kafka to Apache Pinot. Partner Gateway API metrics flow from Kafka producers through Flink (for serialisation and transformation) into Pinot topics, where they power sub-second time-series dashboards and anomaly detection for partners monitoring their API integrations.
Data lake ingestion is one of Coban's core responsibilities. The platform is the entry point to Grab's data lake, ingesting events from all services for storage and batch or streaming analysis downstream.
Stream processing pipelines for ML features and business intelligence run as either Apache Flink jobs or Go-based Stream Processing Framework (SPF) pipelines on Kubernetes. Time-windowed aggregations, stream joins, filtering, and mapping are all standard operations. For example, personalising the Grab app landing page requires counting a user's interactions with widget elements across a recent time window, pre-aggregated rather than computed on demand.
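A minimal Go sketch of that kind of tumbling-window count, using the open-source segmentio/kafka-go client in place of Grab's internal SDK; the broker address, topic, group ID, and window size are invented for illustration:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Plain consumer-group reader, standing in for an SPF pipeline's source.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},     // illustrative
		GroupID: "widget-impression-aggregator", // illustrative
		Topic:   "app.widget.impressions",       // illustrative
	})
	defer r.Close()

	const window = time.Minute
	counts := map[string]int{} // interactions per user in the current window
	deadline := time.Now().Add(window)

	for {
		ctx, cancel := context.WithDeadline(context.Background(), deadline)
		msg, err := r.ReadMessage(ctx)
		cancel()
		if err != nil {
			if !errors.Is(err, context.DeadlineExceeded) {
				log.Fatalf("read: %v", err)
			}
			// Window elapsed: emit the aggregate downstream and reset.
			log.Printf("window closed: %d active users", len(counts))
			counts = map[string]int{}
			deadline = time.Now().Add(window)
			continue
		}
		counts[string(msg.Key)]++ // message key assumed to carry the user ID
	}
}
```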
Real-time data quality monitoring was deployed in 2025. FlinkSQL jobs consume production Kafka topics, apply data contract rules (both syntactic and semantic), and halt propagation of invalid data before it cascades to downstream consumers. An LLM analyses Kafka stream schemas and anonymised sample data to recommend semantic validation rules, reducing the manual effort of defining per-field rules across hundreds of topics. As of late 2025, the system actively monitors more than 100 critical Kafka topics.
Scale
Grab's streaming platform operates at a scale that makes conventional approaches expensive and brittle: more than 300 billion events per week, terabytes of ingress per hour, and over 5,000 streaming resources under management across every vertical.
Architecture
Infrastructure layer. Grab's platform is hosted on AWS in a single region spanning three Availability Zones. All Kafka brokers and clients are distributed across these three AZs for high availability. Each Kafka partition has three replicas, distributed rack-aware, with one replica per AZ.
Kafka runs on Amazon EKS using the Strimzi operator. Each production Kafka cluster runs on a dedicated EKS cluster, with one Kafka broker per dedicated worker node, enforced via Kubernetes taints and tolerations. Storage uses AWS EBS gp3 volumes dynamically provisioned via the EBS CSI driver. This replaced an earlier design using NVMe instance store volumes, which could not survive worker node replacement without manual intervention.
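The one-broker-per-node placement can be expressed with a node taint on the dedicated Kafka workers plus a matching toleration in the Strimzi pod template. A sketch, with an invented taint key:

```yaml
# Taint applied to dedicated Kafka worker nodes (key/value invented):
#   kubectl taint nodes <node> dedicated=kafka:NoSchedule
# Matching toleration, as a fragment of the Strimzi Kafka resource:
spec:
  kafka:
    template:
      pod:
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
```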
Kafka clusters are accessible across VPC boundaries via AWS Network Load Balancers, with each broker advertised by a private zonal DNS name and a distinct TCP port. This enables deterministic per-broker connections required by Kafka producers and consumers. For cross-account access (for example, GrabKios in its own AWS account), a VPC Endpoint Service is used in place of VPC peering.
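On the broker side, this scheme boils down to per-broker advertised listeners pointing at the NLB. A sketch with invented DNS names and ports (under Strimzi, the operator generates this from the cluster's listener configuration rather than hand-written properties files):

```properties
# server.properties for broker 0, resolved via its private zonal DNS name
advertised.listeners=CLIENT://b-0.cluster-a.az1.grab.internal:9094

# server.properties for broker 1 — same DNS pattern, distinct TCP port
advertised.listeners=CLIENT://b-1.cluster-a.az2.grab.internal:9095
```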
Control plane. The Coban control plane has three tiers: the Coban UI, a self-service frontend that shows real-time metrics for Kafka clusters and topic byte rates and integrates with Grab's monitoring stack; Heimdall, the REST API behind it; and Khone, the Git repository holding every streaming resource as Terraform code. Heimdall abstracts the Terraform layer from users who prefer a UI or REST API, while Khone provides the Git-based audit trail, peer-review workflow, and CI pipeline.
Stream processing layer. Two frameworks handle stream processing at Grab. Apache Flink is used for stateful stream processing, including the data quality monitoring system, Partner Gateway analytics, and the ML feature pipelines. The Go-based Stream Processing Framework (SPF) runs as Kafka consumer pods on Kubernetes and handles hundreds of simpler pipelines with filtering, mapping, aggregation, and windowing operations.
AutoMQ adoption (2025). For a subset of clusters requiring high elasticity, Coban adopted AutoMQ, a Kafka-compatible broker with a shared EBS WAL and S3 storage architecture. With AutoMQ, partition reassignment no longer requires moving data between brokers, reducing migration time from six hours to seconds and delivering a 3x throughput improvement.
Special Techniques
Rack-aware closest-replica fetching. By default, Kafka consumers fetch from partition leaders, which reside in a different AZ 67% of the time. This generated cross-AZ traffic that represented approximately 50% of total Kafka platform cost. Coban upgraded their legacy Kafka clusters to version 3.1 and configured replica.selector.class=RackAwareReplicaSelector on brokers, with broker.rack set to the AZ ID. On the consumer side, the internal Golang Kafka SDK was updated so that teams can enable closest-replica fetching by exporting a single environment variable, with the SDK dynamically setting client.rack from EC2 instance metadata at startup. The same logic was applied to Flink pipelines and Kafka Connect connectors. After rollout, consumer cross-AZ traffic cost dropped to near zero. The trade-off: fetching from the closest replica can add up to 500 ms of end-to-end latency due to replication lag, which is acceptable for most flows; latency-sensitive pipelines continue to fetch from partition leaders.
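A minimal Go sketch of the client-side opt-in, using the open-source IBM/sarama client as a stand-in for Grab's internal SDK; the environment variable name and the use of IMDSv1 are assumptions for illustration:

```go
package kafkaclient

import (
	"io"
	"net/http"
	"os"

	"github.com/IBM/sarama" // stand-in for Grab's internal Kafka SDK
)

// azID fetches the AZ ID (e.g. "use1-az1") from EC2 instance metadata.
// IMDSv1 shown for brevity; production code would use IMDSv2 tokens.
func azID() (string, error) {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/placement/availability-zone-id")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// NewConfig mirrors the opt-in described above: one environment variable
// (name invented here) enables closest-replica fetching by setting the
// client's rack to the local AZ ID, matching broker.rack on the brokers.
func NewConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	if os.Getenv("KAFKA_FETCH_FROM_CLOSEST_REPLICA") == "true" {
		if az, err := azID(); err == nil {
			cfg.RackID = az // this client's equivalent of client.rack
		}
	}
	return cfg
}
```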
Protobuf SerDes with Confluent Schema Registry. Grab standardised on Protobuf for all Kafka message serialisation and deserialisation. Confluent Schema Registry sits at the centre of the Kafka Connect ecosystem, enabling format conversions: Protobuf to Avro, Protobuf to JSON, and Protobuf to Parquet. For cross-cloud ingestion to Azure Event Hubs (which does not support Protobuf natively), Coban built an in-house converter that deserialises Protobuf bytes using a schema retrieved from the registry, traverses the message tree recursively, converts each field to a JSON node, and serialises the result.
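The converter's core can be sketched with Go's protobuf runtime: dynamicpb materialises a message from a registry-fetched descriptor, and protojson performs the recursive field-by-field JSON conversion that Grab's converter implements by hand. The descriptorForSchema helper is hypothetical, standing in for the Schema Registry lookup, and the payload is assumed to have Confluent's wire-format header already stripped:

```go
package converter

import (
	"errors"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/dynamicpb"
)

// descriptorForSchema is a hypothetical helper standing in for the Confluent
// Schema Registry lookup that resolves a message descriptor by schema ID.
func descriptorForSchema(schemaID int) (protoreflect.MessageDescriptor, error) {
	return nil, errors.New("not implemented: registry lookup")
}

// ProtobufToJSON deserialises Protobuf bytes against the registry schema and
// re-serialises them as JSON via a recursive walk over every field.
func ProtobufToJSON(schemaID int, payload []byte) ([]byte, error) {
	md, err := descriptorForSchema(schemaID)
	if err != nil {
		return nil, err
	}
	msg := dynamicpb.NewMessage(md)
	if err := proto.Unmarshal(payload, msg); err != nil {
		return nil, err
	}
	return protojson.Marshal(msg)
}
```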
MirrorMaker2 deployed per-connector on Kafka Connect. Rather than running the standard three-connector MM2 bundle, Coban deploys MirrorSourceConnector and MirrorCheckpointConnector independently on the Kafka Connect framework. This provides finer control over each connector and enables IaC management via Terraform, with offset mirroring handled separately so consumers can seamlessly resume from backup clusters.
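Deployed this way, each MM2 connector is just another Kafka Connect connector definition. A sketch of what a MirrorSourceConnector registration might look like, with invented cluster aliases, bootstrap servers, and topic filter; a MirrorCheckpointConnector is registered alongside it in the same fashion to mirror consumer group offsets:

```json
{
  "name": "mm2-source-primary-to-backup",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "primary",
    "target.cluster.alias": "backup",
    "source.cluster.bootstrap.servers": "primary-kafka:9092",
    "target.cluster.bootstrap.servers": "backup-kafka:9092",
    "topics": "payments.*"
  }
}
```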
Zero-trust mTLS and OPA authorisation. mTLS is enforced for all client-broker and broker-broker communications via Strimzi. Short-lived ephemeral certificates are issued by a HashiCorp Vault PKI engine, with a Root CA per environment signed down to cluster-level and client-level intermediate CAs. Open Policy Agent (OPA) is integrated with Kafka brokers to enforce topic-level, least-privilege authorisation: each client is whitelisted for the specific topics and permissions (produce or consume) it strictly needs.
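In Strimzi terms, requiring client certificates is a listener-level setting on the Kafka custom resource, roughly as in the fragment below (listener name and port illustrative); Grab layers the Vault-issued ephemeral certificates and the OPA authorizer on top of this baseline:

```yaml
# Fragment of a Strimzi Kafka resource: a TLS listener requiring client certs.
spec:
  kafka:
    listeners:
      - name: client
        port: 9094
        type: internal
        tls: true            # encrypt client-broker traffic
        authentication:
          type: tls          # mutual TLS: clients must present a certificate
```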
ML-based predictive autoscaling for Flink. A time-series forecasting model trained on Kafka source topic throughput predicts CPU demand ahead of time. A separate regression model maps the forecast to required TaskManager CPU. By scaling vertically before demand spikes rather than reacting after, the system avoids the reactive scaling spirals common with HPA. Deployed to the majority of applicable Flink pipelines, the system delivered more than 35% reduction in cloud CPU cost.
LLM-assisted data contract rule generation. The data quality monitoring system uses an LLM to analyse Kafka stream schemas and anonymised sample data, recommending semantic test rules that would be impractical to define manually at scale across hundreds of topics. FlinkSQL auto-generates the test definitions from the approved contracts.
Operating Practices
Infrastructure-as-Code for all streaming resources. Every Kafka topic, cluster, Kafka Connect connector, Flink pipeline, and CDC pipeline is declared as Terraform. Changes go through Git MRs, peer review, and CI pipeline application via Atlantis. The Coban UI and Heimdall API abstract this for users who prefer not to write Terraform directly, but the underlying system remains fully auditable and reproducible.
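Grab's internal Terraform modules are not public, but the shape of a topic declaration can be sketched with the community Mongey/kafka provider as a stand-in; the topic name and settings are invented:

```hcl
# Illustrative only: a Kafka topic declared as code, reviewed via Git MR and
# applied by CI (Atlantis), as described above.
resource "kafka_topic" "booking_events" {
  name               = "mobility.booking.events" # invented topic name
  partitions         = 32
  replication_factor = 3

  config = {
    "retention.ms"   = "86400000" # 1 day
    "cleanup.policy" = "delete"
  }
}
```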
Cruise Control for partition balancing. Cruise Control is deployed alongside each Kafka cluster for partition leader rebalancing. It is used during Kafka upgrades and rolling updates to maintain cluster balance.
AWS Node Termination Handler for graceful broker shutdown. NTH runs in Queue Processor mode, capturing ASG scale-in events, manual instance terminations, and EC2 maintenance events via SQS and EventBridge. When a node is about to be terminated, NTH cordons and drains it, which sends a SIGTERM to the Kafka pod and triggers a graceful shutdown. Strimzi's terminationGracePeriodSeconds=180 gives brokers enough time to migrate all partition leaders, typically around 60 seconds for 600 partition leaders. The result is that unexpected node termination no longer requires engineer intervention.
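The grace period lives in the Strimzi pod template; a fragment of the Kafka resource might look like this:

```yaml
# Fragment of a Strimzi Kafka resource: give brokers time to migrate
# partition leaders away before the pod is killed.
spec:
  kafka:
    template:
      pod:
        terminationGracePeriodSeconds: 180
```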
Per-AZ consumer load monitoring. After enabling closest-replica fetching, CPU utilisation became skewed across AZs because consumers in well-provisioned AZs handled a disproportionate share of reads. Coban updated their internal Kafka SDK to expose AZ as a metric, enabling operators to proactively rebalance consumers across zones.
VPA with fixed pod count for Kafka consumer pipelines. SPF pipelines match pod count to the number of partitions in the source Kafka topic, fetched at runtime via the Kafka API. Vertical Pod Autoscaler handles CPU and memory allocation per pod based on historical load trends. This approach delivered approximately 45% reduction in total resource usage versus resources requested, compared to the previous HPA-based approach.
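A sketch of the partition-count lookup using IBM/sarama (broker address illustrative); the returned count becomes the pipeline's fixed replica count, leaving VPA to size each pod:

```go
package spf

import "github.com/IBM/sarama"

// partitionCount returns the number of partitions for topic, which an
// SPF-style pipeline uses as its fixed consumer pod count (one per partition).
func partitionCount(topic string) (int, error) {
	client, err := sarama.NewClient([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		return 0, err
	}
	defer client.Close()

	partitions, err := client.Partitions(topic)
	if err != nil {
		return 0, err
	}
	return len(partitions), nil
}
```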
Self-service platform with high reliability targets. The Heimdall API maintained greater than 99.95% uptime in H1 2023. End-to-end workflow success rate for self-service resource creation, change, and deletion held above 90% during the same period, even as monthly active users nearly quadrupled.
Chaos engineering for DR validation. A company-wide chaos engineering campaign in 2021 validated the robustness of the Kafka platform, including MirrorMaker2-based cross-region failover. Coban's Kafka demonstrated resilience across multiple chaos rounds.
Challenges and Solutions
Challenge: Cross-AZ network traffic consumed half the Kafka budget. In Grab's initial design, Kafka consumers fetched from partition leaders by default. With three AZs and partition leaders distributed across them, any given consumer had a 67% chance of fetching from a different AZ. At Grab's scale, this consumer-side cross-AZ traffic dominated the bill and represented roughly 50% of total Kafka platform cost.
Root cause: the default Kafka consumer configuration fetches only from partition leaders, regardless of replica proximity.
Solution: Coban upgraded to Kafka 3.1, configured RackAwareReplicaSelector on brokers, and updated the internal Golang SDK to expose a single environment variable that enables closest-replica fetching, setting client.rack dynamically from EC2 instance metadata. Applied to all Coban-managed pipelines and Kafka Connect connectors.
Outcome: consumer cross-AZ traffic cost dropped to near zero. Side effect to watch: end-to-end latency increases by up to 500 ms for flows that now read from replicas rather than leaders, due to replication lag. Latency-sensitive pipelines are configured to continue leader fetching.
Challenge: Kafka broker crashes on Kubernetes required manual engineer intervention. When EKS worker nodes were unexpectedly terminated (hardware failure, AWS maintenance), the Kafka pod could not restart on the replacement node for two reasons: the NLB target groups still pointed to the dead node, and the Kubernetes PVC remained bound to the now-missing NVMe instance store PV. The cluster ran degraded (two of three brokers) until a Coban engineer manually reconfigured the target groups and deleted the zombie PVC.
Root cause: NVMe instance store volumes are local to the EC2 instance and cannot be reattached to a replacement node. Static Kubernetes PV provisioning meant the PVC stayed bound to the missing volume.
Solution: migrated broker storage from NVMe instance store to AWS EBS gp3 volumes, dynamically provisioned via the EBS CSI driver and managed by Strimzi. Integrated AWS Load Balancer Controller (LBC) for dynamic NLB target group mapping. Added AWS NTH in Queue Processor mode to gracefully drain nodes before termination.
Outcome: unexpected node termination no longer requires engineer intervention. The cluster self-heals automatically.
Challenge: Partition rebalancing took six hours and caused prolonged latency spikes. In traditional Kafka, partition reassignment moves data between broker nodes. During rebalancing events, this caused multi-hour periods of elevated latency and significant operational risk. Over-provisioning based on peak usage also meant significant resource waste during off-peak periods.
Root cause: Kafka's replication-based storage model couples partition data to specific broker nodes. Moving a partition requires copying potentially gigabytes of data between brokers.
Solution: adopted AutoMQ for a subset of clusters. AutoMQ uses a shared EBS WAL and S3 storage layer, where all brokers read from shared storage. Partition reassignment no longer requires moving data between brokers.
Outcome: partition migration time dropped from six hours to seconds. AutoMQ's self-balancing mechanism periodically triggers rebalancing without the risk associated with traditional partition movement. 3x throughput improvement reported.
Challenge: Reactive HPA scaling spirals for Flink pipelines. HPA-triggered Flink restarts caused CPU and latency spikes. Those spikes triggered further HPA scale events, creating feedback loops that destabilised pipelines. The problem was compounded by the fact that Flink's Kafka connector has fixed parallelism bounded by partition count, so horizontal scaling was often impossible, leaving only vertical scaling as an option.
Root cause: reactive scaling based on CPU or latency thresholds can amplify instability rather than resolve it, particularly for Kafka-bound Flink jobs where restart itself causes temporary load spikes.
Solution: ML-based predictive vertical autoscaler. A time-series model forecasts Kafka source topic throughput (which follows seasonal patterns). A regression model maps throughput to required CPU. TaskManager CPU is adjusted ahead of demand spikes.
Outcome: deployed to the majority of applicable Flink pipelines; more than 35% reduction in cloud CPU cost; scaling spirals eliminated.
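The mapping step is conceptually simple. A toy Go sketch with invented coefficients (the real regression is fitted on historical load, and the real forecaster is a trained time-series model):

```go
package autoscaler

// Illustrative stand-in for the two-model design described above: a forecaster
// predicts source-topic throughput, and a fitted regression maps the forecast
// to TaskManager CPU, applied *before* the demand spike arrives.

// predictCPU maps a forecasted Kafka ingress rate (messages/sec) to the CPU
// (in cores) to request for the Flink TaskManagers, with headroom.
func predictCPU(forecastMsgPerSec float64) float64 {
	const (
		baseCores    = 0.5  // fixed overhead per TaskManager (invented)
		coresPerKMsg = 0.08 // fitted slope: cores per 1,000 msg/s (invented)
		headroom     = 1.2  // safety margin over the point forecast
	)
	return (baseCores + coresPerKMsg*forecastMsgPerSec/1000) * headroom
}
```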
Challenge: Invalid data silently propagated through Kafka to downstream consumers. Without effective data quality monitoring, bad data would enter Kafka topics and cascade to multiple downstream services before anyone detected the problem. Identifying the source, notifying the right team, and containing the blast radius was slow and manual.
Root cause: no systematic mechanism existed for declaring and testing data contracts on Kafka streams. Monitoring tools tracked throughput and lag but not data validity.
Solution: a data contract system where teams declare schemas and field-level semantic rules. FlinkSQL jobs auto-generated from these contracts consume production topics, detect violations in real time, and route errors to Grab's observability platform (Genchi), Slack, and S3 sinks. An LLM helps generate semantic rules to reduce the manual definition burden.
Outcome: the system now monitors more than 100 critical Kafka topics and can immediately identify and halt propagation of invalid data across multiple streams.
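A flavour of what one auto-generated FlinkSQL check might look like; the table, topic, fields, and rule names here are invented, and the real system wires violations to Genchi, Slack, and S3 sinks rather than a bare query:

```sql
-- Source topic registered as a Flink table (all names illustrative).
CREATE TABLE payment_events (
  payment_id STRING,
  amount     DECIMAL(18, 2),
  currency   STRING,
  created_at TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'payments.events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'json'
);

-- Semantic contract rule: payment amounts must be positive.
SELECT payment_id, 'amount_not_positive' AS violated_rule
FROM payment_events
WHERE amount <= 0;
```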
Challenge: Dual-write inconsistency between MySQL and Kafka. Services were writing state changes to both MySQL and a Kafka topic. Keeping these two writes atomic required two-phase commit, was non-trivial to implement correctly, and impacted service availability. Additionally, some downstream consumers needed before-and-after row snapshots that event-based producers could not easily supply.
Root cause: publishing to Kafka from application code forces producers to solve distributed transaction problems that are not their core concern.
Solution: Debezium CDC connectors running on Kafka Connect. Services write only to MySQL. Debezium reads MySQL binlogs and publishes changes to Kafka, including before-and-after row snapshots. DB schema changes, migrations, and outages are handled via Debezium's built-in binlog offset management.
Outcome: dual-write eliminated. GrabFood's Elasticsearch index synchronisation was one of the most significant early adopters.
Tech Stack
Apache Kafka on Amazon EKS with the Strimzi operator; AWS EBS gp3 storage via the EBS CSI driver; AWS NLB with the Load Balancer Controller and Node Termination Handler; Terraform and Atlantis for IaC; Apache Flink and the Go-based Stream Processing Framework for stream processing; Kafka Connect running Debezium and MirrorMaker2; Confluent Schema Registry with Protobuf SerDes; Apache Pinot for real-time analytics; Cruise Control for partition balancing; HashiCorp Vault PKI and Open Policy Agent for security; AutoMQ for high-elasticity clusters.
If you operate Kafka in production and want visibility into consumer lag, broker health, topic throughput, and rebalancing events without writing your own tooling, take a look at Kpow for Apache Kafka. It's built for engineering teams who need operational control without the overhead.