
How Grab uses Apache Kafka in production
Grab's engineering team built one of Southeast Asia's most sophisticated real-time data platforms on top of Apache Kafka, and then spent five years publishing exactly how they did it. The Coban team's body of work is unusually candid: cross-AZ traffic costs that consumed half their Kafka budget, broker crashes that required manual intervention, partition rebalancing that took six hours and caused prolonged latency spikes. What makes Grab's story worth studying is not just the scale, but the systematic way they worked through each operational constraint and wrote it up.
The platform processes more than 300 billion events per week at terabytes of ingress per hour. As of late 2023, the Coban control plane managed over 5,000 data streaming resources across every Grab vertical.
Company Overview
Grab is Southeast Asia's leading superapp, operating across mobility, food delivery, package delivery, payments, and financial services in eight countries. The company serves hundreds of millions of consumers across more than 400 cities.
Kafka sits at the centre of Grab's event-driven architecture. Every time a user books a ride, a series of events ripples through booking state machines, driver notification systems, rewards computation, personalisation pipelines, and analytics dashboards. The Coban team, whose name comes from a waterfall in Indonesia, was formed to build and operate the streaming infrastructure underneath all of it. Their mandate: provide a NoOps, managed platform for seamless, secure access to event streams in real time, for every team at Grab.
Key Kafka milestones:
- 2021: a company-wide chaos engineering campaign validates MirrorMaker2-based cross-region failover.
- Late 2023: the Coban control plane manages more than 5,000 data streaming resources across every Grab vertical.
- 2025: AutoMQ adopted for high-elasticity clusters, cutting partition migration from six hours to seconds.
- 2025: LLM-assisted data quality monitoring goes live, covering more than 100 critical Kafka topics.
Grab's Kafka Use Cases
Kafka at Grab is not a single-purpose bus. The Coban platform serves transactional and analytical use cases simultaneously, across every vertical in the company.
Real-time event sourcing is the foundational use case. Every booking, payment, food order, and driver status change produces events that fan out to multiple downstream consumers. Booking state machines, reward point computation, feed generation, and personalisation all run asynchronously from a centralised Kafka event log. Services apply changes from this log to their own state independently, without blocking each other.
Change Data Capture eliminated a dual-write problem that was widespread across Grab's MySQL-heavy backend. Services previously had to write to both MySQL and Kafka atomically, which required two-phase commit and created consistency risks. With Debezium connectors running on Kafka Connect, services write only to MySQL, and Coban's CDC pipeline captures binlog changes and publishes them to Kafka automatically, including before-and-after row snapshots. GrabFood's Elasticsearch index synchronisation was one of the most significant early adopters.
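A sketch of such a connector registration on Kafka Connect, using standard Debezium 2.x MySQL connector properties; the hostnames, credentials, topics, and table names below are purely illustrative, not Grab's actual configuration:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.orders.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<redacted>",
    "database.server.id": "184054",
    "topic.prefix": "orders",
    "table.include.list": "orders.bookings",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.orders"
  }
}
```

Each captured row change lands in Kafka as an envelope with `before` and `after` fields, which is exactly what consumers like index synchronisers need.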
Disaster recovery and cross-cluster migration are handled through MirrorMaker2 deployed on Kafka Connect. Critical services can fail over across AWS regions with zero message loss and seamless offset resumption. Coban validated this during a company-wide chaos engineering campaign in 2021. MM2 has also been used to migrate streams from self-managed Kafka clusters into the Coban platform with zero producer or consumer downtime.
Real-time analytics pipelines connect Kafka to Apache Pinot. Partner Gateway API metrics flow from Kafka producers through Flink (for serialisation and transformation) into Pinot topics, where they power sub-second time-series dashboards and anomaly detection for partners monitoring their API integrations.
Data lake ingestion is one of Coban's core responsibilities. The platform is the entry point to Grab's data lake, ingesting events from all services for storage and batch or streaming analysis downstream.
Stream processing pipelines for ML features and business intelligence run as either Apache Flink jobs or Go-based Stream Processing Framework (SPF) pipelines on Kubernetes. Time-windowed aggregations, stream joins, filtering, and mapping are all standard operations. For example, personalising the Grab app landing page requires counting a user's interactions with widget elements across a recent time window, pre-aggregated rather than computed on demand.
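A minimal Go sketch of that kind of tumbling-window count, using the open-source segmentio/kafka-go client in place of Grab's internal SDK; the broker address, topic, group ID, and window size are invented for illustration:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Plain consumer-group reader, standing in for an SPF pipeline's source.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},     // illustrative
		GroupID: "widget-impression-aggregator", // illustrative
		Topic:   "app.widget.impressions",       // illustrative
	})
	defer r.Close()

	const window = time.Minute
	counts := map[string]int{} // interactions per user in the current window
	deadline := time.Now().Add(window)

	for {
		ctx, cancel := context.WithDeadline(context.Background(), deadline)
		msg, err := r.ReadMessage(ctx)
		cancel()
		if err != nil {
			if !errors.Is(err, context.DeadlineExceeded) {
				log.Fatalf("read: %v", err)
			}
			// Window elapsed: emit the aggregate downstream and reset.
			log.Printf("window closed: %d active users", len(counts))
			counts = map[string]int{}
			deadline = time.Now().Add(window)
			continue
		}
		counts[string(msg.Key)]++ // message key assumed to carry the user ID
	}
}
```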
Real-time data quality monitoring was deployed in 2025. FlinkSQL jobs consume production Kafka topics, apply data contract rules (both syntactic and semantic), and halt propagation of invalid data before it cascades to downstream consumers. An LLM analyses Kafka stream schemas and anonymised sample data to recommend semantic validation rules, reducing the manual effort of defining per-field rules across hundreds of topics. As of late 2025, the system actively monitors more than 100 critical Kafka topics.
Scale
Grab's streaming platform operates at a scale that makes conventional approaches expensive and brittle: more than 300 billion events per week, terabytes of ingress per hour, and over 5,000 streaming resources under management across every vertical.
Architecture
Infrastructure layer. Grab's platform is hosted on AWS in a single region spanning three Availability Zones. All Kafka brokers and clients are distributed across these three AZs for high availability. Each Kafka partition has three replicas, distributed rack-aware, with one replica per AZ.
Kafka runs on Amazon EKS using the Strimzi operator. Each production Kafka cluster runs on a dedicated EKS cluster, with one Kafka broker per dedicated worker node, enforced via Kubernetes taints and tolerations. Storage uses AWS EBS gp3 volumes dynamically provisioned via the EBS CSI driver. This replaced an earlier design using NVMe instance store volumes, which could not survive worker node replacement without manual intervention.
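The one-broker-per-node placement can be expressed with a node taint on the dedicated Kafka workers plus a matching toleration in the Strimzi pod template. A sketch, with an invented taint key:

```yaml
# Taint applied to dedicated Kafka worker nodes (key/value invented):
#   kubectl taint nodes <node> dedicated=kafka:NoSchedule
# Matching toleration, as a fragment of the Strimzi Kafka resource:
spec:
  kafka:
    template:
      pod:
        tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
```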
Kafka clusters are accessible across VPC boundaries via AWS Network Load Balancers, with each broker advertised by a private zonal DNS name and a distinct TCP port. This enables deterministic per-broker connections required by Kafka producers and consumers. For cross-account access (for example, GrabKios in its own AWS account), a VPC Endpoint Service is used in place of VPC peering.
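On the broker side, this scheme boils down to per-broker advertised listeners pointing at the NLB. A sketch with invented DNS names and ports (under Strimzi, the operator generates this from the cluster's listener configuration rather than hand-written properties files):

```properties
# server.properties for broker 0, resolved via its private zonal DNS name
advertised.listeners=CLIENT://b-0.cluster-a.az1.grab.internal:9094

# server.properties for broker 1 — same DNS pattern, distinct TCP port
advertised.listeners=CLIENT://b-1.cluster-a.az2.grab.internal:9095
```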
Control plane. The Coban control plane has three tiers: the Coban UI, a self-service frontend that shows real-time metrics for Kafka clusters and topic byte rates and integrates with Grab's monitoring stack; Heimdall, the REST API behind it; and Khone, the Git repository holding every streaming resource as Terraform code. Heimdall abstracts the Terraform layer from users who prefer a UI or REST API, while Khone provides the Git-based audit trail, peer-review workflow, and CI pipeline.
Stream processing layer. Two frameworks handle stream processing at Grab. Apache Flink is used for stateful stream processing, including the data quality monitoring system, Partner Gateway analytics, and the ML feature pipelines. The Go-based Stream Processing Framework (SPF) runs as Kafka consumer pods on Kubernetes and handles hundreds of simpler pipelines with filtering, mapping, aggregation, and windowing operations.
AutoMQ adoption (2025). For a subset of clusters requiring high elasticity, Coban adopted AutoMQ, a Kafka-compatible broker with a shared EBS WAL and S3 storage architecture. With AutoMQ, partition reassignment no longer requires moving data between brokers, reducing migration time from six hours to seconds and delivering a 3x throughput improvement.
Special Techniques
Rack-aware closest-replica fetching. By default, Kafka consumers fetch from partition leaders, which reside in a different AZ 67% of the time. This generated cross-AZ traffic that represented approximately 50% of total Kafka platform cost. Coban upgraded their legacy Kafka clusters to version 3.1 and configured replica.selector.class=RackAwareReplicaSelector on brokers, with broker.rack set to the AZ ID. On the consumer side, the internal Golang Kafka SDK was updated so that teams can enable closest-replica fetching by exporting a single environment variable, with the SDK dynamically setting client.rack from EC2 instance metadata at startup. The same logic was applied to Flink pipelines and Kafka Connect connectors. After rollout, consumer cross-AZ traffic cost dropped to near zero. The trade-off: fetching from the closest replica can add up to 500 ms of end-to-end latency due to replication lag, which is acceptable for most flows; latency-sensitive pipelines continue to fetch from partition leaders.
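A minimal Go sketch of the client-side opt-in, using the open-source IBM/sarama client as a stand-in for Grab's internal SDK; the environment variable name and the use of IMDSv1 are assumptions for illustration:

```go
package kafkaclient

import (
	"io"
	"net/http"
	"os"

	"github.com/IBM/sarama" // stand-in for Grab's internal Kafka SDK
)

// azID fetches the AZ ID (e.g. "use1-az1") from EC2 instance metadata.
// IMDSv1 shown for brevity; production code would use IMDSv2 tokens.
func azID() (string, error) {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/placement/availability-zone-id")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// NewConfig mirrors the opt-in described above: one environment variable
// (name invented here) enables closest-replica fetching by setting the
// client's rack to the local AZ ID, matching broker.rack on the brokers.
func NewConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	if os.Getenv("KAFKA_FETCH_FROM_CLOSEST_REPLICA") == "true" {
		if az, err := azID(); err == nil {
			cfg.RackID = az // this client's equivalent of client.rack
		}
	}
	return cfg
}
```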
Protobuf SerDes with Confluent Schema Registry. Grab standardised on Protobuf for all Kafka message serialisation and deserialisation. Confluent Schema Registry sits at the centre of the Kafka Connect ecosystem, enabling format conversions: Protobuf to Avro, Protobuf to JSON, and Protobuf to Parquet. For cross-cloud ingestion to Azure Event Hubs (which does not support Protobuf natively), Coban built an in-house converter that deserialises Protobuf bytes using a schema retrieved from the registry, traverses the message tree recursively, converts each field to a JSON node, and serialises the result.
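The converter's core can be sketched with Go's protobuf runtime: dynamicpb materialises a message from a registry-fetched descriptor, and protojson performs the recursive field-by-field JSON conversion that Grab's converter implements by hand. The descriptorForSchema helper is hypothetical, standing in for the Schema Registry lookup, and the payload is assumed to have Confluent's wire-format header already stripped:

```go
package converter

import (
	"errors"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/dynamicpb"
)

// descriptorForSchema is a hypothetical helper standing in for the Confluent
// Schema Registry lookup that resolves a message descriptor by schema ID.
func descriptorForSchema(schemaID int) (protoreflect.MessageDescriptor, error) {
	return nil, errors.New("not implemented: registry lookup")
}

// ProtobufToJSON deserialises Protobuf bytes against the registry schema and
// re-serialises them as JSON via a recursive walk over every field.
func ProtobufToJSON(schemaID int, payload []byte) ([]byte, error) {
	md, err := descriptorForSchema(schemaID)
	if err != nil {
		return nil, err
	}
	msg := dynamicpb.NewMessage(md)
	if err := proto.Unmarshal(payload, msg); err != nil {
		return nil, err
	}
	return protojson.Marshal(msg)
}
```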
MirrorMaker2 deployed per-connector on Kafka Connect. Rather than running the standard three-connector MM2 bundle, Coban deploys MirrorSourceConnector and MirrorCheckpointConnector independently on the Kafka Connect framework. This provides finer control over each connector and enables IaC management via Terraform, with offset mirroring handled separately so consumers can seamlessly resume from backup clusters.
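Deployed this way, each MM2 connector is just another Kafka Connect connector definition. A sketch of what a MirrorSourceConnector registration might look like, with invented cluster aliases, bootstrap servers, and topic filter; a MirrorCheckpointConnector is registered alongside it in the same fashion to mirror consumer group offsets:

```json
{
  "name": "mm2-source-primary-to-backup",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "primary",
    "target.cluster.alias": "backup",
    "source.cluster.bootstrap.servers": "primary-kafka:9092",
    "target.cluster.bootstrap.servers": "backup-kafka:9092",
    "topics": "payments.*"
  }
}
```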
Zero-trust mTLS and OPA authorisation. mTLS is enforced for all client-broker and broker-broker communications via Strimzi. Short-lived ephemeral certificates are issued by a HashiCorp Vault PKI engine, with a Root CA per environment signed down to cluster-level and client-level intermediate CAs. Open Policy Agent (OPA) is integrated with Kafka brokers to enforce topic-level, least-privilege authorisation: each client is whitelisted for the specific topics and permissions (produce or consume) it strictly needs.
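In Strimzi terms, requiring client certificates is a listener-level setting on the Kafka custom resource, roughly as in the fragment below (listener name and port illustrative); Grab layers the Vault-issued ephemeral certificates and the OPA authorizer on top of this baseline:

```yaml
# Fragment of a Strimzi Kafka resource: a TLS listener requiring client certs.
spec:
  kafka:
    listeners:
      - name: client
        port: 9094
        type: internal
        tls: true            # encrypt client-broker traffic
        authentication:
          type: tls          # mutual TLS: clients must present a certificate
```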
ML-based predictive autoscaling for Flink. A time-series forecasting model trained on Kafka source topic throughput predicts CPU demand ahead of time. A separate regression model maps the forecast to required TaskManager CPU. By scaling vertically before demand spikes rather than reacting after, the system avoids the reactive scaling spirals common with HPA. Deployed to the majority of applicable Flink pipelines, the system delivered more than 35% reduction in cloud CPU cost.
LLM-assisted data contract rule generation. The data quality monitoring system uses an LLM to analyse Kafka stream schemas and anonymised sample data, recommending semantic test rules that would be impractical to define manually at scale across hundreds of topics. FlinkSQL auto-generates the test definitions from the approved contracts.
Operating Practices
Infrastructure-as-Code for all streaming resources. Every Kafka topic, cluster, Kafka Connect connector, Flink pipeline, and CDC pipeline is declared as Terraform. Changes go through Git MRs, peer review, and CI pipeline application via Atlantis. The Coban UI and Heimdall API abstract this for users who prefer not to write Terraform directly, but the underlying system remains fully auditable and reproducible.
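Grab's internal Terraform modules are not public, but the shape of a topic declaration can be sketched with the community Mongey/kafka provider as a stand-in; the topic name and settings are invented:

```hcl
# Illustrative only: a Kafka topic declared as code, reviewed via Git MR and
# applied by CI (Atlantis), as described above.
resource "kafka_topic" "booking_events" {
  name               = "mobility.booking.events" # invented topic name
  partitions         = 32
  replication_factor = 3

  config = {
    "retention.ms"   = "86400000" # 1 day
    "cleanup.policy" = "delete"
  }
}
```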
Cruise Control for partition balancing. Cruise Control is deployed alongside each Kafka cluster for partition leader rebalancing. It is used during Kafka upgrades and rolling updates to maintain cluster balance.
AWS Node Termination Handler for graceful broker shutdown. NTH runs in Queue Processor mode, capturing ASG scale-in events, manual instance terminations, and EC2 maintenance events via SQS and EventBridge. When a node is about to be terminated, NTH cordons and drains it, which sends a SIGTERM to the Kafka pod and triggers a graceful shutdown. Strimzi's terminationGracePeriodSeconds=180 gives brokers enough time to migrate all partition leaders, typically around 60 seconds for 600 partition leaders. The result is that unexpected node termination no longer requires engineer intervention.
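The grace period lives in the Strimzi pod template; a fragment of the Kafka resource might look like this:

```yaml
# Fragment of a Strimzi Kafka resource: give brokers time to migrate
# partition leaders away before the pod is killed.
spec:
  kafka:
    template:
      pod:
        terminationGracePeriodSeconds: 180
```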
Per-AZ consumer load monitoring. After enabling closest-replica fetching, CPU utilisation became skewed across AZs because consumers in well-provisioned AZs handled a disproportionate share of reads. Coban updated their internal Kafka SDK to expose AZ as a metric, enabling operators to proactively rebalance consumers across zones.
VPA with fixed pod count for Kafka consumer pipelines. SPF pipelines match pod count to the number of partitions in the source Kafka topic, fetched at runtime via the Kafka API. Vertical Pod Autoscaler handles CPU and memory allocation per pod based on historical load trends. This approach delivered approximately 45% reduction in total resource usage versus resources requested, compared to the previous HPA-based approach.
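A sketch of the partition-count lookup using IBM/sarama (broker address illustrative); the returned count becomes the pipeline's fixed replica count, leaving VPA to size each pod:

```go
package spf

import "github.com/IBM/sarama"

// partitionCount returns the number of partitions for topic, which an
// SPF-style pipeline uses as its fixed consumer pod count (one per partition).
func partitionCount(topic string) (int, error) {
	client, err := sarama.NewClient([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		return 0, err
	}
	defer client.Close()

	partitions, err := client.Partitions(topic)
	if err != nil {
		return 0, err
	}
	return len(partitions), nil
}
```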
Self-service platform with high reliability targets. The Heimdall API maintained greater than 99.95% uptime in H1 2023. End-to-end workflow success rate for self-service resource creation, change, and deletion held above 90% during the same period, even as monthly active users nearly quadrupled.
Chaos engineering for DR validation. A company-wide chaos engineering campaign in 2021 validated the robustness of the Kafka platform, including MirrorMaker2-based cross-region failover. Coban's Kafka demonstrated resilience across multiple chaos rounds.
Challenges and Solutions
Challenge: Cross-AZ network traffic consumed half the Kafka budget. In Grab's initial design, Kafka consumers fetched from partition leaders by default. With three AZs and partition leaders distributed across them, any given consumer had a 67% chance of fetching from a different AZ. At Grab's scale, this consumer-side cross-AZ traffic dominated the bill and represented roughly 50% of total Kafka platform cost.
Root cause: the default Kafka consumer configuration fetches only from partition leaders, regardless of replica proximity.
Solution: Coban upgraded to Kafka 3.1, configured RackAwareReplicaSelector on brokers, and updated the internal Golang SDK to expose a single environment variable that enables closest-replica fetching, setting client.rack dynamically from EC2 instance metadata. Applied to all Coban-managed pipelines and Kafka Connect connectors.
Outcome: consumer cross-AZ traffic cost dropped to near zero. Side effect to watch: end-to-end latency increases by up to 500 ms for flows that now read from replicas rather than leaders, due to replication lag. Latency-sensitive pipelines are configured to continue leader fetching.
Challenge: Kafka broker crashes on Kubernetes required manual engineer intervention. When EKS worker nodes were unexpectedly terminated (hardware failure, AWS maintenance), the Kafka pod could not restart on the replacement node for two reasons: the NLB target groups still pointed to the dead node, and the Kubernetes PVC remained bound to the now-missing NVMe instance store PV. The cluster ran degraded (two of three brokers) until a Coban engineer manually reconfigured the target groups and deleted the zombie PVC.
Root cause: NVMe instance store volumes are local to the EC2 instance and cannot be reattached to a replacement node. Static Kubernetes PV provisioning meant the PVC stayed bound to the missing volume.
Solution: migrated broker storage from NVMe instance store to AWS EBS gp3 volumes, dynamically provisioned via the EBS CSI driver and managed by Strimzi. Integrated AWS Load Balancer Controller (LBC) for dynamic NLB target group mapping. Added AWS NTH in Queue Processor mode to gracefully drain nodes before termination.
Outcome: unexpected node termination no longer requires engineer intervention. The cluster self-heals automatically.
Challenge: Partition rebalancing took six hours and caused prolonged latency spikes. In traditional Kafka, partition reassignment moves data between broker nodes. During rebalancing events, this caused multi-hour periods of elevated latency and significant operational risk. Over-provisioning based on peak usage also meant significant resource waste during off-peak periods.
Root cause: Kafka's replication-based storage model couples partition data to specific broker nodes. Moving a partition requires copying potentially gigabytes of data between brokers.
Solution: adopted AutoMQ for a subset of clusters. AutoMQ uses a shared EBS WAL and S3 storage layer, where all brokers read from shared storage. Partition reassignment no longer requires moving data between brokers.
Outcome: partition migration time dropped from six hours to seconds. AutoMQ's self-balancing mechanism periodically triggers rebalancing without the risk associated with traditional partition movement. 3x throughput improvement reported.
Challenge: Reactive HPA scaling spirals for Flink pipelines. HPA-triggered Flink restarts caused CPU and latency spikes. Those spikes triggered further HPA scale events, creating feedback loops that destabilised pipelines. The problem was compounded by the fact that Flink's Kafka connector has fixed parallelism bounded by partition count, so horizontal scaling was often impossible, leaving only vertical scaling as an option.
Root cause: reactive scaling based on CPU or latency thresholds can amplify instability rather than resolve it, particularly for Kafka-bound Flink jobs where restart itself causes temporary load spikes.
Solution: ML-based predictive vertical autoscaler. A time-series model forecasts Kafka source topic throughput (which follows seasonal patterns). A regression model maps throughput to required CPU. TaskManager CPU is adjusted ahead of demand spikes.
Outcome: deployed to the majority of applicable Flink pipelines; more than 35% reduction in cloud CPU cost; scaling spirals eliminated.
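The mapping step is conceptually simple. A toy Go sketch with invented coefficients (the real regression is fitted on historical load, and the real forecaster is a trained time-series model):

```go
package autoscaler

// Illustrative stand-in for the two-model design described above: a forecaster
// predicts source-topic throughput, and a fitted regression maps the forecast
// to TaskManager CPU, applied *before* the demand spike arrives.

// predictCPU maps a forecasted Kafka ingress rate (messages/sec) to the CPU
// (in cores) to request for the Flink TaskManagers, with headroom.
func predictCPU(forecastMsgPerSec float64) float64 {
	const (
		baseCores    = 0.5  // fixed overhead per TaskManager (invented)
		coresPerKMsg = 0.08 // fitted slope: cores per 1,000 msg/s (invented)
		headroom     = 1.2  // safety margin over the point forecast
	)
	return (baseCores + coresPerKMsg*forecastMsgPerSec/1000) * headroom
}
```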
Challenge: Invalid data silently propagated through Kafka to downstream consumers. Without effective data quality monitoring, bad data would enter Kafka topics and cascade to multiple downstream services before anyone detected the problem. Identifying the source, notifying the right team, and containing the blast radius was slow and manual.
Root cause: no systematic mechanism existed for declaring and testing data contracts on Kafka streams. Monitoring tools tracked throughput and lag but not data validity.
Solution: a data contract system where teams declare schemas and field-level semantic rules. FlinkSQL jobs auto-generated from these contracts consume production topics, detect violations in real time, and route errors to Grab's observability platform (Genchi), Slack, and S3 sinks. An LLM helps generate semantic rules to reduce the manual definition burden.
Outcome: the system now monitors more than 100 critical Kafka topics and can immediately identify and halt propagation of invalid data across multiple streams.
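A flavour of what one auto-generated FlinkSQL check might look like; the table, topic, fields, and rule names here are invented, and the real system wires violations to Genchi, Slack, and S3 sinks rather than a bare query:

```sql
-- Source topic registered as a Flink table (all names illustrative).
CREATE TABLE payment_events (
  payment_id STRING,
  amount     DECIMAL(18, 2),
  currency   STRING,
  created_at TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'payments.events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'json'
);

-- Semantic contract rule: payment amounts must be positive.
SELECT payment_id, 'amount_not_positive' AS violated_rule
FROM payment_events
WHERE amount <= 0;
```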
Challenge: Dual-write inconsistency between MySQL and Kafka. Services were writing state changes to both MySQL and a Kafka topic. Keeping these two writes atomic required two-phase commit, was non-trivial to implement correctly, and impacted service availability. Additionally, some downstream consumers needed before-and-after row snapshots that event-based producers could not easily supply.
Root cause: publishing to Kafka from application code forces producers to solve distributed transaction problems that are not their core concern.
Solution: Debezium CDC connectors running on Kafka Connect. Services write only to MySQL. Debezium reads MySQL binlogs and publishes changes to Kafka, including before-and-after row snapshots. DB schema changes, migrations, and outages are handled via Debezium's built-in binlog offset management.
Outcome: dual-write eliminated. GrabFood's Elasticsearch index synchronisation was one of the most significant early adopters.
Tech Stack
Apache Kafka on Amazon EKS with the Strimzi operator; AWS EBS gp3 storage via the EBS CSI driver; AWS NLB with the Load Balancer Controller and Node Termination Handler; Terraform and Atlantis for IaC; Apache Flink and the Go-based Stream Processing Framework for stream processing; Kafka Connect running Debezium and MirrorMaker2; Confluent Schema Registry with Protobuf SerDes; Apache Pinot for real-time analytics; Cruise Control for partition balancing; HashiCorp Vault PKI and Open Policy Agent for security; AutoMQ for high-elasticity clusters.
If you operate Kafka in production and want visibility into consumer lag, broker health, topic throughput, and rebalancing events without writing your own tooling, take a look at Kpow for Apache Kafka. It's built for engineering teams who need operational control without the overhead.