
How Wix uses Apache Kafka in production
Wix runs Apache Kafka as the event backbone for more than 2,200 microservices, processing 66 billion messages every day across 50,000 topics and 500,000 partitions. The scale is notable, but the more interesting engineering story is how Wix achieved it: by building Greyhound, an open-source Kafka SDK that abstracts the native client for every service in the fleet, and by migrating the entire platform to Confluent Cloud in 2021 without taking a single service offline.
Company overview
Wix is a cloud-based website creation and hosting platform used by more than 200 million registered users worldwide. At its core, the platform is a collection of loosely coupled product surfaces: site builder, e-commerce, bookings, blog, CRM, and several hundred third-party app integrations. Serving those surfaces requires Wix's engineering organisation to coordinate over 2,200 microservices, deployed across 18 data centres and points of presence, spanning two cloud providers (Google Cloud and AWS) and four geographic regions.
Kafka became the communication layer for those services as the microservices architecture matured. By 2020, 1,500 services were exchanging events through Kafka. By the time Natan Silnitsky presented at Kafka Summit London in April 2022, that number had grown to 2,200+ services, 50,000 topics, and 500,000+ partitions, handling 15 billion business events per day. The 66 billion total daily message count includes the internal infrastructure traffic that flows through Kafka alongside business events.
Wix's Kafka use cases
Microservices event-driven communication
The primary use of Kafka at Wix is replacing synchronous RPC between services with asynchronous events. Rather than service A calling service B directly, A publishes a domain event to a Kafka topic and any interested downstream service subscribes. This decoupling allows Wix to evolve services independently and absorb traffic spikes without cascading failures across the dependency graph.
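The shape of the pattern is easy to show with the plain Kafka client. The sketch below is illustrative only: topic and event names are made up, and the payload is JSON-as-string for brevity, whereas Wix serialises everything with Protobuf through Greyhound.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SiteEventsPublisher {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  private val producer = new KafkaProducer[String, String](props)

  // Instead of calling service B over RPC, service A publishes a domain event;
  // any interested downstream service subscribes to the topic independently.
  def publishSitePublished(siteId: String): Unit = {
    val event = s"""{"event":"SitePublished","siteId":"$siteId"}"""
    producer.send(new ProducerRecord[String, String]("site-events", siteId, event))
  }
}
```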
Materialized views for query offloading
One concrete problem Kafka solved was query volume on Wix's MetaSite service, which holds metadata for every site on the platform: owner, version, and installed applications. Before Kafka-based projections, this single service was receiving over 1 million requests per minute from other services that needed a slice of that data.
The solution was to stream all MetaSite change events to a Kafka topic and let each downstream consumer build its own purpose-built read model. Services that previously queried MetaSite directly now maintain a local projection that is updated via Kafka, reducing their dependency to an eventual-consistency relationship rather than a real-time RPC call.
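A minimal sketch of such a projection consumer, assuming a string-keyed change topic (topic name, group ID, and payload shape are hypothetical; the real read models are purpose-built per service):

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.concurrent.TrieMap
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object MetaSiteProjection {
  // Local read model: siteId -> latest metadata payload for this service's needs.
  private val view = TrieMap.empty[String, String]

  def run(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "bookings-metasite-projection")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("metasite-changes"))

    while (true) {
      // Each change event replaces the local copy for that site, so reads
      // hit this in-memory view instead of the MetaSite service.
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        view.put(record.key(), record.value())
      }
    }
  }

  def siteMetadata(siteId: String): Option[String] = view.get(siteId)
}
```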
Change data capture
Wix uses Debezium connectors to capture row-level database changes and publish them as Kafka events. Services downstream can consume the CDC topic and produce their own clean domain event contracts rather than exposing raw database schemas as the inter-service API.
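A rough sketch of that translation step, under the assumption of string-encoded Debezium change events and a made-up topic layout (topic names, group ID, and the "op" check are illustrative, not Wix's actual contracts):

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

object ContactsCdcTranslator {
  def run(): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "contacts-cdc-translator")
    consumerProps.put("key.deserializer", classOf[StringDeserializer].getName)
    consumerProps.put("value.deserializer", classOf[StringDeserializer].getName)

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", classOf[StringSerializer].getName)
    producerProps.put("value.serializer", classOf[StringSerializer].getName)

    val cdc = new KafkaConsumer[String, String](consumerProps)
    val domainEvents = new KafkaProducer[String, String](producerProps)
    cdc.subscribe(Collections.singletonList("cdc.contacts_db.contacts"))

    while (true) {
      cdc.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        // Debezium change events expose the raw row before/after the change.
        // Re-publishing a clean domain event hides the table schema from
        // downstream services. Only inserts ("op":"c") are translated here.
        if (record.value() != null && record.value().contains("\"op\":\"c\"")) {
          val event = s"""{"event":"ContactCreated","contactId":"${record.key()}"}"""
          domainEvents.send(new ProducerRecord[String, String]("contact-events", record.key(), event))
        }
      }
    }
  }
}
```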
Protobuf-backed API contracts
At Wix, a Kafka topic schema is an API contract. All Kafka messages are serialised with Protobuf, and the .proto file is the canonical definition shared across Kafka, gRPC, and external-facing APIs. This means a new domain can be onboarded and integrated with surrounding services within a few days.
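For a sense of what "one .proto file as the contract" means in practice, a hypothetical definition might look like the following. None of the names below come from Wix; the point is that the same message type serves as the Kafka payload and the gRPC request/response type, so there is one schema to govern.

```protobuf
syntax = "proto3";

package example.sites;

// Hypothetical domain event: doubles as the Kafka message payload
// and the gRPC/HTTP response type.
message SitePublished {
  string site_id      = 1;
  string owner_id     = 2;
  int64  published_at = 3; // epoch millis
}

message GetLastPublishedRequest {
  string site_id = 1;
}

// A gRPC service reusing the exact same message definitions.
service SiteEvents {
  rpc GetLastPublished (GetLastPublishedRequest) returns (SitePublished);
}
```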
Real-time ML feature store
Wix rebuilt its online feature store around Kafka and Apache Flink. The system processes billions of events daily and maintains over 3,000 ML features with near-real-time update latency, feeding personalisation and recommendation models across the platform.
Global data mesh
Wix's platform spans four regions and two cloud providers. Rather than each service knowing where to find its upstreams, Kafka provides the transport layer for a global data mesh. Services publish to a logical topic namespace and Kafka handles the cross-region replication, meaning producers and consumers remain unaware of the underlying geographic topology.
Scale and throughput
The headline figures, as presented at Kafka Summit London in April 2022:
- 2,200+ microservices communicating over Kafka
- 50,000 topics and 500,000+ partitions
- 15 billion business events per day, roughly 66 billion total daily messages once internal infrastructure traffic is included
- 18 data centres and points of presence across Google Cloud and AWS in four regions
Wix's Kafka architecture
The Greyhound SDK
No Wix service talks to Kafka through the native client directly. Every JVM service goes through Greyhound, the open-source Kafka SDK that Wix's data streaming team built and maintains. Greyhound is written in Scala using the ZIO functional effects library and adds a higher-level interface on top of the native Kafka client, providing:
- Concurrent message processing: configurable parallelism per consumer handler, not limited by partition count (a rough sketch of this idea follows the list)
- Retry policies: first-class retry configuration on both consumers and producers, using dedicated retry topics with back-off rather than blocking partition processing
- Batch processing: optional batch consumption for throughput-sensitive pipelines
- Context propagation and metrics: distributed tracing context forwarded through events; Prometheus gauges emitted per topic-consumer-handler
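Greyhound's actual API is not reproduced here, but the core of the first bullet, processing more records concurrently than there are partitions while preserving per-key ordering, can be sketched with the plain Kafka client. Topic name, worker count, and the commit strategy are assumptions made for brevity.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}

object ParallelConsumerSketch {
  // Not Greyhound's implementation: a simplified illustration of handling
  // records with more worker threads than partitions.
  private val workers = 16
  private val pools = Vector.fill(workers)(
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor()))

  private def handle(record: ConsumerRecord[String, String]): Unit = {
    // business logic goes here
  }

  def run(consumer: KafkaConsumer[String, String]): Unit = {
    consumer.subscribe(Collections.singletonList("site-events"))
    while (true) {
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        // Route by key (assumed non-null) so records for the same key keep their
        // order, while different keys are processed on different threads.
        val pool = pools(Math.floorMod(record.key().hashCode, workers))
        Future(handle(record))(pool)
      }
      consumer.commitSync() // simplistic: real code must only commit processed offsets
    }
  }
}
```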
For non-JVM services, Greyhound ships as a sidecar proxy. Services that are not on the JVM send and receive via the Greyhound proxy over gRPC, which then handles the Kafka interaction on their behalf. This is also what made the Confluent Cloud migration possible at scale: routing logic and cluster addressing only had to be updated in Greyhound, not in each of the 2,200 services.
The Greyhound repository is available at github.com/wix/greyhound.
Schema management
Wix does not use Confluent Schema Registry. When Wix evaluated it, the registry did not fit their requirement to unify Kafka event schemas, gRPC service contracts, and external API definitions in a single consistent system. Instead, Wix's infrastructure and developer experience teams built a custom schema management platform backed by Protobuf. The platform provides automatic schema discovery, serialisation, validation, and compatibility enforcement across all three communication paradigms. The .proto file is the single source of truth for any data contract, regardless of whether the transport is Kafka, gRPC, or an HTTP API.
Deployment model
Wix runs on multi-cluster Confluent Cloud. Before 2021 the platform ran self-hosted Kafka clusters; the migration to managed Confluent is covered in the Challenges section below.
Stream processing
Apache Flink is used for stateful stream processing over Kafka topics. Wix runs Flink on Confluent Cloud in serverless mode, using Flink SQL for complex aggregations and feature transformations for the ML feature store. Flink reads from and writes to Kafka topics directly.
Producer architecture
Kafka messages are Protobuf-serialised, with Greyhound handling serialisation through its default serialiser, which accepts Protobuf-derived Scala case classes. Producer configuration is managed centrally through Greyhound rather than in individual services.
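As a hedged illustration of what a Protobuf value serializer looks like for the plain Kafka client (this is not Greyhound's default serialiser, which works with Protobuf-derived Scala case classes; here the Java protobuf Message type stands in):

```scala
import java.util
import com.google.protobuf.Message
import org.apache.kafka.common.serialization.Serializer

// Sketch of a Protobuf value serializer for the native client.
class ProtobufSerializer[T <: Message] extends Serializer[T] {
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()

  override def serialize(topic: String, data: T): Array[Byte] =
    if (data == null) null else data.toByteArray

  override def close(): Unit = ()
}
```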
Consumer architecture
Consumer group configuration, offset management, and retry topic routing are all managed through Greyhound. Lag monitoring is handled through the internal Kafka Control Plane, described in the Operations section. Greyhound emits a Prometheus gauge per topic-consumer-handler showing the currently longest-running handler in each pod, which feeds directly into lag dashboards.
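A rough sketch of what such a gauge involves, using the Prometheus Java simpleclient. The metric name, labels, and bookkeeping below are assumptions for illustration, not Greyhound's actual implementation.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._
import io.prometheus.client.Gauge

object HandlerLatencyGauge {
  // handlerId -> start time in millis for every handler currently running in this pod
  private val inFlight = new ConcurrentHashMap[String, java.lang.Long]()

  private val gauge = Gauge.build()
    .name("longest_running_handler_seconds")
    .help("Age of the longest currently running consumer handler in this pod")
    .labelNames("topic", "group")
    .register()

  def started(handlerId: String): Unit = inFlight.put(handlerId, System.currentTimeMillis())
  def finished(handlerId: String): Unit = inFlight.remove(handlerId)

  // Called on a timer (or at scrape time) to refresh the gauge value.
  def refresh(topic: String, group: String): Unit = {
    val now = System.currentTimeMillis()
    val oldest = inFlight.values().asScala.map(start => now - start.longValue).maxOption.getOrElse(0L)
    gauge.labels(topic, group).set(oldest / 1000.0)
  }
}
```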
Kafka Connect ecosystem
Wix uses Debezium as a Kafka Connect source connector for CDC, capturing database change events and publishing them as Kafka topics that downstream services can consume.
Special techniques and engineering innovations
gRPC fan-out proxy
High-volume topics consumed by many services were a meaningful driver of Confluent Cloud costs. Wix addressed this by building a push-based gRPC fan-out proxy. Instead of each subscribing service maintaining its own Kafka consumer for a shared topic, the proxy consumes each topic once and fans out delivery to N subscribers over gRPC streaming connections. This removed redundant consumer groups from Confluent and reduced Wix's Kafka infrastructure bill by 30%.
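The gRPC transport is the part that matters least for understanding the saving, so the sketch below leaves it out and shows only the fan-out core: a single Kafka consumer group per topic, pushing each record to every registered subscriber. In the real proxy the in-process queues would be gRPC server-streaming connections; names and types here are illustrative.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import java.util.concurrent.{CopyOnWriteArrayList, LinkedBlockingQueue}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object FanOutProxySketch {
  private val subscribers =
    new CopyOnWriteArrayList[LinkedBlockingQueue[ConsumerRecord[String, String]]]()

  // Each downstream service registers once and receives every record pushed to it.
  def subscribe(): LinkedBlockingQueue[ConsumerRecord[String, String]] = {
    val q = new LinkedBlockingQueue[ConsumerRecord[String, String]]()
    subscribers.add(q)
    q
  }

  def run(topic: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", s"fanout-proxy-$topic") // the only consumer group for this topic
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList(topic))
    while (true) {
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        subscribers.asScala.foreach(_.put(record)) // fan out to every subscriber
      }
    }
  }
}
```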
Chunked message delivery
Kafka's default maximum message size is 1 MB. Some Wix payloads, such as large site content objects, occasionally exceeded this limit. Rather than raising message.max.bytes cluster-wide, Wix engineer Amit Pe'er built a Chunks Producer/Consumer pattern. The producer splits an oversized message into chunks that each fit under the limit and sends each chunk as its own Kafka message, tagged with a message ID, chunk index, and total chunk count. The consumer persists arriving chunks in a local H2 database on disk, with retry logic that handles out-of-order arrival, and reassembles the original message before passing it to the handler. Oversized payloads travel through Kafka as a sequence of small messages, without any broker-side configuration change.
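A toy version of the splitting and reassembly logic follows. The real implementation persists fragments in H2 so a restart does not lose a half-received message; an in-memory map stands in for that here, and the Chunk envelope is a made-up shape.

```scala
import java.util.UUID
import scala.collection.mutable

object ChunksSketch {
  final case class Chunk(messageId: String, index: Int, total: Int, payload: Array[Byte])

  // Producer side: split an oversized payload into fragments under the size limit.
  def split(payload: Array[Byte], maxBytes: Int = 900 * 1024): Seq[Chunk] = {
    val messageId = UUID.randomUUID().toString
    val parts = payload.grouped(maxBytes).toVector
    parts.zipWithIndex.map { case (bytes, i) => Chunk(messageId, i, parts.size, bytes) }
  }

  // Consumer side: fragments may arrive out of order, so buffer until complete.
  private val pending = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]

  def onChunk(chunk: Chunk): Option[Array[Byte]] = {
    val received = pending.getOrElseUpdate(chunk.messageId, mutable.Map.empty)
    received(chunk.index) = chunk.payload
    if (received.size == chunk.total) {
      pending.remove(chunk.messageId)
      val ordered = (0 until chunk.total).map(received(_))
      Some(Array.concat(ordered: _*)) // reassembled original message
    } else None
  }
}
```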
Zero-downtime migration to Confluent Cloud
The 2021 migration from self-hosted Kafka to Confluent Cloud moved 10,000 topics and 100,000+ partitions serving 2,000 microservices without any service downtime. The approach relied on three decisions:
- Centralised routing in Greyhound: because all Kafka interaction goes through the SDK or proxy, cluster addressing only needed to change in one place
- Replication bridge: a dedicated replication service mirrored messages from the self-hosted cluster to Confluent Cloud while consumers were being cut over, so no messages were lost during the transition window
- Gradual per-topic cutover: topics were migrated in small batches rather than all at once, limiting blast radius and making per-topic rollback straightforward
Individual service owners made no code changes during the migration.
Operating Kafka at scale
Consumer lag tooling: TLLSR
Wix organises its Kafka operational tooling into five capabilities, described internally as TLLSR:
- Trace: follow a specific message through the processing pipeline to diagnose where it was delayed or dropped
- Lookup: retrieve a specific event by key from a topic, useful for investigating individual record issues
- Longest-Running: identify the slowest currently active consumer handler across the fleet, surfaced via the Prometheus gauge Greyhound emits per pod
- Skip: advance the consumer offset past a stuck message, used when a specific record is causing repeated processing failures and the data loss is acceptable
- Redistribute: rebalance events from a lagging partition across all partitions, resolving single-partition lag caused by hot key distribution
These capabilities are available through Wix's internal Kafka Control Plane UI, which includes a dedicated Consumers Lag View that developers use to investigate root causes across all 2,200 services from a single interface.
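To make the Skip capability concrete, here is roughly what that action has to do at the Kafka level: move the consumer group's committed offset past the stuck record. This is a generic AdminClient sketch, not Wix's Control Plane code; note that the offset can only be altered while the group has no active member on that partition.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{Admin, AdminClientConfig}
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition

object SkipStuckMessage {
  def skipTo(group: String, topic: String, partition: Int, nextOffset: Long): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = Admin.create(props)
    try {
      // Commit the offset of the record *after* the stuck one on the group's behalf.
      val offsets = Collections.singletonMap(
        new TopicPartition(topic, partition), new OffsetAndMetadata(nextOffset))
      admin.alterConsumerGroupOffsets(group, offsets).all().get()
    } finally admin.close()
  }
}
```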
Monitoring
Greyhound emits Prometheus metrics for each topic-consumer-handler pair, including the longest-running handler currently active in each pod. These feed into the lag dashboards in the Control Plane. The combination of per-pod handler metrics and the TLLSR tooling means developers can diagnose most consumer lag issues without needing to inspect broker-side metrics directly.
Schema governance
Schema compatibility is enforced automatically through Wix's custom schema platform. New schemas are reviewed before publishing, and breaking changes are caught at the tooling layer rather than at runtime. Because the same .proto files serve both Kafka and gRPC, schema governance covers the full inter-service communication surface in one process.
Challenges and how they solved them
Single-partition consumer lag
Problem: Events with poorly distributed keys concentrated all traffic on one partition, while adjacent partitions sat idle. The lagging partition accumulated a backlog that the consumer could not drain.
Root cause: Key design that did not distribute cardinality evenly across the partition space.
Solution: Wix built the Redistribute tool inside Greyhound, which takes events from the lagging partition and spreads them across all partitions. This restores the parallel processing that the consumer handler was configured for.
Outcome: Operators can resolve single-partition lag incidents through the Control Plane UI without code changes or consumer restarts.
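One way to express the redistribution idea with the plain client is to drain the hot partition's backlog and re-produce it without a key, so the partitioner spreads the records across all partitions. This is a simplified sketch, not the Greyhound tool: per-key ordering is given up for the redistributed records, and the question of skipping the originals is ignored.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, ByteArraySerializer}

object RedistributeSketch {
  def drain(topic: String, hotPartition: Int, fromOffset: Long, toOffset: Long): Unit = {
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
    consumerProps.put("value.deserializer", classOf[ByteArrayDeserializer].getName)
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", classOf[ByteArraySerializer].getName)
    producerProps.put("value.serializer", classOf[ByteArraySerializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](consumerProps)
    val producer = new KafkaProducer[Array[Byte], Array[Byte]](producerProps)
    val tp = new TopicPartition(topic, hotPartition)
    consumer.assign(Collections.singletonList(tp))
    consumer.seek(tp, fromOffset)

    var offset = fromOffset
    while (offset < toOffset) {
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        if (record.offset() < toOffset) {
          // Null key: the partitioner spreads the backlog across all partitions.
          producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, null, record.value()))
        }
        offset = record.offset() + 1
      }
    }
    producer.flush()
  }
}
```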
Slow consumer handlers blocking a partition
Problem: One slow or stuck consumer handler blocked all subsequent messages in the same partition. Because Kafka delivers messages in order within a partition, a handler that does not complete its work holds up everything behind it.
Root cause: An unexpected downstream slowdown, typically a slow database call or a downstream service timeout, that caused handler processing time to spike.
Solution: The Longest-Running monitor in the Control Plane surfaces the offending handler and pod. For cases where the stuck message cannot be re-processed successfully, the Skip tool advances the offset past it. For higher-value flows where skipping is not acceptable, Greyhound's retry policy routes the message to a retry topic, unblocking the main partition while the retry works in the background.
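The retry-topic part of that solution can be sketched in a few lines: a failed record is parked on a retry topic and the main partition keeps moving, while a separate consumer works the retry topic with back-off. The "<topic>-retry" naming and the commit strategy below are assumptions, not Greyhound's exact behaviour.

```scala
import java.time.Duration
import java.util.Collections
import scala.jdk.CollectionConverters._
import scala.util.{Failure, Success, Try}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RetryRouting {
  def process(
      consumer: KafkaConsumer[String, String],
      producer: KafkaProducer[String, String],
      topic: String)(handle: ConsumerRecord[String, String] => Unit): Unit = {
    consumer.subscribe(Collections.singletonList(topic))
    while (true) {
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        Try(handle(record)) match {
          case Success(_) => ()
          case Failure(_) =>
            // Park the failure on the retry topic instead of blocking the partition.
            producer.send(new ProducerRecord[String, String](s"$topic-retry", record.key(), record.value()))
        }
      }
      consumer.commitSync()
    }
  }
}
```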
Migrating 2,000 services to Confluent Cloud without downtime
Problem: Wix's self-hosted Kafka clusters had grown to 10,000 topics and 100,000+ partitions. Moving to managed Kafka without involving the owners of 2,000 individual services required a migration approach that was transparent to producers and consumers.
Root cause: Self-hosted operational overhead had grown to the point where the engineering investment in cluster management outweighed its benefits.
Solution: The Greyhound abstraction was the enabling factor. Because routing and cluster addressing were centralised in the SDK and proxy, the migration team updated one layer rather than 2,000 services. A replication bridge kept data flowing from old to new clusters during the cutover window, and per-topic gradual migration kept blast radius small.
Outcome: Zero downtime, no service code changes required.
Schema management across Kafka and gRPC
Problem: Wix needed to govern event schemas for Kafka topics, gRPC service contracts, and external APIs in a unified way. Confluent Schema Registry did not support this multi-paradigm requirement.
Root cause: Wix's architecture uses Kafka, gRPC, and HTTP as first-class transport protocols, and each had historically been managed separately, creating drift and duplication.
Solution: A custom schema management platform, backed by Protobuf, that treats the .proto file as the single authoritative schema for all three transports. Automatic discovery, compatibility checking, and developer tooling are built on top of this layer.
Outcome: A new domain can be onboarded and fully integrated with surrounding services in a few days, down from weeks when schemas were managed separately per protocol.
Kafka bill from fan-out consumption
Problem: High-volume topics consumed by many services created redundant consumer groups on Confluent Cloud, each paying for the same data repeatedly.
Root cause: Each subscribing service maintained its own Kafka consumer, which is the natural pattern for Kafka but expensive when the same high-volume topic has many subscribers.
Solution: A push-based gRPC fan-out proxy that consumes each high-volume topic once and delivers to all subscribers over gRPC streaming, eliminating the redundant consumer groups.
Outcome: 30% reduction in Kafka infrastructure costs.
Full tech stack
- Apache Kafka on multi-cluster Confluent Cloud, spanning Google Cloud and AWS across four regions
- Greyhound (Scala, ZIO) as the shared Kafka SDK, plus a gRPC sidecar proxy for non-JVM services
- Protobuf for all message schemas, governed by a custom schema management platform shared with gRPC and external APIs
- Apache Flink and Flink SQL for stateful stream processing and the ML feature store
- Debezium on Kafka Connect for change data capture
- gRPC for the sidecar proxy, the fan-out proxy, and service contracts
- Prometheus metrics and the internal Kafka Control Plane (TLLSR tooling) for monitoring and remediation
- H2 as the local store in the chunked message pattern
Key contributors
- Natan Silnitsky, Wix data streaming team: author of the engineering write-ups and Kafka Summit talks this article draws on
- Amit Pe'er: built the Chunks Producer/Consumer pattern for oversized messages
Key takeaways for your own Kafka implementation
- Centralise your Kafka client layer early. Wix's ability to migrate 2,000 services to Confluent Cloud without touching individual service code came directly from having a shared SDK that all services used. If each service had configured Kafka independently, the migration would have required coordinating changes across the entire engineering organisation. A shared client layer is worth the investment before you need to do something like this, not after.
- Treat Kafka topic schemas as API contracts, not implementation details. Wix uses the same Protobuf .proto file as the contract for Kafka events, gRPC calls, and external APIs. This eliminates schema drift between communication protocols and makes onboarding new domains faster because the tooling and governance are the same regardless of transport.
- Consumer lag tooling needs both visibility and remediation. Wix found that monitoring consumer lag is not enough on its own. Their TLLSR tooling pairs each monitoring capability (Trace, Lookup, Longest-Running) with a corresponding action (Skip, Redistribute). Without the remediation side, engineers can diagnose problems but still need to write ad-hoc scripts or perform manual offset management to resolve them.
- Fan-out patterns can meaningfully reduce managed Kafka costs. If multiple services consume the same high-volume topic, maintaining separate consumer groups for each scales your Confluent Cloud costs linearly with the number of subscribers. A push-based fan-out proxy, consuming each topic once, can reduce that cost significantly as Wix found with their 30% bill reduction.
- Migration to managed Kafka is feasible at large scale with the right abstraction. Moving 10,000 topics and 100,000+ partitions from self-hosted clusters to Confluent Cloud with zero downtime was made tractable by having a single layer where cluster addressing lived. Any organisation considering a similar move should audit how many places in their codebase hold Kafka connection configuration before starting.
Sources and further reading
- Natan Silnitsky — Using Apache Kafka as the event-driven system for 1,500 microservices at Wix (Wix Engineering / Medium, 2020)
- Natan Silnitsky — Migrating to a multi-cluster managed Kafka with 0 downtime (Wix Engineering / Medium, 2021)
- Natan Silnitsky — How Wix manages schemas for Kafka (and gRPC) used by 2,000 microservices (Wix Engineering / Medium, 2021)
- Natan Silnitsky — 5 lessons learned for successful migration to Confluent Cloud (Kafka Summit Americas 2021)
- Natan Silnitsky — Kafka-based global data mesh at Wix (Kafka Summit London 2022)
- Natan Silnitsky — Troubleshooting Kafka for 2,000 microservices at Wix (Wix Engineering / Medium, 2022)
- Natan Silnitsky — 6 event-driven architecture patterns, part 1 (Wix Engineering / Medium, 2021)
- Natan Silnitsky — 6 event-driven architecture patterns, part 2 (Wix Engineering / Medium, 2021)
- Amit Pe'er — Chunks producer/consumer: handling large messages within Kafka (Wix Engineering / Medium, 2021)
- Greyhound — rich Kafka client library (Wix GitHub, 2020–present)
If you want visibility into your own Kafka consumers, lag, and message throughput, Kpow gives you a single interface for monitoring and managing Kafka clusters. You can connect it to any cluster, including Confluent Cloud, and try it free for 30 days.