How Wix uses Apache Kafka in production

Factor House
May 16th, 2026

Wix runs Apache Kafka as the event backbone for more than 2,200 microservices, processing 66 billion messages every day across 50,000 topics and 500,000 partitions. The scale is notable, but the more interesting engineering story is how Wix achieved it: by building Greyhound, an open-source Kafka SDK that abstracts the native client for every service in the fleet, and by migrating the entire platform to Confluent Cloud in 2021 without taking a single service offline.

Company overview

Wix is a cloud-based website creation and hosting platform used by more than 200 million registered users worldwide. At its core, the platform is a collection of loosely coupled product surfaces: site builder, e-commerce, bookings, blog, CRM, and several hundred third-party app integrations. Serving those surfaces requires Wix's engineering organisation to coordinate over 2,200 microservices, deployed across 18 data centres and points of presence, spanning two cloud providers (Google Cloud and AWS) and four geographic regions.

Kafka became the communication layer for those services as the microservices architecture matured. By 2020, 1,500 services were exchanging events through Kafka. By the time Natan Silnitsky presented at Kafka Summit London in April 2022, that number had grown to 2,200+ services, 50,000 topics, and 500,000+ partitions, handling 15 billion business events per day. The 66 billion total daily message count includes the internal infrastructure traffic that flows through Kafka alongside business events.

| Milestone | Date |
| --- | --- |
| Greyhound open-sourced on GitHub (v0.0.5) | June 2020 |
| Engineering post: Kafka as event-driven system for 1,500 microservices | 2020 |
| Migration of 2,000 microservices to multi-cluster Confluent Cloud, zero downtime | 2021 |
| Kafka Summit Americas: "5 lessons learned migrating to Confluent Cloud" | 2021 |
| Kafka Summit London keynote + "Kafka-based global data mesh at Wix" talk | April 2022 |
| Scale reaches 2,200+ microservices, 50K topics, 500K+ partitions, 15B daily business events | 2022 |
| Online ML feature store rebuilt on Kafka and Flink, supporting 3,000+ features | 2023 |

Wix's Kafka use cases

Microservices event-driven communication

The primary use of Kafka at Wix is replacing synchronous RPC between services with asynchronous events. Rather than service A calling service B directly, A publishes a domain event to a Kafka topic and any interested downstream service subscribes. This decoupling allows Wix to evolve services independently and absorb traffic spikes without cascading failures across the dependency graph.
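
To make the decoupling concrete, here is a toy in-memory bus in Python — not Wix's actual code (Greyhound is Scala/ZIO), just an illustration of the pattern. `InMemoryBus`, the topic name, and the subscriber handlers are all invented for the example; in production the "bus" is a Kafka topic:

```python
from collections import defaultdict

class InMemoryBus:
    """Toy stand-in for a Kafka topic: subscribers register a handler per topic."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        # The producer does not know (or care) who consumes the event.
        for handler in self.handlers[topic]:
            handler(event)

bus = InMemoryBus()
received = []

# Services B and C subscribe independently; service A only publishes.
bus.subscribe("site-published", lambda e: received.append(("indexer", e["site_id"])))
bus.subscribe("site-published", lambda e: received.append(("notifier", e["site_id"])))

bus.publish("site-published", {"site_id": "abc123"})
```

Adding a third subscriber requires no change to the publisher, which is exactly the property that lets Wix evolve services independently.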

Materialized views for query offloading

One concrete problem Kafka solved was query volume on Wix's MetaSite service, which holds metadata for every site on the platform: owner, version, and installed applications. Before Kafka-based projections, this single service was receiving over 1 million requests per minute from other services that needed a slice of that data.

The solution was to stream all MetaSite change events to a Kafka topic and let each downstream consumer build its own purpose-built read model. Services that previously queried MetaSite directly now maintain a local projection updated via Kafka, trading a real-time RPC dependency for an eventually consistent local read.
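
A minimal sketch of the projection idea, with invented event shapes (the real MetaSite events are Protobuf messages): each consumer keeps only the slice of MetaSite state it needs and answers queries from that local copy instead of calling the service:

```python
# Hypothetical change events from a MetaSite-like topic.
change_events = [
    {"site_id": "s1", "field": "apps", "value": ["blog"]},
    {"site_id": "s2", "field": "owner", "value": "bob"},
    {"site_id": "s1", "field": "apps", "value": ["blog", "bookings"]},
]

projection = {}  # local, eventually consistent read model
for event in change_events:
    if event["field"] == "apps":  # ignore fields this service doesn't need
        projection[event["site_id"]] = event["value"]

# Queries now hit the local projection instead of an RPC call to MetaSite.
assert projection["s1"] == ["blog", "bookings"]
```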

Change data capture

Wix uses Debezium connectors to capture row-level database changes and publish them as Kafka events. Services downstream can consume the CDC topic and produce their own clean domain event contracts rather than exposing raw database schemas as the inter-service API.

Protobuf-backed API contracts

At Wix, a Kafka topic schema is an API contract. All Kafka messages are serialised with Protobuf, and the .proto file is the canonical definition shared across Kafka, gRPC, and external-facing APIs. This means a new domain can be onboarded and integrated with surrounding services within a few days.

Real-time ML feature store

Wix rebuilt its online feature store around Kafka and Apache Flink. The system processes billions of events daily and maintains over 3,000 ML features with near-real-time update latency, feeding personalisation and recommendation models across the platform.

Global data mesh

Wix's platform spans four regions and two cloud providers. Rather than each service knowing where to find its upstreams, Kafka provides the transport layer for a global data mesh. Services publish to a logical topic namespace and Kafka handles the cross-region replication, meaning producers and consumers remain unaware of the underlying geographic topology.

Scale and throughput

| Metric | Value |
| --- | --- |
| Total daily Kafka messages | 66 billion |
| Daily business events (data mesh) | 15 billion across 4 regions |
| Microservices using Kafka | 2,200+ |
| Topics | ~50,000 |
| Partitions | 500,000+ |
| Geographic regions | 4 |
| Data centres and POPs | 18 |
| Cloud providers | 2 (Google Cloud, AWS) |
| Developers interacting with Kafka daily | 900+ |

Wix's Kafka architecture

The Greyhound SDK

No Wix service talks to Kafka through the native client directly. Every JVM service goes through Greyhound, the open-source Kafka SDK that Wix's data streaming team built and maintains. Greyhound is written in Scala using the ZIO functional library and adds a higher-level interface on top of the native Kafka client, providing:

  • Concurrent message processing: configurable parallelism per consumer handler, not limited by partition count
  • Retry policies: first-class retry configuration on both consumers and producers, using dedicated retry topics with back-off rather than blocking partition processing
  • Batch processing: optional batch consumption for throughput-sensitive pipelines
  • Context propagation and metrics: distributed tracing context forwarded through events; Prometheus gauges emitted per topic-consumer-handler
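
The non-blocking retry behaviour can be sketched as follows — a hedged Python illustration, not Greyhound's Scala implementation. `produce_to_retry_topic` and `dead_letter` are hypothetical stand-ins for the real producers; the key point is that a failed record is re-published with back-off metadata instead of halting the partition:

```python
import time

MAX_ATTEMPTS = 3

def process(record, handler, produce_to_retry_topic, dead_letter):
    """Run the handler; on failure, route to a retry topic rather than block."""
    attempt = record.get("attempt", 0)
    try:
        handler(record["payload"])
    except Exception:
        if attempt + 1 < MAX_ATTEMPTS:
            produce_to_retry_topic({
                "payload": record["payload"],
                "attempt": attempt + 1,
                "retry_after": time.time() + 2 ** attempt,  # exponential back-off
            })
        else:
            dead_letter(record)  # give up after MAX_ATTEMPTS

retries, dead = [], []
process({"payload": "evt-1"}, handler=lambda p: 1 / 0,
        produce_to_retry_topic=retries.append, dead_letter=dead.append)
```

The main partition keeps moving while a dedicated retry consumer picks up the retry topic after the back-off elapses.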

For non-JVM services, Greyhound ships as a sidecar proxy. Services that are not on the JVM send and receive via the Greyhound proxy over gRPC, which then handles the Kafka interaction on their behalf. This is also what made the Confluent Cloud migration possible at scale: routing logic and cluster addressing only had to be updated in Greyhound, not in each of the 2,200 services.

The Greyhound repository is available at github.com/wix/greyhound.

Schema management

Wix does not use Confluent Schema Registry. When Wix evaluated it, the registry did not fit their requirement to unify Kafka event schemas, gRPC service contracts, and external API definitions in a single consistent system. Instead, Wix's infrastructure and developer experience teams built a custom schema management platform backed by Protobuf. The platform provides automatic schema discovery, serialisation, validation, and compatibility enforcement across all three communication paradigms. The .proto file is the single source of truth for any data contract, regardless of whether the transport is Kafka, gRPC, or an HTTP API.

Deployment model

Wix runs on multi-cluster Confluent Cloud. Before 2021 the platform ran self-hosted Kafka clusters; the migration to managed Confluent is covered in the Challenges section below.

Stream processing

Apache Flink is used for stateful stream processing over Kafka topics. Wix runs Flink on Confluent Cloud in serverless mode, using Flink SQL for complex aggregations and feature transformations for the ML feature store. Flink reads from and writes to Kafka topics directly.

Producer architecture

Kafka messages are Protobuf-serialised, with Greyhound handling serialisation through its default serialiser, which accepts Protobuf-derived Scala case classes. Producer configuration is managed centrally through Greyhound rather than in individual services.

Consumer architecture

Consumer group configuration, offset management, and retry topic routing are all managed through Greyhound. Lag monitoring is handled through the internal Kafka Control Plane, described in the Operations section. Greyhound emits a Prometheus gauge per topic-consumer-handler showing the currently longest-running handler in each pod, which feeds directly into lag dashboards.

Kafka Connect ecosystem

Wix uses Debezium as a Kafka Connect source connector for CDC, capturing database change events and publishing them as Kafka topics that downstream services can consume.

Special techniques and engineering innovations

gRPC fan-out proxy

High-volume topics consumed by many services were a meaningful driver of Confluent Cloud costs. Wix addressed this by building a push-based gRPC fan-out proxy. Instead of each subscribing service maintaining its own Kafka consumer for a shared topic, the proxy consumes each topic once and fans out delivery to N subscribers over gRPC streaming connections. This removed redundant consumer groups from Confluent and reduced Wix's Kafka infrastructure bill by 30%.
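
The shape of the proxy can be sketched as follows — an illustrative Python reduction, where a queue per subscriber stands in for a gRPC streaming connection. The essential property is that Kafka sees one consumer, however many subscribers exist:

```python
from queue import Queue

class FanOutProxy:
    """One logical Kafka consumer per topic; N subscribers receive copies,
    so the managed cluster sees a single consumer group instead of N."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        q = Queue()
        self.subscribers.append(q)
        return q  # stand-in for a gRPC server-streaming connection

    def on_record(self, record):
        # Called once per record consumed from Kafka; fan out to everyone.
        for q in self.subscribers:
            q.put(record)

proxy = FanOutProxy()
a, b = proxy.subscribe(), proxy.subscribe()
proxy.on_record({"offset": 7, "value": "hello"})
```

The trade-off is that delivery guarantees and per-subscriber offset tracking move from Kafka into the proxy, which is why this pattern suits high-volume, many-subscriber topics rather than every topic.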

Chunked message delivery

Kafka's default maximum message size is 1 MB. Some Wix payloads, such as large site content objects, occasionally exceeded this limit. Rather than raising message.max.bytes cluster-wide, Wix engineer Amit Pe'er built a Chunks Producer/Consumer pattern. The producer splits an oversized message into fragments, persists each fragment to a local H2 database on disk, and sends only the chunk IDs over Kafka. The consumer retrieves chunks by ID, with retry logic that handles out-of-order arrival, and reassembles the original message before passing it to the handler. The Kafka topic carries only small identifier messages; the bulk data moves through the local disk store.
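
A stripped-down sketch of the idea, with a Python dict standing in for the on-disk H2 chunk store and a tiny chunk size so the split is visible (the function names and message shape are invented for the example):

```python
import uuid

CHUNK_SIZE = 4  # tiny for the example; the real limit is ~1 MB

local_store = {}  # stand-in for the on-disk H2 chunk store

def produce_chunked(payload: bytes):
    """Split an oversized payload, persist the chunks locally, and return
    the small ID-only message that actually travels over Kafka."""
    msg_id = str(uuid.uuid4())
    chunk_ids = []
    for i in range(0, len(payload), CHUNK_SIZE):
        chunk_id = f"{msg_id}:{i // CHUNK_SIZE}"
        local_store[chunk_id] = payload[i:i + CHUNK_SIZE]
        chunk_ids.append(chunk_id)
    return {"msg_id": msg_id, "chunk_ids": chunk_ids}

def consume_chunked(kafka_message) -> bytes:
    # Fetch chunks by ID (order preserved by the ID list) and reassemble.
    return b"".join(local_store[cid] for cid in kafka_message["chunk_ids"])

msg = produce_chunked(b"a payload bigger than the limit")
assert consume_chunked(msg) == b"a payload bigger than the limit"
```

The real implementation adds the retry logic for chunks that have not yet arrived at the consumer's store; this sketch omits that to show only the split/reassemble flow.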

Zero-downtime migration to Confluent Cloud

The 2021 migration from self-hosted Kafka to Confluent Cloud moved 10,000 topics and 100,000+ partitions serving 2,000 microservices without any service downtime. The approach relied on three decisions:

  1. Centralised routing in Greyhound: because all Kafka interaction goes through the SDK or proxy, cluster addressing only needed to change in one place
  2. Replication bridge: a dedicated replication service mirrored messages from the self-hosted cluster to Confluent Cloud while consumers were being cut over, so no messages were lost during the transition window
  3. Gradual per-topic cutover: topics were migrated in small batches rather than all at once, limiting blast radius and making per-topic rollback straightforward

Individual service owners made no code changes during the migration.

Operating Kafka at scale

Consumer lag tooling: TLLSR

Wix organises its Kafka operational tooling into five capabilities, described internally as TLLSR:

  • Trace: follow a specific message through the processing pipeline to diagnose where it was delayed or dropped
  • Lookup: retrieve a specific event by key from a topic, useful for investigating individual record issues
  • Longest-Running: identify the slowest currently active consumer handler across the fleet, surfaced via the Prometheus gauge Greyhound emits per pod
  • Skip: advance the consumer offset past a stuck message, used when a specific record is causing repeated processing failures and the data loss is acceptable
  • Redistribute: rebalance events from a lagging partition across all partitions, resolving single-partition lag caused by hot key distribution
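
The Redistribute idea fits in a few lines — a hypothetical Python illustration, not Greyhound's implementation: events drained from the hot partition are re-published round-robin across the full partition range, after which the normal parallel consumers take over:

```python
from itertools import cycle

def redistribute(backlog, partition_count):
    """Spread a hot partition's backlog across all partitions round-robin,
    restoring the parallelism the consumer was configured for."""
    targets = cycle(range(partition_count))
    return [(next(targets), event) for event in backlog]

reassigned = redistribute(["e1", "e2", "e3", "e4", "e5"], partition_count=3)
assert [p for p, _ in reassigned] == [0, 1, 2, 0, 1]
```

Note that redistributing deliberately gives up per-key ordering for the affected backlog, which is why it is an operator-invoked remediation rather than a default behaviour.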

These capabilities are available through Wix's internal Kafka Control Plane UI, which includes a dedicated Consumers Lag View that developers use to investigate root causes across all 2,200 services from a single interface.

Monitoring

Greyhound emits Prometheus metrics for each topic-consumer-handler pair, including the longest-running handler currently active in each pod. These feed into the lag dashboards in the Control Plane. The combination of per-pod handler metrics and the TLLSR tooling means developers can diagnose most consumer lag issues without needing to inspect broker-side metrics directly.

Schema governance

Schema compatibility is enforced automatically through Wix's custom schema platform. New schemas are reviewed before publishing, and breaking changes are caught at the tooling layer rather than at runtime. Because the same .proto files serve both Kafka and gRPC, schema governance covers the full inter-service communication surface in one process.

Challenges and how they solved them

Single-partition consumer lag

Problem: Events with poorly distributed keys concentrated all traffic on one partition, while adjacent partitions sat idle. The lagging partition accumulated a backlog that the consumer could not drain.

Root cause: Key design that did not distribute cardinality evenly across the partition space.

Solution: Wix built the Redistribute tool inside Greyhound, which takes events from the lagging partition and spreads them across all partitions. This restores the parallel processing that the consumer handler was configured for.

Outcome: Operators can resolve single-partition lag incidents through the Control Plane UI without code changes or consumer restarts.

Slow consumer handlers blocking a partition

Problem: One slow or stuck consumer handler blocked all subsequent messages in the same partition. Because Kafka delivers messages in order within a partition, a handler that does not complete its work holds up everything behind it.

Root cause: An unexpected downstream slowdown, typically a slow database call or a downstream service timeout, that caused handler processing time to spike.

Solution: The Longest-Running monitor in the Control Plane surfaces the offending handler and pod. For cases where the stuck message cannot be re-processed successfully, the Skip tool advances the offset past it. For higher-value flows where skipping is not acceptable, Greyhound's retry policy routes the message to a retry topic, unblocking the main partition while the retry works in the background.

Migrating 2,000 services to Confluent Cloud without downtime

Problem: Wix's self-hosted Kafka clusters had grown to 10,000 topics and 100,000+ partitions. Moving to managed Kafka without involving the owners of 2,000 individual services required a migration approach that was transparent to producers and consumers.

Root cause: Self-hosted operational overhead had grown to the point where the engineering investment in cluster management outweighed its benefits.

Solution: The Greyhound abstraction was the enabling factor. Because routing and cluster addressing were centralised in the SDK and proxy, the migration team updated one layer rather than 2,000 services. A replication bridge kept data flowing from old to new clusters during the cutover window, and per-topic gradual migration kept blast radius small.

Outcome: Zero downtime, no service code changes required.

Schema management across Kafka and gRPC

Problem: Wix needed to govern event schemas for Kafka topics, gRPC service contracts, and external APIs in a unified way. Confluent Schema Registry did not support this multi-paradigm requirement.

Root cause: Wix's architecture uses Kafka, gRPC, and HTTP as first-class transport protocols, and each had historically been managed separately, creating drift and duplication.

Solution: A custom schema management platform, backed by Protobuf, that treats the .proto file as the single authoritative schema for all three transports. Automatic discovery, compatibility checking, and developer tooling are built on top of this layer.

Outcome: A new domain can be onboarded and fully integrated with surrounding services in a few days, down from weeks when schemas were managed separately per protocol.

Kafka bill from fan-out consumption

Problem: High-volume topics consumed by many services created redundant consumer groups on Confluent Cloud, each paying for the same data repeatedly.

Root cause: Each subscribing service maintained its own Kafka consumer, which is the natural pattern for Kafka but expensive when the same high-volume topic has many subscribers.

Solution: A push-based gRPC fan-out proxy that consumes each high-volume topic once and delivers to all subscribers over gRPC streaming, eliminating the redundant consumer groups.

Outcome: 30% reduction in Kafka infrastructure costs.

Full tech stack

| Category | Tool | Role |
| --- | --- | --- |
| Message broker | Apache Kafka (Confluent Cloud, multi-cluster) | Core event streaming platform; multi-region deployment across Google Cloud and AWS |
| Client SDK | Greyhound (Wix open source, Scala/ZIO) | High-level Kafka SDK for all JVM services; sidecar proxy for non-JVM services |
| Stream processing | Apache Flink (Confluent Cloud serverless) | Stateful stream processing for the online ML feature store; Flink SQL for aggregations |
| Serialisation | Protobuf | Universal message format across Kafka, gRPC, and external APIs |
| Schema management | Custom internal platform | Unified Protobuf schema governance for Kafka, gRPC, and HTTP API contracts |
| CDC connector | Debezium | Captures database change events and publishes them to Kafka topics |
| Large-message storage | H2 (embedded) | On-disk chunk store for the Chunks Producer/Consumer pattern |
| Metrics | Prometheus | Consumer handler lag metrics emitted by Greyhound, per topic-consumer-handler per pod |
| Inter-service delivery | gRPC | Fan-out delivery from the Kafka proxy to downstream subscriber services |
| Infrastructure | Google Cloud + AWS | Multi-cloud hosting for Kafka clusters and compute across 4 regions |

Key contributors

| Name | Role | Contribution |
| --- | --- | --- |
| Natan Silnitsky | Backend-Infra Engineer, Data Streaming Team | Primary architect of Wix's Kafka infrastructure; creator and lead of Greyhound; presenter at Kafka Summit Americas 2021 and Kafka Summit London 2022. Author of most of the Wix Kafka engineering blog series. |
| Avi Perez | Engineering leader, Wix | Featured in the Kafka Summit London 2022 keynote presenting Wix's approach to modern data flow and data pipelines. |
| Amit Pe'er | Engineer, Wix | Author of the Chunks Producer/Consumer pattern for handling oversized Kafka messages. |

Key takeaways for your own Kafka implementation

  • Centralise your Kafka client layer early. Wix's ability to migrate 2,000 services to Confluent Cloud without touching individual service code came directly from having a shared SDK that all services used. If each service had configured Kafka independently, the migration would have required coordinating changes across the entire engineering organisation. A shared client layer is worth the investment before you need to do something like this, not after.
  • Treat Kafka topic schemas as API contracts, not implementation details. Wix uses the same Protobuf .proto file as the contract for Kafka events, gRPC calls, and external APIs. This eliminates schema drift between communication protocols and makes onboarding new domains faster because the tooling and governance are the same regardless of transport.
  • Consumer lag tooling needs both visibility and remediation. Wix found that monitoring consumer lag is not enough on its own. Their TLLSR tooling pairs each monitoring capability (Trace, Lookup, Longest-Running) with a corresponding action (Skip, Redistribute). Without the remediation side, engineers can diagnose problems but still need to write ad-hoc scripts or perform manual offset management to resolve them.
  • Fan-out patterns can meaningfully reduce managed Kafka costs. If multiple services consume the same high-volume topic, maintaining separate consumer groups for each scales your Confluent Cloud costs linearly with the number of subscribers. A push-based fan-out proxy, consuming each topic once, can reduce that cost significantly as Wix found with their 30% bill reduction.
  • Migration to managed Kafka is feasible at large scale with the right abstraction. Moving 10,000 topics and 100,000+ partitions from self-hosted clusters to Confluent Cloud with zero downtime was made tractable by having a single layer where cluster addressing lived. Any organisation considering a similar move should audit how many places in their codebase hold Kafka connection configuration before starting.

Sources and further reading

If you want visibility into your own Kafka consumers, lag, and message throughput, Kpow gives you a single interface for monitoring and managing Kafka clusters. You can connect it to any cluster, including Confluent Cloud, and try it free for 30 days.