How Spotify used Apache Kafka in production

Factor House
May 11th, 2026

Between roughly 2013 and 2017, Spotify ran one of the more thoroughly documented Kafka deployments in the industry. At peak, their event delivery system processed 700,000 events per second across five datacenters and fed every major feature on the platform — Discover Weekly, Year in Music, Spotify Wrapped, and Billboard's streaming charts all relied on data moving through Kafka. In 2017, Spotify decommissioned their Kafka infrastructure entirely, migrating to Google Cloud Pub/Sub as part of a broader move to Google Cloud Platform. What makes this story worth reading is what happened in between: the architectural decisions Spotify made, the specific failure modes they encountered at scale, and the engineering trade-offs that led them to walk away from Kafka for a managed alternative.

Company overview

Spotify is a music and podcast streaming service founded in Stockholm in 2006. By 2015, when much of the documented Kafka work was underway, the platform had 60 million monthly active users, 30 million songs, and 1.5 billion playlists. By early 2016, that had grown to 100 million monthly active users across 61 markets.

The company adopted Kafka around 2013 to replace an ad-hoc log-shipping system that tailed files from service hosts and sent them toward a central Hadoop cluster. Event volume was growing fast enough that iterating on the existing system was no longer practical, and the team needed a message bus that could handle cross-datacenter delivery at scale.

Key Kafka milestones:

Date | Event
~2013 | Adopted Apache Kafka 0.7 for event delivery
January 2015 | Published Storm/Kafka personalization architecture — 3 billion+ events/day
2015 | Production peak: 700,000 events/second across 5 datacenters
2015–2016 | Evaluated Kafka 0.8 with MirrorMaker; encountered data loss and producer failure issues
March 2016 | Announced migration to Google Cloud Pub/Sub; load-tested replacement at 2 million messages/second
February 2017 | Kafka system fully decommissioned; Cloud Pub/Sub live in production
Q1 2019 | Post-migration: 8 million events/second peak, 500 billion events/day, 350+ TB raw data/day
October 2021 | EDI v3: 600+ event types, 70 TB/day compressed, Dataflow replacing Dataproc

Spotify's Kafka use cases

Kafka sat at the centre of Spotify's event delivery infrastructure (EDI) and served as the source of truth for all user interaction data. Every time a user played a song, skipped a track, viewed an ad, or performed a search, the client emitted an event that made its way through a Kafka producer, into a broker, and out to one or more downstream consumers.

Event delivery for analytics and features: The primary pipeline delivered event data to a central Hadoop cluster, where hourly Crunch/MapReduce ETL jobs converted raw tab-separated text into Apache Avro format for downstream analytics. This data fed Discover Weekly, Year in Music, and Spotify Wrapped, as well as Billboard's streaming charts and the internal A/B testing infrastructure.

Real-time personalization via Apache Storm: In parallel with the batch path, Apache Storm topologies consumed from Kafka topics in real time. Topics carried event types including song completions and ad impressions. Each Storm topology subscribed to relevant topics, enriched events with entity metadata fetched from Cassandra (for example, attaching the genre of a completed song), grouped events by user, and computed user attributes for recommendation and ad-targeting models.

Log aggregation across data centers: Spotify operated four owned data centers plus capacity on Google Cloud Platform during this period. Kafka served as the collection point for service logs from all hosts, with a custom component handling cross-datacenter batching and forwarding.

Ownership was split along functional lines: the event delivery team owned the Kafka infrastructure and producer daemons, while feature teams (personalization, ads, data engineering) owned the Storm topologies and downstream consumers.

Scale and throughput

During the Kafka era, production throughput peaked at 700,000 events per second. The team load-tested a replacement system at 2,000,000 events per second — roughly triple their production peak — before committing to the migration.

The Storm cluster processing real-time events from Kafka consisted of 6 nodes with 24 cores per host and processed over 3 billion events per day by January 2015. Spotify ran this infrastructure across five datacenters simultaneously, with cross-datacenter forwarding handled by the Grouper component described below.

After the migration to Google Cloud Pub/Sub, the same event volume continued to grow:

Period | Metric | Value
2014–2015 (Kafka) | Peak production throughput | 700,000 events/second
2015 (Kafka) | Real-time pipeline daily volume | 3+ billion events/day
April 2017 (Pub/Sub) | Daily event volume | 100 billion events/day
Q1 2019 (Pub/Sub) | Peak throughput | 8 million events/second
Q1 2019 (Pub/Sub) | Daily raw data | 350+ TB/day
Q1 2019 (Pub/Sub) | Daily event volume | 500 billion events/day
2021 (Pub/Sub + Dataflow) | Distinct event types | 600+
2021 (Pub/Sub + Dataflow) | Daily compressed data | 70 TB/day

The post-migration growth is notable: peak throughput went from 700K to 8 million events/second — more than an 11x increase — in roughly two years, without Kafka in the stack.

Spotify's Kafka architecture

High-level architecture

The event delivery system had four principal components: a Syslog Producer on every service host, Kafka Brokers receiving those events, custom Grouper services for cross-datacenter forwarding, and an HDFS Consumer persisting data to Hadoop.

Apache Storm topologies ran alongside this, consuming from Kafka topics directly for real-time processing, with Cassandra serving as the metadata and user-profile store downstream.

Producer architecture

A daemon called the Kafka Syslog Producer ran on every service host. Its role was to tail log files and forward log lines to the Kafka cluster. Events at this stage were treated as unstructured lines regardless of type — there was no schema enforcement at the producer level. The producer maintained a checkpoint (end-of-file marker) per log file to track delivery state; missing markers after host failure required manual intervention.

Batching and compression were applied at the Grouper stage rather than the producer, which simplified the per-host daemon but pushed cross-datacenter optimisation downstream.
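
The pattern itself is simple to sketch. The outline below uses today's Java producer API purely for illustration (Spotify ran the much older 0.7 client), and the broker address, file paths, topic name, and checkpoint format are all hypothetical:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Illustrative sketch of a per-host log-tailing producer with a per-file checkpoint.
public class LogTailProducer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Path logFile = Path.of("/var/log/service/events.log");            // hypothetical log file
        Path checkpoint = Path.of("/var/lib/producer/events.log.offset"); // hypothetical checkpoint

        // Resume from the last recorded position, or from the start of the file.
        long offset = Files.exists(checkpoint)
                ? Long.parseLong(Files.readString(checkpoint).trim())
                : 0L;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             RandomAccessFile reader = new RandomAccessFile(logFile.toFile(), "r")) {
            reader.seek(offset);
            String line;
            while ((line = reader.readLine()) != null) {
                // Events are shipped as unstructured lines; no schema enforcement at this stage.
                producer.send(new ProducerRecord<>("service-events", line));
                offset = reader.getFilePointer();
            }
            producer.flush();
            // Record how far we got; losing this state after a host failure is what
            // forced manual intervention in the original system.
            Files.writeString(checkpoint, Long.toString(offset));
        }
    }
}
```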

Consumer architecture

Two consumer patterns ran in parallel from the same Kafka topics:

Batch consumers wrote events to HDFS for processing by hourly Crunch/MapReduce jobs. These converted tab-separated raw event data into Apache Avro format, partitioned by hour.
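
As a rough illustration of that conversion step, the snippet below turns one tab-separated event line into an Avro record. The schema and field names are hypothetical, and the real jobs ran as Crunch/MapReduce over HDFS with hourly partitioning rather than writing a local file:

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

// Sketch: convert a raw tab-separated event into Avro, the format used for
// downstream analytics. Schema and field names are illustrative.
public class TsvToAvro {

    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SongCompletion\",\"fields\":["
          + "{\"name\":\"user_id\",\"type\":\"string\"},"
          + "{\"name\":\"track_id\",\"type\":\"string\"},"
          + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

    public static void main(String[] args) throws Exception {
        String line = "user-123\ttrack-456\t1430000000000";   // one raw event line
        String[] fields = line.split("\t");

        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("user_id", fields[0]);
        record.put("track_id", fields[1]);
        record.put("timestamp", Long.parseLong(fields[2]));

        // In the real pipeline the output was partitioned by hour on HDFS.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.create(SCHEMA, new File("song_completion.avro"));
            writer.append(record);
        }
    }
}
```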

Real-time consumers used Kafka Consumer Spouts within Apache Storm topologies. Each topology version used a unique Kafka consumer group ID so that old and new versions could consume the same topic simultaneously during deployments. This allowed operators to run both versions in parallel and roll back without losing events. The team also tuned rebalancing.max.tries to reduce consumer rebalancing errors under their workload.
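
A minimal sketch of the versioned-group idea, written against today's Java consumer API (the Storm spouts of that era used the older ZooKeeper-based consumer); the broker address, topic names, and version string are illustrative:

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

// Sketch: each topology deployment gets its own consumer group, so old and new
// versions can read the full stream side by side during a rollout.
public class VersionedTopologyConsumer {

    public static void main(String[] args) {
        String topologyVersion = "v42";   // bumped on every topology deployment

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        // A unique group id per version means each deployment receives the full
        // stream, so rolling back loses no messages.
        props.put("group.id", "user-attributes-topology-" + topologyVersion);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("song-completions", "ad-impressions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }
}
```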

A Liveness Monitor tracked which service hosts were active during each hour by querying service discovery systems. This enabled the batch consumers to verify that all expected events for a given hour had been received before marking the partition complete.

Stream processing

Apache Storm handled real-time event processing in Spotify's personalization and ad-targeting pipelines. Storm topologies subscribed to Kafka topics (e.g. song completions, ad impressions), fetched entity metadata from Cassandra, grouped events per user, and computed user attributes that fed recommendation and targeting models.
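
The wiring of such a topology can be sketched with the storm-kafka spout. The sketch below uses Storm 1.x package names (Spotify's deployment predates these and used the backtype.storm namespaces), and the topic, field names, and bolt bodies are stand-ins for the real Cassandra enrichment and per-user aggregation:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Topology-wiring sketch: Kafka spout -> enrichment bolt -> per-user aggregation bolt.
public class UserAttributesTopology {

    /** Stand-in for the enrichment step that fetched entity metadata from Cassandra. */
    public static class EnrichBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String rawEvent = input.getString(0);      // one event line from Kafka
            String userId = rawEvent.split("\t")[0];    // hypothetical event layout
            String genre = "unknown";                   // real code: Cassandra lookup by track id
            collector.emit(new Values(userId, rawEvent, genre));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("user_id", "event", "genre"));
        }
    }

    /** Stand-in for per-user aggregation feeding recommendation and targeting models. */
    public static class UserAttributeBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Real code: update per-user attribute vectors and write them downstream.
            System.out.printf("user=%s genre=%s%n",
                    input.getStringByField("user_id"), input.getStringByField("genre"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zookeeper:2181"),        // Kafka metadata via ZooKeeper
                "song-completions",                   // illustrative topic name
                "/kafka-spouts",                      // ZK root for stored offsets
                "user-attributes-topology-v42");      // unique id per topology version
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("enrich", new EnrichBolt(), 8).shuffleGrouping("events");
        // fieldsGrouping routes all events for a given user to the same bolt task.
        builder.setBolt("user-attributes", new UserAttributeBolt(), 8)
               .fieldsGrouping("enrich", new Fields("user_id"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("user-attributes", new Config(), builder.createTopology());
        Thread.sleep(60_000);
        cluster.shutdown();
    }
}
```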

The 6-node Storm cluster (24 cores/host) processed over 3 billion events per day by early 2015 and supported use cases including Discover Weekly personalisation, ad targeting, and data visualisation.

The Grouper: cross-datacenter forwarding

Spotify built a custom component called the Grouper that consumed all event streams from a local datacenter and republished them as a single, compressed, efficiently batched topic for cross-datacenter transmission. Without the Grouper, raw Kafka streams would have saturated cross-datacenter links. The Grouper also allowed the team to enforce batching and compression consistently at the datacenter boundary rather than at each individual producer.
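
The actual Grouper was a custom component built against Kafka 0.7, but the idea is easy to sketch with today's clients: consume everything local, lean on producer-side batching and compression, and republish one merged stream. The broker addresses, topic pattern, and merged topic name below are illustrative:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

// Sketch of the Grouper idea: merge all local event topics into a single
// compressed, well-batched stream for cross-datacenter forwarding.
public class GrouperSketch {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "local-dc-kafka:9092");
        consumerProps.put("group.id", "grouper");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "central-dc-kafka:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Batching and compression happen here, at the datacenter boundary,
        // rather than in every per-host producer.
        producerProps.put("compression.type", "gzip");
        producerProps.put("linger.ms", "500");
        producerProps.put("batch.size", Integer.toString(1 << 20));   // 1 MiB batches

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            // Subscribe to every local event topic and merge them into one stream.
            consumer.subscribe(Pattern.compile("events\\..*"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Re-key by source topic so the central consumer can demultiplex.
                    producer.send(new ProducerRecord<>("cross-dc-merged",
                            record.topic().getBytes(), record.value()));
                }
            }
        }
    }
}
```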

Special techniques and engineering innovations

Dual consumer groups for safe topology rollouts: Each new Storm topology version was deployed with a unique Kafka consumer group ID. Both the old and new topology versions consumed the full message stream simultaneously during a rollout window. If the new version misbehaved, operators could roll back without message loss and without manual reprocessing. This pattern decoupled deployment risk from data continuity.

Custom Grouper for cross-datacenter efficiency: Rather than forwarding per-topic streams cross-datacenter, the Grouper consumed multiple topics locally, merged and compressed them, and republished a single efficient stream. This was a pragmatic response to both bandwidth constraints and Kafka 0.7's lack of native cross-datacenter replication.

Custom async-google-pubsub-client: When migrating to Cloud Pub/Sub, the official Google Java client could not meet Spotify's throughput requirements. The team built an open-source, high-performance async Java client — async-google-pubsub-client — which ran in production for over a year before Google's official libraries reached comparable performance.

Isolation per event type (post-Kafka lesson): A direct lesson from the Kafka era's shared-channel problems was applied in the Pub/Sub architecture: each of 500+ event types received its own topic, ETL process, and final storage location. This prevented high-volume events from disrupting business-critical ones, and let the team assign different SLOs per event type (hours for high priority, up to 72 hours for low priority).
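
Spotify implemented this isolation on Cloud Pub/Sub, but the equivalent pattern on Kafka is one topic per event type with per-topic configuration. A sketch with the AdminClient, where the event type names, partition counts, replication factor, and retention are all illustrative:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

// Sketch of topic-per-event-type isolation on Kafka: each event type gets its
// own topic, so a noisy type cannot block the others.
public class EventTypeTopics {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");

        // A few hypothetical event types with potentially different priorities.
        List<String> eventTypes = List.of("song-completion", "ad-impression", "client-debug");

        try (AdminClient admin = AdminClient.create(props)) {
            List<NewTopic> topics = eventTypes.stream()
                    .map(type -> new NewTopic("events." + type, 12, (short) 3)
                            // Retention (and other configs) can differ per event
                            // type, mirroring per-type SLOs.
                            .configs(Map.of("retention.ms",
                                    Long.toString(7L * 24 * 60 * 60 * 1000))))
                    .collect(Collectors.toList());
            admin.createTopics(topics).all().get();
        }
    }
}
```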

Operating Kafka at scale

Spotify's Kafka deployment was entirely self-managed, running on owned datacenter infrastructure and configured via Puppet. The system spanned five datacenters, with the Grouper component managing cross-datacenter forwarding.

Observability: The team monitored consumer group lag and rebalancing events, and tuned rebalancing.max.tries to reduce noise from regular rebalancing errors. A dedicated Liveness Monitor tracked active service hosts by hour to detect gaps in event delivery before they became incidents.
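
Consumer lag remains the first signal worth watching on any Kafka deployment. A minimal sketch of measuring it with the modern AdminClient (an API that did not exist in Spotify's Kafka era), using an illustrative group id and broker address:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

// Sketch: compute per-partition lag for a consumer group as
// (latest end offset) - (last committed offset).
public class ConsumerLagCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("user-attributes-topology-v42")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```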

Deployment and upgrades: Configuration changes were deployed via Puppet. The tight coupling between system components meant that even small changes to one component could cause system-wide outages, which the team described as difficult to recover from. There was no clean way to iterate on individual components in isolation.

Incident response: Missing end-of-file markers after host failure required manual intervention to unblock the Liveness Monitor and resume processing for the affected hour. The lack of automated recovery for this scenario was one of the documented pain points of the v1 architecture.

Developer experience: Spotify open-sourced docker-kafka (a combined Kafka and ZooKeeper Docker image) to give developers a local Kafka environment for testing. The repository is now archived.

Challenges and how they solved them

Kafka 0.7 had no reliable broker-level persistence

Problem: Kafka 0.7 did not support broker-level replication. All reliable persistence lived in Hadoop, making it the single point of failure for the entire event delivery system.

Root cause: Architectural limitation of the Kafka version in use. Broker-level replication was not available until later releases.

Solution: The team accepted the limitation and designed the system around HDFS as the persistence layer. This worked until the coupling between Hadoop availability and event delivery became operationally unacceptable.

Outcome: The design constraint became part of the technical debt that eventually made the migration necessary.

Cross-datacenter delivery at 700K events/second

Problem: Producers across five datacenters needed to deliver events to a central Hadoop cluster. Raw event streams would have exhausted cross-datacenter bandwidth, and producers needed distant datacenter acknowledgement before considering delivery complete.

Root cause: No native cross-datacenter replication in Kafka 0.7; bandwidth constraints at datacenter links.

Solution: Built the custom Grouper component to consume event streams locally, compress and batch them, and forward a single efficient stream per datacenter.

Outcome: The Grouper reduced cross-datacenter bandwidth and improved delivery reliability, at the cost of a custom component that the team had to maintain and operate.

Kafka 0.8 MirrorMaker silently dropped data

Problem: When the team evaluated upgrading to Kafka 0.8, MirrorMaker instances dropped data while reporting to the source cluster that mirroring had succeeded. The producer also required full service restarts after failure, with no automated recovery path.

Root cause: Bugs in the MirrorMaker coordination logic in the Kafka 0.8 version under evaluation.

Solution: Abandoned the Kafka 0.8 path entirely. Rather than invest further in debugging MirrorMaker, the team migrated to Google Cloud Pub/Sub and built a custom high-throughput Java client (async-google-pubsub-client) for the migration.

Outcome: A five-day consumer stability test of the Pub/Sub system showed a median end-to-end latency of 20 seconds with zero message loss. The Kafka system was decommissioned in February 2017.

No quality-of-service differentiation between event types

Problem: All event types shared the same delivery channel. A surge in high-volume, low-priority events could degrade delivery of business-critical event types, with no mechanism to prioritise one over another.

Root cause: The single shared Kafka pipeline had no per-event-type QoS controls.

Solution: The Pub/Sub architecture assigned each event type its own topic, ETL process, and storage location. Priority tiers determined SLOs: High (few hours), Normal (24 hours), Low (72 hours).

Outcome: By 2019, the team could state that "a noisy, broken, or blocked event type will not halt the rest of the system."

Tight component coupling blocked iteration

Problem: Even small changes to one component caused system-wide outages that were hard to diagnose and recover from. The team could not improve the system incrementally.

Root cause: All components were tightly coupled. The Hadoop dependency for persistence, the Grouper's cross-datacenter role, and the producer's manual checkpoint logic created a fragile dependency graph.

Solution: A full system redesign, not an iterative fix. The new architecture decomposed the system into independent microservices with explicit interfaces, enabling independent deployment and failure isolation.

Outcome: By 2019, the event delivery system comprised approximately 15 microservices across ~2,500 VMs, each independently deployable.

Full tech stack

Category | Tools | Notes
Message broker (Kafka era) | Apache Kafka 0.7, evaluated 0.8 | Decommissioned February 2017
Message broker (post-Kafka) | Google Cloud Pub/Sub | Per-event-type topics; 7-day retention
Stream processing (Kafka era) | Apache Storm | 6-node cluster, 24 cores/host; consumed from Kafka topics
Stream/batch processing (post-Kafka) | Scio (Scala API for Apache Beam) on Google Dataflow | Unified batch and streaming; open-sourced by Spotify
Schema | Apache Avro | Applied via hourly ETL; no schema enforcement at Kafka producer level
Storage (Kafka era) | HDFS on Hadoop | Single point of failure in original architecture
Batch processing (Kafka era) | Crunch/MapReduce on Hadoop | Hourly Avro ETL from HDFS
Storage (post-Kafka) | Google Cloud Storage, BigQuery | GCS for hourly immutable partitions; BigQuery for analytics
Database | Apache Cassandra | User profile attributes and entity metadata for Storm enrichment
Coordination (Kafka era) | Apache ZooKeeper | Kafka dependency
Configuration management | Puppet | Replaced as part of broader cloud migration
Local development (Kafka era) | spotify/docker-kafka (Docker) | Combined Kafka + ZooKeeper image; now archived
Internal container orchestration | Helios | Spotify's precursor to Kubernetes
Compute infrastructure | Google Compute Engine, Regional Managed Instance Groups | Post-migration; approximately 2,500 VMs
Deduplication | Google Dataproc, later Google Dataflow | Multi-week lookback deduplication using event message identifiers
Sensitive data handling | Google Dataflow | In-flight encryption for events containing personal data

Key contributors

Name | Role | Contribution
Igor Maravić | Software Engineer, Spotify | Led the event delivery migration; authored the three-part "Road to the Cloud" blog series; presented at QCon New York 2016
Neville Li | Data Infrastructure Engineer, Spotify | Created Scio, Spotify's Scala API for Apache Beam; presented at QCon New York 2016
Kinshuk Mishra | Engineer, Spotify | Published the Storm/Kafka personalization architecture (2015)
Matt Brown | Engineer, Spotify | Co-authored the Cassandra personalization post (2015)
Bartosz Janota | Data Infrastructure Engineer, Spotify | Co-authored "Life in the Cloud" retrospective (2019)
Robert Stephenson | Senior Product Manager, Spotify | Co-authored the 2019 and 2021 EDI articles
Flavio Santos | Data Infrastructure Engineer, Spotify | Co-authored the EDI v3 migration article (2021)

Key takeaways for your own Kafka implementation

  • Operational burden compounds with scale. Spotify's Kafka setup functioned well at 700K events/second but required increasing amounts of custom work (the Grouper, manual EOF recovery, tight Puppet-based configuration) to hold together. If you are building on Kafka, account for the operational investment required as throughput grows, especially if you are running across multiple datacenters.
  • Cross-datacenter Kafka without native replication requires custom solutions. Spotify built the Grouper specifically because Kafka 0.7 lacked reliable cross-datacenter capabilities. If you are running Kafka across regions today, verify which replication mechanisms you are relying on and test their failure modes explicitly — Spotify's MirrorMaker evaluation is a reminder that silent data loss is a realistic failure mode.
  • Per-event-type isolation changes the operational model significantly. Spotify's shared Kafka pipeline had no QoS controls, which meant any event type could affect all others. The isolation-per-event-type pattern they adopted post-migration (one topic per event type, independent SLOs) is applicable with Kafka: topic-per-event-type design increases operational surface area but limits blast radius when a single consumer or producer misbehaves.
  • Managed services trade operational control for operational simplicity. Spotify's migration to Pub/Sub removed a significant operational burden and enabled 11x throughput growth without a proportional increase in infrastructure work. If you are evaluating self-managed Kafka against a managed alternative, Spotify's trajectory is a concrete reference point for what that trade looks like at 700K+ events/second.
  • Decouple deployment risk from data continuity. Spotify's dual consumer group pattern for Storm rollouts — where old and new topology versions consumed the same topic simultaneously — is a practical technique for any system where consumer logic changes frequently. It removes message loss as a rollback cost and lets you validate new consumers against live data before cutting over.

Sources, further reading and CTAs

If you want to monitor and inspect your own Kafka clusters — consumer lag, topic throughput, partition health — give Kpow a try. It connects to any Kafka cluster in minutes and is available for a free 30-day trial. You can deploy it via Docker, Helm, or JAR.