How LinkedIn uses Apache Kafka in production

Factor House
May 11th, 2026

LinkedIn built Apache Kafka internally in 2010 to solve the problem of moving large volumes of event data reliably between internal systems. By 2019 the company was processing more than 7 trillion messages per day across over 100 clusters, 4,000 brokers, and 7 million partitions - one of the largest Kafka deployments publicly documented. Kafka connects virtually every system at LinkedIn, from member activity tracking and real-time search indexing to database replication and stream processing at scale.

Company overview

LinkedIn is a professional networking platform with over one billion members. Its products span hiring, marketing, learning, and news, and almost all of its personalisation, recommendations, and analytics workloads are underpinned by data pipelines.

Kafka was developed at LinkedIn in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao. The motivation was straightforward: LinkedIn needed a unified, durable, high-throughput channel to move event data between its growing set of internal systems. It was open-sourced in June 2011 and donated to the Apache Software Foundation.

LinkedIn Kafka milestones:

  • 2010: Apache Kafka developed internally at LinkedIn
  • June 2011: Kafka open-sourced and released to the Apache community
  • July 2011: Kafka processing 1 billion messages per day in production
  • 2012: 20 billion messages per day
  • July 2013: 200 billion messages per day
  • 2013: Apache Samza open-sourced by LinkedIn
  • 2015: 1 trillion messages per day; peak of 4.5 million messages per second; Burrow open-sourced
  • 2016: Approximately 1.4 trillion messages per day; approximately 1,400 brokers; more than 2 PB per week
  • August 2017: Kafka Cruise Control open-sourced
  • 2018: Migrated from Kafka MirrorMaker to Brooklin for all cross-cluster replication
  • 2019: 7 trillion messages per day; Brooklin and Cruise Control Frontend open-sourced
  • April 2022: Load-balanced Brooklin Mirror Maker published; Microsoft Azure migration paused
  • November 2022: TopicGC published
  • 2023: Apache Beam adopted to unify batch and streaming pipelines

LinkedIn's Kafka use cases

Kafka at LinkedIn is not limited to a single team or product area. It serves as the central data transport layer across the company.

Activity tracking is Kafka's original use case at LinkedIn. The platform captures member activity events - pageviews, search queries, and ad impressions - and delivers them to both offline batch analytics pipelines (Hadoop) and real-time online services.

Real-time search indexing uses Kafka to deliver network-update events to LinkedIn's search engine. According to the 2011 engineering post by Jay Kreps, updates become searchable within seconds of being posted.

Inter-service messaging runs asynchronously over Kafka topics, connecting LinkedIn's microservices without direct coupling between them.

Database replication (Espresso CDC) placed Kafka on a latency-sensitive, mission-critical path. LinkedIn replaced MySQL replication with a Kafka-backed CDC pipeline for Espresso, its internal NoSQL document store, requiring no-data-loss guarantees and low transfer latencies.

Derived-data upload to Venice uses Kafka to asynchronously upload data into LinkedIn's derived-data serving store, decoupling the write path from downstream consumers.

Stream processing with Samza relies entirely on Kafka as both input and output. Apache Samza, which LinkedIn open-sourced in 2013, consumes from and produces to Kafka for all real-time processing jobs. Samza also uses log-compacted Kafka topics as durable state backup stores.

Mobile event ingestion arrives via a REST interface built around the LiKafka client, allowing mobile devices and non-Java services to produce events into Kafka reliably without depending on the native Java client.

Cross-cluster and cross-datacenter aggregation is handled by Brooklin, which mirrors Kafka topics between clusters and datacenters to support centralised analytics and fault tolerance.

Log and metrics storage treats Kafka as the circulatory system of LinkedIn's infrastructure, routing log data and system metrics between internal services.

Scale & throughput

LinkedIn's Kafka footprint grew roughly a thousandfold in four years:

  • July 2011: 1 billion messages per day
  • 2012: 20 billion messages per day
  • July 2013: 200 billion messages per day
  • 2015: 1 trillion messages per day, with a peak throughput of 4.5 million messages per second

By 2019, LinkedIn had surpassed 7 trillion messages per day, spread across more than 100 clusters, 4,000 brokers, 100,000 topics, and 7 million partitions. Each message was consumed by approximately four applications on average.

Data volume reached 1.34 PB per week at the 1 trillion messages per day milestone in 2015. With approximately 1,400 brokers in a later configuration, LinkedIn received over 2 PB of data per week.

Brooklin, used for cross-cluster replication, was itself mirroring more than 7 trillion messages per day as of 2022.

The maximum message size is capped at 1 MB.
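Kafka exposes this kind of cap through standard size settings; the config names below are real Kafka properties, but the exact values LinkedIn runs internally are not public, so the 1 MB figures here simply mirror the cap described above. A minimal client-side guard might look like:

```python
# Kafka's standard message-size settings, shown with illustrative ~1 MB values
# to match the cap described above (LinkedIn's exact values are an assumption).
SIZE_CONFIGS = {
    "message.max.bytes": 1_000_000,  # broker-level cap on a record batch
    "max.message.bytes": 1_000_000,  # per-topic override of the broker cap
    "max.request.size": 1_000_000,   # producer-side cap on a single request
}

def fits_cap(payload: bytes, cap: int = SIZE_CONFIGS["message.max.bytes"]) -> bool:
    """Client-side guard: reject oversized payloads before the broker does."""
    return len(payload) <= cap
```

Checking size on the client avoids a round trip to the broker for requests that are guaranteed to be rejected.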

LinkedIn's Kafka architecture

LinkedIn's Kafka deployment is self-managed and multi-datacenter. The company operates its own data centers across US locations and Singapore.

Cluster topology

Each datacenter runs multiple Kafka clusters, each containing an independent set of data with distinct purposes. Activity tracking, Espresso replication, and metrics are kept in separate clusters. This isolation limits the blast radius of failures and allows cluster configurations to be tuned per workload.

The overall topology is tiered: producers write to a local Kafka cluster in their datacenter, data is replicated to aggregate clusters that consolidate events across datacenters, and consumers read from local copies. Data is copied only once per datacenter.

Cross-datacenter replication

LinkedIn replaced Kafka MirrorMaker (KMM) with Brooklin Mirror Maker (BMM) in 2018. A single Brooklin cluster handles hundreds of simultaneous datastreams, whereas the KMM approach required a separate cluster for each pipeline - an arrangement that became operationally unsustainable at LinkedIn's scale.

Kafka release branches

LinkedIn maintains internal release branches of Apache Kafka, suffixed -li, that are kept close to upstream. These are not a fork: LinkedIn submits patches upstream via Kafka Improvement Proposals (KIPs), either before cherry-picking them into the LinkedIn branch or shortly after committing a hotfix internally. New internal releases are cut approximately every quarter.

Schema management

All data pipelines at LinkedIn are standardised on Apache Avro. The LiKafka client handles schema registration with a central Schema Registry, Avro encoding and decoding, and auditing automatically on both producer and consumer sides. Consumers receive schemas without needing to manage them explicitly.
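LiKafka's internals are not public, but the register-then-reference pattern it describes is common to schema-registry designs: the producer registers the Avro schema once, receives a compact id, and prepends that id to each payload so consumers can fetch the schema on demand. A toy sketch of that flow (all names hypothetical):

```python
import struct

class SchemaRegistry:
    """Toy in-memory registry: registering the same schema twice returns
    the same id. A hypothetical stand-in, not LinkedIn's implementation."""
    def __init__(self):
        self._ids = {}      # schema text -> id
        self._schemas = {}  # id -> schema text

    def register(self, schema: str) -> int:
        if schema not in self._ids:
            new_id = len(self._ids) + 1
            self._ids[schema] = new_id
            self._schemas[new_id] = schema
        return self._ids[schema]

    def lookup(self, schema_id: int) -> str:
        return self._schemas[schema_id]

def encode(registry: SchemaRegistry, schema: str, payload: bytes) -> bytes:
    # Prepend a 4-byte schema id so consumers never ship schemas themselves.
    return struct.pack(">I", registry.register(schema)) + payload

def decode(registry: SchemaRegistry, message: bytes):
    schema_id, = struct.unpack(">I", message[:4])
    return registry.lookup(schema_id), message[4:]
```

Because only the id travels with each message, the per-message overhead is constant regardless of schema size, and consumers stay decoupled from schema distribution.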

Producer architecture

LinkedIn's LiKafka client wraps the standard Kafka producer and adds schema registration, auditing, and large message support. Quota enforcement - per-producer bandwidth limits in bytes per second - is applied at the broker level to prevent any single application from saturating a cluster. Quotas can be exceeded by whitelisted users, and configuration changes take effect without a rolling broker restart.
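A bytes-per-second quota of this kind can be sketched as a token bucket; this is a hypothetical simplification (Kafka's broker-side quotas actually throttle clients rather than reject requests outright):

```python
import time

class ByteQuota:
    """Token-bucket sketch of a per-producer bytes/sec quota.
    Hypothetical illustration, not LinkedIn's broker code."""
    def __init__(self, bytes_per_sec: float, now=time.monotonic):
        self.rate = bytes_per_sec
        self.tokens = bytes_per_sec   # start with one second of headroom
        self.now = now
        self.last = now()

    def try_produce(self, nbytes: int) -> bool:
        t = self.now()
        # Refill tokens for elapsed time, capped at one second's allowance.
        self.tokens = min(self.rate, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

Injecting the clock (`now=`) keeps the sketch deterministic and testable without sleeping.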

Consumer architecture

Each Kafka message is consumed by approximately four applications on average. Offset management and consumer lag monitoring are handled through Burrow (described under Operating Kafka at scale). Consumer groups and lag are evaluated across every partition using a sliding-window model rather than static threshold comparisons.

Stream processing

Apache Samza handles all real-time stream processing at LinkedIn and depends on Kafka both as the event source and as a durable state store. Samza backs up stream processor state to log-compacted Kafka topics, enabling recovery after disk failures without replaying the entire topic history.
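The reason log compaction makes this recovery cheap is that only the newest value per key survives, so replaying the changelog is bounded by the size of the state, not the history. A minimal sketch of the idea:

```python
def compact(log):
    """Log-compaction sketch: keep only the newest value per key.
    A None value is a tombstone that deletes the key."""
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return list(latest.items())

def restore_state(compacted):
    """A Samza-style task rebuilds its in-memory state store by replaying
    the compacted changelog, instead of the full topic history."""
    return dict(compacted)
```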

Kafka Connect ecosystem

LinkedIn built Gobblin, a Kafka-to-Hadoop ingestion framework, to continuously copy Kafka data into HDFS for offline analytics. Gobblin replaced an earlier framework called Camus. Brooklin also serves as a streaming bridge between Kafka and external systems including Azure Event Hubs and AWS Kinesis.

Special techniques & engineering innovations

Time-based replica lag thresholds: LinkedIn changed the Kafka replica lag calculation from a bytes-behind threshold to a time-based threshold. The bytes model caused large messages to incorrectly mark replicas as out of sync during normal operation. Switching to time prevents false positives from message size spikes.
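The difference between the two models can be shown in a few lines (`replica.lag.time.max.ms` is the real Kafka setting behind the time-based model; the helper functions here are illustrative):

```python
def out_of_sync_by_bytes(lag_bytes: int, max_lag_bytes: int) -> bool:
    """Old model: flag a replica once it is more than N bytes behind.
    A single large message can trip this during normal operation."""
    return lag_bytes > max_lag_bytes

def out_of_sync_by_time(seconds_behind: float, max_lag_seconds: float) -> bool:
    """Time model (cf. Kafka's replica.lag.time.max.ms): flag a replica only
    if it has failed to fully catch up for longer than the threshold."""
    return seconds_behind > max_lag_seconds
```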

Rack-aware replica placement: Replicas for a partition are placed on brokers in different datacenter racks. This prevents a top-of-rack switch failure from taking all replicas for a partition offline simultaneously.
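Placement along these lines can be sketched by walking the racks round-robin; this is a hypothetical simplification, not Kafka's actual assignment algorithm:

```python
from itertools import cycle

def place_replicas(brokers_by_rack: dict, replication_factor: int):
    """Rack-aware placement sketch: cycle through racks so a partition's
    replicas land on as many distinct racks as possible."""
    total = sum(len(b) for b in brokers_by_rack.values())
    if replication_factor > total:
        raise ValueError("not enough brokers for the requested replication factor")
    racks = cycle(sorted(brokers_by_rack))
    used = {rack: 0 for rack in brokers_by_rack}
    replicas = []
    while len(replicas) < replication_factor:
        rack = next(racks)
        brokers = brokers_by_rack[rack]
        if used[rack] < len(brokers):   # skip racks with no brokers left
            replicas.append((rack, brokers[used[rack]]))
            used[rack] += 1
    return replicas
```

With three racks and a replication factor of three, every replica lands in a different rack, so losing one top-of-rack switch leaves two replicas reachable.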

TopicGC - automated unused topic garbage collection: LinkedIn built an internal service called TopicGC to detect unused Kafka topics, seal them (blocking reads and writes), disable mirroring, and delete them in batches of up to three concurrent deletions. A final usage check runs before each deletion step. In one of LinkedIn's largest data pipelines, TopicGC reduced the topic count by approximately 20% and improved produce and consume performance by at least 30%.
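The seal, re-check, batch-delete sequence above can be sketched as follows (function names are hypothetical; the batches here run sequentially for simplicity, whereas TopicGC runs up to three deletions concurrently):

```python
def run_topic_gc(topics, is_in_use, seal, delete, batch_size=3):
    """Sketch of the TopicGC flow: seal unused topics, then delete them in
    small batches, re-checking usage immediately before each deletion."""
    sealed = [t for t in topics if not is_in_use(t)]
    for t in sealed:
        seal(t)                            # block reads/writes, stop mirroring
    deleted = []
    for i in range(0, len(sealed), batch_size):
        for t in sealed[i:i + batch_size]:  # at most batch_size at a time
            if not is_in_use(t):            # final usage check
                delete(t)
                deleted.append(t)
    return deleted
```

The second usage check matters: a topic can come back into use between being sealed and being deleted, and deletion is the only irreversible step.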

Kafka as Espresso CDC backbone: Replacing MySQL replication with Kafka for Espresso required significantly stronger delivery guarantees than typical event streaming. LinkedIn tuned for no-data-loss delivery while maintaining the low transfer latencies Espresso's online traffic requires.

LiKafka REST service for non-Java clients: The original REST interface had no guaranteed delivery. LinkedIn redesigned it so an event is only considered consumed after successful delivery to the destination topic, enabling reliable mobile and non-Java ingestion.

Load-balanced Brooklin partition assignment: The default Brooklin partition assignment distributed partitions by count rather than throughput. LinkedIn migrated Brooklin's Kafka connectors from high-level consumer APIs to manual partition assignment APIs and introduced a throughput-based assignment strategy, reducing replication latency imbalance.
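A throughput-based strategy of this kind can be approximated with a greedy heuristic: repeatedly hand the heaviest unassigned partition to the least-loaded task. This is a sketch of the general technique, not Brooklin's actual assignment code:

```python
import heapq

def assign_by_throughput(partition_rates: dict, num_tasks: int) -> dict:
    """Greedy sketch: assign each partition (heaviest first) to the
    currently least-loaded task, balancing bytes/sec, not partition counts."""
    heap = [(0.0, task_id, []) for task_id in range(num_tasks)]
    heapq.heapify(heap)
    for partition, rate in sorted(partition_rates.items(), key=lambda kv: -kv[1]):
        load, task_id, assigned = heapq.heappop(heap)  # least-loaded task
        assigned.append(partition)
        heapq.heappush(heap, (load + rate, task_id, assigned))
    return {task_id: assigned for _, task_id, assigned in heap}
```

Count-based assignment could give one task two hot partitions and another two idle ones; weighting by observed throughput keeps the per-task load, and hence replication latency, far more even.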

Kafka Audit - end-to-end message completeness verification: LiKafka clients emit per-topic message counts in 10-minute windows to a dedicated audit topic. The Kafka Audit Service aggregates counts from producers, each tier of the cluster topology, and critical consumers such as Hadoop, verifying that no messages are lost or duplicated across the pipeline.
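The core of such an audit is bucketing counts into fixed windows and comparing them tier by tier. A minimal sketch of that comparison (the structure is illustrative; LinkedIn's audit service is more elaborate):

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # the 10-minute audit window described above

def window_start(epoch_seconds: int, window: int = WINDOW_SECONDS) -> int:
    """Bucket an event timestamp into the start of its audit window."""
    return epoch_seconds - epoch_seconds % window

def find_discrepancies(counts_by_tier: dict) -> dict:
    """Compare per-(topic, window) message counts across pipeline tiers
    (producer, local cluster, aggregate cluster, Hadoop, ...) and return
    every key whose counts disagree, indicating loss or duplication."""
    merged = defaultdict(dict)
    for tier, counts in counts_by_tier.items():
        for key, n in counts.items():
            merged[key][tier] = n
    return {key: tiers for key, tiers in merged.items()
            if len(set(tiers.values())) > 1}
```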

Operating Kafka at scale

Deployment model: LinkedIn runs Kafka on its own hardware across multiple datacenters. As of early 2022, a planned migration to Microsoft Azure was paused.

Consumer lag monitoring with Burrow: Burrow evaluates consumer group health across every partition using a sliding window of offset commits - approximately 10 commits, representing roughly 10 minutes of lag history at LinkedIn's typical commit rate. Rather than alerting on a fixed lag threshold, Burrow assesses the direction and rate of change of consumer lag to produce a health status: good, degraded, or stopped. It is exposed via an HTTP API and was open-sourced in 2015.
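The behavioural idea can be sketched in a few lines. This is a hypothetical simplification of Burrow's rules for a single partition; Burrow itself combines several such checks across all partitions of a group:

```python
def consumer_status(committed_offsets, log_end_offsets):
    """Sliding-window status sketch. Both lists are oldest-to-newest
    observations over the window for one partition.

    stopped:  commits are not advancing while lag remains
    degraded: commits advance but lag keeps growing
    good:     otherwise
    """
    lags = [end - c for c, end in zip(committed_offsets, log_end_offsets)]
    advancing = committed_offsets[-1] > committed_offsets[0]
    if not advancing and lags[-1] > 0:
        return "stopped"
    if lags[-1] > lags[0] and lags[-1] > 0:
        return "degraded"
    return "good"
```

Note that a consumer with large but shrinking lag is reported as good, while one with small but growing lag is degraded: the direction of change, not the absolute number, drives the status.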

Cluster rebalancing with Cruise Control: Broker failures occur at LinkedIn's scale on a daily basis. Manual SRE intervention for partition reassignment was not sustainable. Cruise Control continuously monitors disk, network, and CPU utilisation across all brokers and automatically reassigns partitions to meet pre-defined performance goals. It also handles self-healing on broker failure. LinkedIn open-sourced Cruise Control in August 2017. The Cruise Control Frontend (CCFE), a web dashboard for managing all Kafka clusters, was open-sourced in 2019.

Live availability monitoring with Kafka Monitor: Kafka Monitor runs synthetic producer and consumer applications against live clusters to continuously validate ordering, delivery, and data integrity guarantees and measure end-to-end latencies. It is also used to qualify new Kafka builds before they reach production.

Self-service topic management with Nuage: LinkedIn's Nuage portal gives teams self-service access to Kafka topic lifecycle operations, including topic creation, deletion, configuration changes, metadata browsing, and ACL management. It delegates operations to the Kafka REST service.

Usage attribution with Bean Counter: Bean Counter tracks megabytes sent and received per application, attributing Kafka infrastructure costs to individual teams and providing an incentive for responsible usage.

Chaos testing with Simoorg: LinkedIn integrates Simoorg, its open-source failure inducer, into Kafka release validation. Simoorg introduces low-level failures such as dropped packets, low memory, failed disk writes, and process kills to verify that new Kafka builds behave correctly under real failure conditions.

Quarterly release cadence: LinkedIn cuts a new internal Kafka release approximately every quarter, tracking open-source trunk closely and upstreaming patches where possible.

Challenges & how they solved them

Kafka MirrorMaker data loss on upgrades and reboots

Root cause: The original KMM design could lose messages when a cluster was upgraded or a machine rebooted, because the delivery contract did not guarantee persistence before marking a message as consumed.

Solution: LinkedIn first redesigned the delivery contract so a message is only marked consumed after successful delivery to the destination. Later, in 2018, LinkedIn replaced all KMM clusters with Brooklin, which also eliminated the operational overhead of managing hundreds of separate pipeline-specific clusters.

Outcome: Reliable cross-cluster replication at scale, with a single Brooklin cluster handling hundreds of simultaneous datastreams.

ZooKeeper controller initialisation bottleneck from metadata bloat

Root cause: Accumulation of unused Kafka topics inflated ZooKeeper response payloads. For LinkedIn's largest cluster, the ZooKeeper response size had reached 0.75 MB, close to the 1 MB limit that would trigger availability issues.

Solution: LinkedIn built TopicGC to automatically detect and delete unused topics in a controlled sequence, with a last-minute usage check before each deletion step.

Outcome: Topic count reduced by approximately 20% in one of LinkedIn's largest pipelines; produce and consume performance improved by at least 30%.

Unbalanced cluster load and partition skew after broker failures

Root cause: Broker failures at LinkedIn's scale occur daily. Manual partition reassignment by SREs was not a sustainable response.

Solution: LinkedIn built Cruise Control, which continuously monitors cluster resource utilisation and automatically reassigns partitions to meet disk, network, and CPU goals.

Outcome: Automated self-healing on broker failure; Cruise Control open-sourced and adopted across the industry.

Uneven load distribution across Brooklin replication tasks

Root cause: Brooklin's original partition assignment distributed partitions by count rather than by throughput, causing some tasks to process significantly more data than others.

Solution: LinkedIn migrated to manual partition assignment APIs and implemented a throughput-based partition assignment strategy.

Outcome: Reduced replication latency imbalance across Brooklin tasks.

False replica out-of-sync alerts caused by large messages

Root cause: The bytes-of-lag threshold used to determine replica health could be exceeded by a single large message, incorrectly marking a healthy replica as out of sync.

Solution: LinkedIn switched the replica lag calculation to a time-based threshold.

Outcome: Fewer false positive alerts and more accurate replica health reporting.

Full tech stack

  • Message broker: Apache Kafka (-li branches). LinkedIn-internal release branches kept close to upstream; patches submitted via KIPs.
  • Schema registry: LinkedIn Schema Registry. Central Avro schema store; integrated with the LiKafka client on both producer and consumer sides.
  • Serialisation: Apache Avro. Mandatory standard across all LinkedIn data pipelines.
  • Kafka client: LiKafka. Custom client wrapping the OSS Kafka producer and consumer; adds schema management, auditing, and large message support.
  • Stream processing: Apache Samza, Apache Beam, Apache Spark. Samza for real-time; Beam adopted in 2023 for unified batch and streaming; Spark for petabyte-scale batch.
  • Cross-cluster replication: Brooklin (Brooklin Mirror Maker). Replaced Kafka MirrorMaker in 2018; also bridges Kafka to Azure Event Hubs and AWS Kinesis.
  • Kafka-to-Hadoop ingestion: Gobblin. Replaced Camus; continuously copies Kafka data to HDFS.
  • Storage and serving: Espresso, Venice, Hadoop/HDFS. Espresso (NoSQL, CDC via Kafka), Venice (derived-data store), HDFS (offline analytics).
  • Cluster rebalancing: Kafka Cruise Control, Cruise Control Frontend (CCFE). Automated partition rebalancing and self-healing; CCFE provides a central web dashboard.
  • Consumer lag monitoring: Burrow. Sliding-window lag analysis; HTTP API; open-sourced in 2015.
  • Live cluster health: Kafka Monitor. Synthetic produce/consume tests; used in release qualification.
  • Topic lifecycle: Nuage, TopicGC. Nuage provides a self-service portal for topic management; TopicGC automates garbage collection of unused topics.
  • Usage attribution: Bean Counter. Per-application MB tracking for cost attribution.
  • Chaos testing: Simoorg. Open-source failure inducer; integrated into Kafka release validation.
  • Real-time OLAP: Apache Pinot. Serves partition-level metrics collected by Cruise Control.
  • Coordination: Apache ZooKeeper. Kafka broker metadata coordination.

Key contributors

  • Jay Kreps, co-creator of Kafka at LinkedIn: co-designed and built the original Kafka system; co-authored the 2011 open-source release post.
  • Neha Narkhede, co-creator of Kafka at LinkedIn: co-designed and built the original Kafka system; later co-founded Confluent.
  • Jun Rao, co-creator of Kafka at LinkedIn: co-designed and built the original Kafka system; later co-founded Confluent.
  • Kartik Paramasivam, engineering lead, Kafka team: authored "How we're improving and advancing Kafka at LinkedIn" (2015); led quota, consumer, and reliability work.
  • Todd Palino, data infrastructure SRE: built Burrow; authored "Running Kafka at scale"; presented "More data centers, more problems" at Kafka Summit.
  • Joel Koshy, staff engineer, Kafka team: authored "Kafka ecosystem at LinkedIn"; presented "Kafkaesque days at LinkedIn in 2015" at Kafka Summit; led Nuage development.
  • Efe Gencer, Kafka team: created Cruise Control as an intern project in 2017.
  • Joseph Lin, streaming infrastructure team: co-invented TopicGC; implemented its original design (2022).
  • Lincong Li, streaming infrastructure team: co-invented TopicGC; implemented its original design (2022).
  • Clark Haskins, data infrastructure SRE: Burrow core developer.
  • Grayson Chao, data infrastructure SRE: Burrow core developer.
  • Jon Bringhurst, data infrastructure SRE: Burrow core developer.

Key takeaways for your own Kafka implementation

  • Separate clusters by workload type. LinkedIn runs multiple clusters per datacenter rather than a single shared cluster. This isolates failure domains and lets each cluster be tuned for its specific performance and durability requirements.
  • Automate rebalancing from day one. Manual partition reassignment does not scale. LinkedIn's experience building Cruise Control demonstrates that continuous automated rebalancing is more reliable than on-demand SRE intervention, particularly as broker failure frequency increases with fleet size.
  • Monitor consumer lag by behaviour, not by threshold. Burrow's sliding-window approach evaluates whether consumers are making progress relative to their recent history, rather than alerting when lag exceeds a fixed byte count. This reduces false positives and gives a more accurate picture of consumer health.
  • Invest in topic lifecycle management. Unused topics accumulate over time and can create real operational problems at scale. TopicGC's results at LinkedIn - a 20% reduction in topic count and a 30% improvement in performance - demonstrate that proactive topic hygiene is worth automating.
  • Standardise serialisation across all pipelines early. LinkedIn's universal adoption of Apache Avro and a central Schema Registry allowed teams to share infrastructure and tooling. Retrofitting schema standards onto an existing pipeline ecosystem is significantly harder.

Sources & further reading

If you are managing a Kafka deployment of your own, Kpow provides observability, control, and governance for Kafka clusters across self-managed, MSK, and Confluent environments. You can try it free for 30 days.