
How LinkedIn uses Apache Kafka in production
LinkedIn built Apache Kafka internally in 2010 to solve the problem of moving large volumes of event data reliably between internal systems. By 2019 the company was processing more than 7 trillion messages per day across over 100 clusters, 4,000 brokers, and 7 million partitions - one of the largest Kafka deployments publicly documented. Kafka connects virtually every system at LinkedIn, from member activity tracking and real-time search indexing to database replication and stream processing at scale.
Company overview
LinkedIn is a professional networking platform with over one billion members. Its products span hiring, marketing, learning, and news, and almost all of its personalisation, recommendations, and analytics workloads are underpinned by data pipelines.
Kafka was developed at LinkedIn in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao. The motivation was straightforward: LinkedIn needed a unified, durable, high-throughput channel to move event data between its growing set of internal systems. It was open-sourced in June 2011 and donated to the Apache Software Foundation.
LinkedIn's Kafka use cases
Kafka at LinkedIn is not limited to a single team or product area. It serves as the central data transport layer across the company.
Activity tracking is Kafka's original use case at LinkedIn. The platform captures member activity events - pageviews, search queries, and ad impressions - and delivers them to both offline batch analytics pipelines (Hadoop) and real-time online services.
Real-time search indexing uses Kafka to deliver network-update events to LinkedIn's search engine. According to the 2011 engineering post by Jay Kreps, updates become searchable within seconds of being posted.
Inter-service messaging runs asynchronously over Kafka topics, connecting LinkedIn's microservices without direct coupling between them.
Database replication (Espresso CDC) placed Kafka on a latency-sensitive, mission-critical path. LinkedIn replaced MySQL replication with a Kafka-backed CDC pipeline for Espresso, its internal NoSQL document store, requiring no-data-loss guarantees and low transfer latencies.
Derived-data upload to Venice uses Kafka to asynchronously upload data into LinkedIn's derived-data serving store, decoupling the write path from downstream consumers.
Stream processing with Samza relies entirely on Kafka as both input and output. Apache Samza, which LinkedIn open-sourced in 2013, consumes from and produces to Kafka for all real-time processing jobs. Samza also uses log-compacted Kafka topics as durable state backup stores.
Mobile event ingestion arrives via a REST interface built around the LiKafka client, allowing mobile devices and non-Java services to produce events into Kafka reliably without depending on the native Java client.
Cross-cluster and cross-datacenter aggregation is handled by Brooklin, which mirrors Kafka topics between clusters and datacenters to support centralised analytics and fault tolerance.
Log and metrics storage treats Kafka as the circulatory system of LinkedIn's infrastructure, routing log data and system metrics between internal services.
Scale & throughput
LinkedIn's Kafka footprint grew roughly a thousandfold in four years:
- July 2011: 1 billion messages per day
- 2012: 20 billion messages per day
- July 2013: 200 billion messages per day
- 2015: 1 trillion messages per day, with a peak throughput of 4.5 million messages per second
By 2019, LinkedIn had surpassed 7 trillion messages per day, spread across more than 100 clusters, 4,000 brokers, 100,000 topics, and 7 million partitions. Each message was consumed by approximately four applications on average.
Data volume reached 1.34 PB per week at the 1 trillion messages per day milestone in 2015. With approximately 1,400 brokers in a later configuration, LinkedIn received over 2 PB of data per week.
Brooklin, used for cross-cluster replication, was itself mirroring more than 7 trillion messages per day as of 2022.
Individual messages are capped at 1 MB.
LinkedIn's Kafka architecture
LinkedIn's Kafka deployment is self-managed and multi-datacenter. The company operates its own data centers across US locations and Singapore.
Cluster topology
Each datacenter runs multiple Kafka clusters, each containing an independent set of data with distinct purposes. Activity tracking, Espresso replication, and metrics are kept in separate clusters. This isolation limits the blast radius of failures and allows cluster configurations to be tuned per workload.
The overall topology is tiered: producers write to a local Kafka cluster in their datacenter, data is replicated to aggregate clusters that consolidate events across datacenters, and consumers read from local copies. Data is copied only once per datacenter.
Cross-datacenter replication
LinkedIn replaced Kafka MirrorMaker (KMM) with Brooklin Mirror Maker (BMM) in 2018. A single Brooklin cluster handles hundreds of simultaneous datastreams, whereas the KMM approach required a separate cluster for each pipeline - an arrangement that became operationally unsustainable at LinkedIn's scale.
Kafka release branches
LinkedIn maintains internal release branches of Apache Kafka, suffixed -li, that are kept close to upstream. These are not a fork: LinkedIn submits patches upstream via Kafka Improvement Proposals (KIPs), either before cherry-picking them into the LinkedIn branch or shortly after committing a hotfix internally. New internal releases are cut approximately every quarter.
Schema management
All data pipelines at LinkedIn are standardised on Apache Avro. The LiKafka client handles schema registration with a central Schema Registry, Avro encoding and decoding, and auditing automatically on both producer and consumer sides. Consumers receive schemas without needing to manage them explicitly.
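The shape of that producer-side flow can be sketched as follows. This is an illustrative stand-in, not LinkedIn's actual LiKafka or Schema Registry code: the `InMemorySchemaRegistry` class and JSON encoding are assumptions for the sketch (real pipelines use Avro binary encoding and a networked registry). The key idea is that each message carries a compact schema ID, so consumers can fetch the schema without managing it themselves.

```python
# Hypothetical sketch of the flow the LiKafka client automates: register
# the schema once, then ship a schema ID prefix plus the encoded payload.
import json
import struct

class InMemorySchemaRegistry:
    """Stand-in for a central Schema Registry (illustrative only)."""
    def __init__(self):
        self._schemas = {}
        self._next_id = 1

    def register(self, schema: dict) -> int:
        # Deduplicate: re-registering an identical schema returns its ID.
        key = json.dumps(schema, sort_keys=True)
        for sid, stored in self._schemas.items():
            if stored == key:
                return sid
        sid, self._next_id = self._next_id, self._next_id + 1
        self._schemas[sid] = key
        return sid

    def fetch(self, schema_id: int) -> dict:
        return json.loads(self._schemas[schema_id])

def encode(registry, schema, record: dict) -> bytes:
    # Prefix the payload with a 4-byte schema ID so consumers can look
    # the schema up without any out-of-band coordination.
    sid = registry.register(schema)
    body = json.dumps(record).encode()  # real pipelines use Avro binary
    return struct.pack(">I", sid) + body

def decode(registry, payload: bytes):
    sid = struct.unpack(">I", payload[:4])[0]
    return registry.fetch(sid), json.loads(payload[4:].decode())
```

Because the schema travels by reference rather than by value, every consumer of a topic decodes records the same way, which is what lets LinkedIn share audit and tooling infrastructure across all pipelines.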
Producer architecture
LinkedIn's LiKafka client wraps the standard Kafka producer and adds schema registration, auditing, and large message support. Quota enforcement - per-producer bandwidth limits in bytes per second - is applied at the broker level to prevent any single application from saturating a cluster. Quotas can be exceeded by whitelisted users, and configuration changes take effect without a rolling broker restart.
Consumer architecture
Each Kafka message is consumed by approximately four applications on average. Offset management and consumer lag monitoring are handled through Burrow (described under Operating Kafka at scale). Consumer groups and lag are evaluated across every partition using a sliding-window model rather than static threshold comparisons.
Stream processing
Apache Samza handles all real-time stream processing at LinkedIn and depends on Kafka both as the event source and as a durable state store. Samza backs up stream processor state to log-compacted Kafka topics, enabling recovery after disk failures without replaying the entire topic history.
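The reason a log-compacted topic works as a state backup can be shown in a few lines. This sketch simulates Kafka's compaction semantics (keep the latest record per key, honour tombstones); replaying the compacted changelog rebuilds exactly the processor's current key/value state without reading the full topic history.

```python
# Minimal model of Kafka log compaction as used for Samza state backup:
# retain only the last value per key, so a restore replays a short log.
def compact(changelog):
    """changelog: iterable of (key, value); value None is a tombstone."""
    latest = {}
    for key, value in changelog:
        if value is None:
            latest.pop(key, None)   # tombstone deletes the key
        else:
            latest[key] = value     # later writes shadow earlier ones
    return latest
```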
Kafka Connect ecosystem
LinkedIn built Gobblin, a Kafka-to-Hadoop ingestion framework, to continuously copy Kafka data into HDFS for offline analytics. Gobblin replaced an earlier framework called Camus. Brooklin also serves as a streaming bridge between Kafka and external systems including Azure Event Hubs and AWS Kinesis.
Special techniques & engineering innovations
Time-based replica lag thresholds: LinkedIn changed the Kafka replica lag calculation from a bytes-behind threshold to a time-based threshold. The bytes model caused large messages to incorrectly mark replicas as out of sync during normal operation. Switching to time prevents false positives from message size spikes.
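The difference between the two policies is easy to see side by side. A hedged sketch, not Kafka's actual code: a single large message can push a perfectly healthy follower past a bytes threshold, while a time-based check only flags followers that have genuinely stopped catching up.

```python
# Contrast of the two replica-lag policies LinkedIn describes.
def in_sync_by_bytes(leader_offset_bytes, follower_offset_bytes,
                     max_lag_bytes):
    # Old model: flag the replica if it is too many bytes behind.
    return leader_offset_bytes - follower_offset_bytes <= max_lag_bytes

def in_sync_by_time(now_ms, last_caught_up_ms, max_lag_ms):
    # New model: flag the replica only if it has not fully caught up
    # with the leader within the allowed time window.
    return now_ms - last_caught_up_ms <= max_lag_ms
```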
Rack-aware replica placement: Replicas for a partition are placed on brokers in different datacenter racks. This prevents a top-of-rack switch failure from taking all replicas for a partition offline simultaneously.
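A round-robin placement over racks is one simple way to get this property; the sketch below is hypothetical (Kafka's actual rack-aware assignor is more involved) but shows the invariant being enforced: each of a partition's replicas lands in a distinct rack.

```python
# Hypothetical rack-aware placement: stride across racks so that no
# single top-of-rack switch failure can take out every replica.
def place_replicas(partition, brokers_by_rack, replication_factor):
    racks = sorted(brokers_by_rack)          # deterministic rack order
    placement = []
    for i in range(replication_factor):
        rack = racks[(partition + i) % len(racks)]   # one rack per replica
        brokers = brokers_by_rack[rack]
        placement.append(brokers[partition % len(brokers)])
    return placement
```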
TopicGC - automated unused topic garbage collection: LinkedIn built an internal service called TopicGC to detect unused Kafka topics, seal them (blocking reads and writes), disable mirroring, and delete them in batches of up to three concurrent deletions. A final usage check runs before each deletion step. In one of LinkedIn's largest data pipelines, TopicGC reduced the topic count by approximately 20% and improved produce and consume performance by at least 30%.
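The seal-check-delete sequence can be sketched as below. The helper names (`seal`, `disable_mirroring`, `delete`, `is_unused`) are hypothetical placeholders for TopicGC's internal operations, not its actual API; the structure mirrors the description above: seal candidates first, then delete in small batches with a final usage check before each deletion.

```python
# Sketch of a TopicGC-style workflow: seal unused topics, disable
# mirroring, then delete in bounded batches with a last-minute check.
MAX_CONCURRENT_DELETES = 3

def collect_garbage(topics, is_unused, seal, disable_mirroring, delete):
    deleted = []
    candidates = [t for t in topics if is_unused(t)]
    for t in candidates:
        seal(t)                 # block reads and writes
        disable_mirroring(t)
    for i in range(0, len(candidates), MAX_CONCURRENT_DELETES):
        for t in candidates[i:i + MAX_CONCURRENT_DELETES]:
            if is_unused(t):    # final usage check before deleting
                delete(t)
                deleted.append(t)
    return deleted
```

Sealing before deletion matters: a topic that turns out to be in use fails loudly while sealed and can be unsealed, rather than silently losing data on delete.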
Kafka as Espresso CDC backbone: Replacing MySQL replication with Kafka for Espresso required significantly stronger delivery guarantees than typical event streaming. LinkedIn tuned for no-data-loss delivery while maintaining the low transfer latencies Espresso's online traffic requires.
LiKafka REST service for non-Java clients: The original REST interface had no guaranteed delivery. LinkedIn redesigned it so an event is only considered consumed after successful delivery to the destination topic, enabling reliable mobile and non-Java ingestion.
Load-balanced Brooklin partition assignment: The default Brooklin partition assignment distributed partitions by count rather than throughput. LinkedIn migrated Brooklin's Kafka connectors from high-level consumer APIs to manual partition assignment APIs and introduced a throughput-based assignment strategy, reducing replication latency imbalance.
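A throughput-based strategy of this kind can be sketched as greedy bin packing: hand each partition, heaviest first, to whichever task currently carries the least load. This is an illustrative sketch of the idea, not Brooklin's implementation.

```python
# Greedy throughput-aware assignment: heaviest partitions first, each
# going to the currently lightest task (a classic LPT-style heuristic).
import heapq

def assign_by_throughput(partition_rates: dict, num_tasks: int):
    """partition_rates maps partition name -> observed bytes/sec."""
    heap = [(0, task, []) for task in range(num_tasks)]  # (load, id, parts)
    heapq.heapify(heap)
    for partition, rate in sorted(partition_rates.items(),
                                  key=lambda kv: kv[1], reverse=True):
        load, task, parts = heapq.heappop(heap)   # lightest task so far
        parts.append(partition)
        heapq.heappush(heap, (load + rate, task, parts))
    return {task: parts for _, task, parts in heap}
```

With skewed partitions, count-based assignment can leave one task carrying nearly all the traffic; the greedy heuristic keeps per-task loads close, which is what reduces the replication latency imbalance described above.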
Kafka Audit - end-to-end message completeness verification: LiKafka clients emit per-topic message counts in 10-minute windows to a dedicated audit topic. The Kafka Audit Service aggregates counts from producers, each tier of the cluster topology, and critical consumers such as Hadoop, verifying that no messages are lost or duplicated across the pipeline.
Operating Kafka at scale
Deployment model: LinkedIn runs Kafka on its own hardware across multiple datacenters. As of early 2022, a planned migration to Microsoft Azure was paused.
Consumer lag monitoring with Burrow: Burrow evaluates consumer group health across every partition using a sliding window of offset commits - approximately 10 commits, representing roughly 10 minutes of lag history at LinkedIn's typical commit rate. Rather than alerting on a fixed lag threshold, Burrow assesses the direction and rate of change of consumer lag to produce a health status: good, degraded, or stopped. It is exposed via an HTTP API and was open-sourced in 2015.
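The sliding-window evaluation can be approximated in a few lines. This is a simplified sketch of Burrow's idea, not its actual rule set: status comes from the trend of offsets and lag over the window rather than from any fixed lag threshold.

```python
# Simplified Burrow-style status from a sliding window of offset commits.
def evaluate(window):
    """window: list of (committed_offset, lag) samples, oldest first."""
    offsets = [o for o, _ in window]
    lags = [l for _, l in window]
    if all(l == 0 for l in lags):
        return "good"                        # fully caught up all window
    if offsets[-1] == offsets[0]:
        return "stopped"                     # lagging and committing nothing
    if lags[-1] > lags[0] and min(lags) > 0:
        return "degraded"                    # progressing, but falling behind
    return "good"                            # lag shrinking or transient
```

A consumer whose lag briefly spikes but is shrinking again by the end of the window stays "good", which is exactly the false positive a fixed threshold would trip on.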
Cluster rebalancing with Cruise Control: Broker failures occur at LinkedIn's scale on a daily basis. Manual SRE intervention for partition reassignment was not sustainable. Cruise Control continuously monitors disk, network, and CPU utilisation across all brokers and automatically reassigns partitions to meet pre-defined performance goals. It also handles self-healing on broker failure. LinkedIn open-sourced Cruise Control in August 2017. The Cruise Control Frontend (CCFE), a web dashboard for managing all Kafka clusters, was open-sourced in 2019.
Live availability monitoring with Kafka Monitor: Kafka Monitor runs synthetic producer and consumer applications against live clusters to continuously validate ordering, delivery, and data integrity guarantees and measure end-to-end latencies. It is also used to qualify new Kafka builds before they reach production.
Self-service topic management with Nuage: LinkedIn's Nuage portal gives teams self-service access to Kafka topic lifecycle operations, including topic creation, deletion, configuration changes, metadata browsing, and ACL management. It delegates operations to the Kafka REST service.
Usage attribution with Bean Counter: Bean Counter tracks megabytes sent and received per application, attributing Kafka infrastructure costs to individual teams and providing an incentive for responsible usage.
Chaos testing with Simoorg: LinkedIn integrates Simoorg, its open-source failure inducer, into Kafka release validation. Simoorg introduces low-level failures such as dropped packets, low memory, failed disk writes, and process kills to verify that new Kafka builds behave correctly under real failure conditions.
Quarterly release cadence: LinkedIn cuts a new internal Kafka release approximately every quarter, tracking open-source trunk closely and upstreaming patches where possible.
Challenges & how they solved them
Kafka MirrorMaker data loss on upgrades and reboots
Root cause: The original KMM design could lose messages when a cluster was upgraded or a machine rebooted, because the delivery contract did not guarantee persistence before marking a message as consumed.
Solution: LinkedIn first redesigned the delivery contract so a message is only marked consumed after successful delivery to the destination. Later, in 2018, LinkedIn replaced all KMM clusters with Brooklin, which also eliminated the operational overhead of managing hundreds of separate pipeline-specific clusters.
Outcome: Reliable cross-cluster replication at scale, with a single Brooklin cluster handling hundreds of simultaneous datastreams.
ZooKeeper controller initialisation bottleneck from metadata bloat
Root cause: Accumulation of unused Kafka topics inflated ZooKeeper response payloads. For LinkedIn's largest cluster, the ZooKeeper response size had reached 0.75 MB, close to the 1 MB limit that would trigger availability issues.
Solution: LinkedIn built TopicGC to automatically detect and delete unused topics in a controlled sequence, with a last-minute usage check before each deletion step.
Outcome: Topic count reduced by approximately 20% in one of LinkedIn's largest pipelines; produce and consume performance improved by at least 30%.
Unbalanced cluster load and partition skew after broker failures
Root cause: Broker failures at LinkedIn's scale occur daily. Manual partition reassignment by SREs was not a sustainable response.
Solution: LinkedIn built Cruise Control, which continuously monitors cluster resource utilisation and automatically reassigns partitions to meet disk, network, and CPU goals.
Outcome: Automated self-healing on broker failure; Cruise Control open-sourced and adopted across the industry.
Uneven load distribution across Brooklin replication tasks
Root cause: Brooklin's original partition assignment distributed partitions by count rather than by throughput, causing some tasks to process significantly more data than others.
Solution: LinkedIn migrated to manual partition assignment APIs and implemented a throughput-based partition assignment strategy.
Outcome: Reduced replication latency imbalance across Brooklin tasks.
False replica out-of-sync alerts caused by large messages
Root cause: The bytes-of-lag threshold used to determine replica health could be exceeded by a single large message, incorrectly marking a healthy replica as out of sync.
Solution: LinkedIn switched the replica lag calculation to a time-based threshold.
Outcome: Fewer false positive alerts and more accurate replica health reporting.
Key takeaways for your own Kafka implementation
- Separate clusters by workload type. LinkedIn runs multiple clusters per datacenter rather than a single shared cluster. This isolates failure domains and lets each cluster be tuned for its specific performance and durability requirements.
- Automate rebalancing from day one. Manual partition reassignment does not scale. LinkedIn's experience building Cruise Control demonstrates that continuous automated rebalancing is more reliable than on-demand SRE intervention, particularly as broker failure frequency increases with fleet size.
- Monitor consumer lag by behaviour, not by threshold. Burrow's sliding-window approach evaluates whether consumers are making progress relative to their recent history, rather than alerting when lag exceeds a fixed byte count. This reduces false positives and gives a more accurate picture of consumer health.
- Invest in topic lifecycle management. Unused topics accumulate over time and can create real operational problems at scale. TopicGC's results at LinkedIn - a 20% reduction in topic count and a 30% improvement in performance - demonstrate that proactive topic hygiene is worth automating.
- Standardise serialisation across all pipelines early. LinkedIn's universal adoption of Apache Avro and a central Schema Registry allowed teams to share infrastructure and tooling. Retrofitting schema standards onto an existing pipeline ecosystem is significantly harder.
Sources & further reading
- [S1] Jay Kreps, LinkedIn Engineering Blog (2011): https://www.linkedin.com/blog/member/archive/open-source-linkedin-kafka
- [S2] Todd Palino, "Running Kafka at scale", LinkedIn Engineering Blog (March 2015): https://engineering.linkedin.com/kafka/running-kafka-scale
- [S3] Joel Koshy, "Kafka ecosystem at LinkedIn", LinkedIn Engineering Blog (~2016): https://www.linkedin.com/blog/engineering/open-source/kafka-ecosystem-at-linkedin
- [S4] Kartik Paramasivam, "How we're improving and advancing Kafka at LinkedIn", LinkedIn Engineering Blog (September 2015): https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin
- [S5] LinkedIn Engineering team, "Apache Kafka: more than 1 trillion messages", LinkedIn Engineering Blog (2019): https://www.linkedin.com/blog/engineering/open-source/apache-kafka-trillion-messages
- [S6] LinkedIn Streaming Infrastructure team, "Load-balanced Brooklin Mirror Maker", LinkedIn Engineering Blog (April 2022): https://engineering.linkedin.com/blog/2022/load-balanced-brooklin-mirror-maker--replicating-large-scale-kaf
- [S7] LinkedIn Streaming team, "Brooklin: near real-time data streaming at scale", LinkedIn Engineering Blog (August 2019): https://engineering.linkedin.com/blog/2019/brooklin-open-source
- [S8] Junaid Effendi, "LinkedIn data tech stack", Substack (October 2024): https://www.junaideffendi.com/p/linkedin-data-tech-stack
- [S9] Joseph Lin, Lincong Li, "TopicGC: how LinkedIn cleans up unused Kafka metadata", LinkedIn Engineering Blog (November 2022): https://engineering.linkedin.com/blog/2022/topicgc_how-linkedin-cleans-up-unused-metadata-for-its-kafka-clu
- [S10] Todd Palino, Clark Haskins, Grayson Chao, Jon Bringhurst, "Burrow: Kafka consumer monitoring reinvented", LinkedIn Engineering Blog (June 2015): https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
- [S11] Efe Gencer + Kafka team, "Open-sourcing Kafka Cruise Control", LinkedIn Engineering Blog (August 2017): https://engineering.linkedin.com/blog/2017/08/open-sourcing-kafka-cruise-control
If you are managing a Kafka deployment of your own, Kpow provides observability, control, and governance for Kafka clusters across self-managed, MSK, and Confluent environments. You can try it free for 30 days.