
How ByteDance uses Apache Kafka in production

May 23rd, 2026


ByteDance ran Apache Kafka at a scale that eventually made rebuilding it the more practical option. At peak, its streaming data layer handles tens of TB/s of consume throughput across applications that span short-video recommendation, real-time ML training, and cross-team data integration. Rather than continue tuning a standard Kafka deployment, ByteDance built ByteMQ (BMQ): a Kafka-compatible, cloud-native replacement that separates storage from compute. BMQ has now replaced 99.76% of the company's Kafka clusters, at roughly 70% lower resource cost. Understanding how the company got there, and what the architecture looks like today, is instructive for any team planning a streaming platform at similar scale.

Company overview

ByteDance operates the portfolio of applications behind TikTok, Douyin, Toutiao, and several other consumer products. The company's platforms generate continuous high-volume user-interaction data: every view, like, share, and comment on TikTok becomes a signal that feeds recommendation models, content ranking, and advertising systems in near real time. That combination of volume and latency sensitivity pushed ByteDance toward event streaming infrastructure early in the company's growth.

Kafka was adopted as the primary message bus for event and log collection. It served as the backbone for stream processing pipelines, online model training, and data integration across ByteDance's business units. Over time, the throughput requirements outgrew what a standard Kafka deployment could serve cost-effectively, and the engineering team began work on a replacement. The timeline below covers the key milestones from the documented record.

Key Kafka milestones:

Pre-2022: Kafka operating as ByteDance's primary event and log collection bus
September 2022: Monolith paper published, documenting Kafka's role in TikTok's real-time recommendation training pipeline
2023: StreamOps published at VLDB, describing runtime management of tens of thousands of Flink jobs against Kafka-compatible queues
November 2024: ByteMQ paper published at ACM SoCC; 99.76% Kafka-to-BMQ migration reported complete; 70% resource cost reduction achieved
February 2026: StreamShield paper published; Flink cluster now sustains 70,000+ concurrent streaming jobs across 11 million+ resource slots

ByteDance's Kafka use cases

Event and log collection

Kafka served as ByteDance's central data hub for collecting events and user-activity logs across its applications, covering online model training inputs, stream data processing, and real-time analytics. This is the foundational use case that drove initial adoption and still underpins the workloads now running on ByteMQ.

Real-time recommendation training

TikTok's recommendation system is built on Monolith, ByteDance's distributed real-time ML framework. Kafka plays a structural role in the training pipeline: one queue carries raw user-action events (views, likes, shares), and a second carries feature data. A Flink streaming job reads from both queues simultaneously, joining each user action with its corresponding features to produce labelled training examples. Those examples flow directly into Monolith's training parameter server, which updates model weights continuously throughout the day. This architecture allows the recommendation model to adapt to shifting user behaviour without waiting for periodic batch retraining cycles.

Stream processing

ByteDance runs one of the largest documented Apache Flink deployments in production. As of early 2026, the cluster sustains over 70,000 concurrent streaming jobs backed by more than 11 million resource slots. The message queues underlying those jobs are now ByteMQ, but the workload pattern is a direct continuation of the Kafka-based streaming infrastructure that preceded it.

Data integration

BitSail, ByteDance's open-source data integration engine, provides both Kafka source and Kafka sink connectors, enabling data synchronisation between Kafka topics and downstream stores. The system processes hundreds of trillions of records per day across ByteDance's data estate, supporting batch, streaming, and incremental sync patterns.

Scale & throughput

Peak consume rate: Tens of TB/s (ByteMQ paper, ACM SoCC 2024)
Peak produce rate: Approximately one-fifth of the consume rate (ByteMQ paper, ACM SoCC 2024)
Kafka clusters migrated to BMQ: 99.76% (ByteMQ paper, ACM SoCC 2024)
Resource cost reduction post-migration: ~70% (ByteMQ paper, ACM SoCC 2024)
Concurrent Flink streaming jobs: 70,000+ (StreamShield paper, arXiv Feb 2026)
Flink resource slots managed: 11 million+ (StreamShield paper, arXiv Feb 2026)
Data synchronised per day (BitSail): Hundreds of trillions of records (BitSail GitHub repo)

The consume-to-produce ratio of roughly five-to-one reflects ByteDance's fan-out consumption patterns, where multiple downstream consumers (recommendation, analytics, logging, model training) each read from the same upstream event streams.
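
The fan-out arithmetic behind that ratio can be sketched in a few lines. The topic names, produce rates, and consumer-group counts below are hypothetical and chosen only to show how a ratio near five-to-one can arise; they are not ByteDance figures.

```python
# Hypothetical topics: name -> (produce rate in TB/s, number of consumer groups).
# None of these figures come from ByteDance; they only illustrate the arithmetic.
TOPICS = {
    "user_actions": (2.0, 6),
    "features": (1.0, 4),
    "service_logs": (1.0, 3),
}


def produce_rate(topics):
    """Total bytes written per second across all topics."""
    return sum(rate for rate, _groups in topics.values())


def consume_rate(topics):
    """Total bytes read per second: each group reads every byte of its topic once."""
    return sum(rate * groups for rate, groups in topics.values())


def fan_out(topics):
    """Average number of times each produced byte is consumed."""
    return consume_rate(topics) / produce_rate(topics)


print(f"produce={produce_rate(TOPICS)} TB/s, "
      f"consume={consume_rate(TOPICS)} TB/s, "
      f"fan-out={fan_out(TOPICS):.2f}")
```

The practical consequence is that read capacity, not write capacity, dominates sizing when many teams subscribe to the same streams.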

ByteDance's Kafka architecture

From Kafka to ByteMQ

ByteDance's streaming architecture is best understood in two phases: Kafka as adopted, and ByteMQ as the current system. The transition happened because standard Kafka couples storage and compute: adding storage capacity requires adding brokers, and adding brokers for throughput also adds storage. At ByteDance's scale, this relationship became expensive. The company's response was to build ByteMQ, a Kafka-compatible system that breaks that coupling.

ByteMQ is Kafka API-compatible by design. Producers and consumers continue to use standard Kafka SDKs, and the Pub/Sub Engine within BMQ presents the familiar topic and consumer-group interface. This compatibility requirement was a deliberate engineering constraint, chosen to de-risk the migration. Switching infrastructure at the scale of tens of TB/s while keeping application code unchanged is a significant operational achievement.

ByteMQ architecture

Three architectural decisions define BMQ:

Storage-compute separation. Rather than writing message data to local broker disks, BMQ persists all data to ByteDance's internal Federated Distributed File System (DFS). Brokers handle routing and API serving, while the DFS handles durability. This means brokers can be scaled independently of storage, and vice versa. In ByteDance's own comparative benchmarks, a scale-out operation on clusters provisioned at approximately 1,000 TB of storage and 200 CPU cores required adding roughly 5 new BMQ brokers versus approximately 50 new brokers for an equivalent Kafka cluster, a reflection of how much more efficiently BMQ can redistribute load when brokers are not tied to their local disk state.
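
A rough way to see why the coupling matters is to size a cluster both ways. This is a minimal sketch under stated assumptions: the per-broker disk and throughput capacities are hypothetical, replication is ignored, and the function names are illustrative rather than anything from the ByteMQ paper.

```python
import math


def coupled_brokers(storage_tb, throughput_gb_s, disk_tb_per_broker, gb_s_per_broker):
    """Brokers needed when each broker supplies both local disk and serving capacity.

    The cluster must satisfy whichever requirement is larger, so the other
    resource ends up over-provisioned.
    """
    by_storage = math.ceil(storage_tb / disk_tb_per_broker)
    by_throughput = math.ceil(throughput_gb_s / gb_s_per_broker)
    return max(by_storage, by_throughput)


def decoupled_brokers(throughput_gb_s, gb_s_per_broker):
    """Brokers needed when storage lives in a separate distributed file system."""
    return math.ceil(throughput_gb_s / gb_s_per_broker)


# Hypothetical workload: long retention (lots of storage), moderate throughput.
STORAGE_TB = 480
THROUGHPUT_GB_S = 6
DISK_TB_PER_BROKER = 12     # assumed local disk per broker
GB_S_PER_BROKER = 0.75      # assumed serving capacity per broker

print("coupled:  ", coupled_brokers(STORAGE_TB, THROUGHPUT_GB_S,
                                    DISK_TB_PER_BROKER, GB_S_PER_BROKER))
print("decoupled:", decoupled_brokers(THROUGHPUT_GB_S, GB_S_PER_BROKER))
```

In the coupled case the storage requirement sets the broker count and most of the added serving capacity sits idle; in the decoupled case storage growth is absorbed by the file system and the broker count follows throughput alone.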

Adaptive resource scheduling. BMQ redistributes workloads across brokers and across multiple availability zones dynamically, balancing throughput and isolating failures without manual intervention. This addresses a common operational burden in large Kafka deployments, where partition reassignment is a heavyweight and often risky operation.

Historical data restructuring. BMQ restructures older streaming data into formats that are efficient for offline batch consumption. This allows both real-time stream processors and batch analytics jobs to consume from the same underlying data layer, without requiring separate pipelines or dedicated archival infrastructure.

Recommendation pipeline (Monolith)

The Monolith recommendation training pipeline uses Kafka as its inter-component transport with a specific two-queue architecture:

Action queue: carries user-action events from TikTok (views, likes, shares, follows)
Feature queue: carries pre-computed feature data from the feature store
A Flink online joiner reads from both queues simultaneously and joins each action event with its corresponding features to produce labelled training examples
Training examples are written to Monolith's training parameter server (training-PS), which updates model weights in real time
The inference parameter server (inference-PS) serves live traffic and syncs periodically from the training-PS

This design keeps the training loop continuously fed from live user-interaction data, rather than waiting for a batch export cycle.
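
The joining step can be sketched as a small in-memory state machine. This is an illustration only: in production the join runs as a Flink job reading from two queues, whereas the sketch below uses plain method calls, a hypothetical request_id join key, and unbounded dictionaries where a real joiner would need eviction and durable state.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineJoiner:
    """Joins action events with feature records that share a request_id (assumed key)."""
    features: dict = field(default_factory=dict)         # request_id -> feature dict
    pending_actions: dict = field(default_factory=dict)  # request_id -> [action, ...]

    def on_feature(self, request_id, feats):
        """Handle a record from the feature queue; emit examples for waiting actions."""
        self.features[request_id] = feats
        waiting = self.pending_actions.pop(request_id, [])
        return [(feats, action) for action in waiting]

    def on_action(self, request_id, action):
        """Handle a record from the action queue; emit an example if features are known."""
        feats = self.features.get(request_id)
        if feats is None:
            # Features have not arrived yet: park the action until they do.
            self.pending_actions.setdefault(request_id, []).append(action)
            return []
        return [(feats, action)]


joiner = OnlineJoiner()
examples = []
examples += joiner.on_feature("r1", {"user": 7, "video": 42})
examples += joiner.on_action("r1", "like")
examples += joiner.on_action("r2", "share")            # arrives before its features
examples += joiner.on_feature("r2", {"user": 7, "video": 43})
for feats, label in examples:
    print(feats, "->", label)
```

Buffering in both directions is the important design point: the two queues are independent, so neither side can be assumed to arrive first.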

Producer architecture

ByteDance's public sources do not document specific Kafka producer configuration parameters (acks, batching, compression). What the Monolith paper does describe is the producers' role: they write action events and feature data into their respective queues as part of the online training loop.

Consumer architecture

Consumer groups in BMQ follow the same semantics as Kafka consumer groups: consumers are organised into groups, with each partition assigned to one consumer in the group at a time. The Kafka API compatibility layer in BMQ's Pub/Sub Engine means existing consumer group code runs unchanged against BMQ.
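
The one-partition-one-consumer rule can be illustrated with a simplified round-robin assignment. This is a sketch of the semantics only, not the assignor that Kafka clients or BMQ actually run; the consumer names and partition count are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give every partition to exactly one consumer in the group, round-robin."""
    members = sorted(consumers)
    if not members:
        raise ValueError("a consumer group needs at least one member")
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = range(6)

# Two consumers share six partitions.
before = assign_round_robin(partitions, ["c1", "c2"])
print(before)

# A third consumer joins; a rebalance spreads the same partitions across three members.
after = assign_round_robin(partitions, ["c1", "c2", "c3"])
print(after)
```

Because each partition has a single owner within the group, adding consumers beyond the partition count leaves the extra members idle.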

Stream processing

Apache Flink is ByteDance's primary stream processing engine. The Flink cluster is managed by two internal systems: StreamOps handles runtime lifecycle management (auto-scaling, straggler detection, automated recovery), and StreamShield provides resiliency mechanisms at the engine and cluster levels. Both systems interact with BMQ as the message source.

Special techniques & engineering innovations

Kafka-compatible replacement at scale (ByteMQ)

The ByteMQ migration is itself a notable engineering decision. Rather than adopting a third-party Kafka-compatible system or moving to a different messaging model, ByteDance built and operated its own replacement. The design choice to maintain full Kafka SDK compatibility meant that migration could proceed incrementally across 99.76% of clusters without requiring application changes. The 70% resource cost reduction suggests the storage-compute coupling in standard Kafka was contributing significantly to infrastructure spend at ByteDance's scale.

Online continuous training via Kafka (Monolith)

The two-queue Kafka architecture in Monolith enables a training loop that runs continuously from live user interactions. The Flink join step is critical: by joining action events with features in a streaming job rather than storing features alongside events, the system keeps feature data up to date without having to re-embed stale feature values into the event stream at write time. This produces training examples that reflect the feature state at the time the action occurred rather than a fixed snapshot.

Collisionless embedding table

Monolith uses a hash-map-based embedding table rather than a fixed-size embedding table. Embeddings expire when unused (avoiding accumulation of long-tail entities), and frequency filtering prevents low-signal features from consuming memory. This keeps the embedding memory footprint bounded without losing model quality to hash collisions, which are common in the fixed-size approach.
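
A toy version of the idea, assuming a Python dict in place of the production hash map and caller-supplied timestamps in place of a real clock. The class name, the min_count and ttl_seconds parameters, and the zero-initialised vectors are all illustrative choices, not Monolith's API.

```python
class CollisionlessEmbeddingTable:
    """Keyed embedding rows with frequency filtering and expiry (illustrative only)."""

    def __init__(self, dim, min_count, ttl_seconds):
        self.dim = dim
        self.min_count = min_count    # occurrences required before a row is allocated
        self.ttl = ttl_seconds        # idle time after which a row is evicted
        self.counts = {}              # feature_id -> occurrences seen so far
        self.rows = {}                # feature_id -> [vector, last_access_time]

    def lookup(self, feature_id, now):
        """Return the embedding for feature_id, or None if it is still filtered out."""
        self.counts[feature_id] = self.counts.get(feature_id, 0) + 1
        if feature_id not in self.rows:
            if self.counts[feature_id] < self.min_count:
                return None           # too rare so far: no memory spent on it
            self.rows[feature_id] = [[0.0] * self.dim, now]
        row = self.rows[feature_id]
        row[1] = now                  # refresh last-access time
        return row[0]

    def expire(self, now):
        """Evict rows that have been idle longer than the TTL; return how many."""
        stale = [fid for fid, (_vec, last) in self.rows.items() if now - last > self.ttl]
        for fid in stale:
            del self.rows[fid]
            self.counts.pop(fid, None)
        return len(stale)


table = CollisionlessEmbeddingTable(dim=2, min_count=2, ttl_seconds=10)
print(table.lookup("video:42", now=0))   # first sighting, filtered
print(table.lookup("video:42", now=1))   # second sighting, row allocated
print(table.expire(now=12))              # idle for 11 seconds, evicted
```

Every admitted ID owns its row, so two entities never share a vector; the filter and the TTL are what stop the table from growing without bound. The exact counter map used here is itself unbounded, which a production system would need to address.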

Region checkpointing (StreamShield)

Standard Flink checkpointing operates at the job level: a checkpoint failure triggers recovery for the entire job. StreamShield introduces region-level checkpointing, which narrows the recovery scope to the portion of the job that actually failed. This change improved checkpoint success rates in ByteDance's production Flink cluster from 53.9% to 93.5%.
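
The scoping idea can be sketched by grouping tasks into regions and recovering only the region that contains the failure. The sketch assumes a region is a connected component of the task graph and uses hypothetical task names; the StreamShield paper should be consulted for how regions are actually defined and how snapshots are reconciled.

```python
from collections import defaultdict, deque


def build_regions(tasks, edges):
    """Group tasks into regions, taken here to be connected components of the graph."""
    neighbours = defaultdict(set)
    for upstream, downstream in edges:
        neighbours[upstream].add(downstream)
        neighbours[downstream].add(upstream)

    regions, region_of = [], {}
    for task in tasks:
        if task in region_of:
            continue
        component, queue = set(), deque([task])
        while queue:
            current = queue.popleft()
            if current in component:
                continue
            component.add(current)
            queue.extend(neighbours[current] - component)
        for member in component:
            region_of[member] = len(regions)
        regions.append(component)
    return regions, region_of


def tasks_to_recover(failed_task, regions, region_of):
    """Only the region containing the failed task is restored, not the whole job."""
    return regions[region_of[failed_task]]


# Hypothetical job: two independent source -> map -> sink lanes.
TASKS = ["src-0", "map-0", "sink-0", "src-1", "map-1", "sink-1"]
EDGES = [("src-0", "map-0"), ("map-0", "sink-0"),
         ("src-1", "map-1"), ("map-1", "sink-1")]

regions, region_of = build_regions(TASKS, EDGES)
print(sorted(tasks_to_recover("map-1", regions, region_of)))
```

The benefit depends on topology: a job whose tasks are all connected through shuffles collapses into a single region, and region-level scoping then offers nothing over job-level recovery.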

Adaptive shuffle (StreamShield)

StreamShield's Adaptive Shuffle enables dynamic, load-aware data redistribution across Flink task managers. When a task manager becomes a bottleneck, the shuffle layer redistributes partitions to underutilised nodes without a full job restart. This reduces the frequency of hot-spot-induced failures under uneven workloads.
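
Load-aware routing can be contrasted with static round-robin in a short simulation. This is a sketch under assumptions: it routes individual records rather than partitions, applies only where records are not tied to a key, and treats backlog as a single number per downstream channel. None of the names below come from StreamShield.

```python
def route_round_robin(record_costs, initial_backlog):
    """Static routing: cycle through channels regardless of how busy they are."""
    backlog = list(initial_backlog)
    for index, cost in enumerate(record_costs):
        backlog[index % len(backlog)] += cost
    return backlog


def route_adaptive(record_costs, initial_backlog):
    """Load-aware routing: send each record to the channel with the least backlog."""
    backlog = list(initial_backlog)
    for cost in record_costs:
        target = min(range(len(backlog)), key=backlog.__getitem__)
        backlog[target] += cost
    return backlog


# Hypothetical state: channel 0 is already congested before six unit-cost records arrive.
START = [10, 0, 0]
RECORDS = [1] * 6

print("round-robin:", route_round_robin(RECORDS, START))
print("adaptive:   ", route_adaptive(RECORDS, START))
```

Round-robin keeps feeding the congested channel, while the adaptive policy steers new work to the idle ones and lets the hot spot drain.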

Operating Kafka at scale

StreamOps (runtime lifecycle management). ByteDance built StreamOps, published at VLDB 2023, as a cloud-native control plane for the Flink cluster. StreamOps runs three control policies concurrently across all streaming jobs: an auto-scaler that adjusts task manager allocation in response to throughput changes, a straggler detector that identifies slow-running tasks within jobs, and a job doctor that diagnoses and remediates common failure patterns automatically. StreamOps manages tens of thousands of concurrent jobs, many of which run continuously for days or longer against BMQ queues. Operating at that concurrency level without a centralised control plane would require proportionally more manual intervention per incident.
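
Two of those policies reduce to simple rules in their most basic form. The sketch below is not StreamOps' algorithm: the median-based straggler rule, the utilisation-based scaling rule, and every threshold and task name are assumptions chosen for illustration.

```python
import math
from statistics import median


def find_stragglers(records_per_sec, threshold=0.5):
    """Flag tasks whose throughput is below `threshold` times the job's median."""
    typical = median(records_per_sec.values())
    return sorted(task for task, rate in records_per_sec.items()
                  if rate < threshold * typical)


def target_parallelism(input_rate, per_task_capacity, target_utilisation=0.8):
    """Tasks needed to absorb input_rate while keeping each task below the target load."""
    return math.ceil(input_rate / (per_task_capacity * target_utilisation))


# Hypothetical per-task throughput for one job (records per second).
RATES = {"task-0": 1000, "task-1": 980, "task-2": 310, "task-3": 1020}

print("stragglers:", find_stragglers(RATES))
print("parallelism:", target_parallelism(input_rate=12000, per_task_capacity=1000))
```

Comparing each task with its peers rather than with a fixed threshold lets one rule apply across jobs with very different throughput, which matters when a single control plane watches tens of thousands of them.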

StreamShield (production resiliency). StreamShield, published February 2026, extends StreamOps with a resiliency layer built around four mechanisms: runtime optimisation (adaptive shuffle, autoscaling), fine-grained fault tolerance (region checkpointing, single-task recovery), hybrid replication (combining passive and active strategies for cluster-level fault tolerance), and a testing and release pipeline that includes chaos testing, micro-benchmarking, macro-benchmarking, and online probe tasks before each production deployment. As of the paper's publication, StreamShield sustains over 70,000 concurrent streaming jobs and manages more than 11 million resource slots, with thousands of recovery events per day handled automatically.

Job startup overhead. StreamShield reduced job startup overhead from approximately 500 seconds to approximately 200 seconds at scale, measured in test configurations with thousands of task managers. Faster restarts reduce the window during which a failed job is not processing data, which matters for latency-sensitive pipelines.

Data integration with BitSail. ByteDance open-sourced BitSail, a distributed data integration engine that supports Kafka as both a source and a sink. It is deployed across cloud-native and on-premises environments within ByteDance, and the repository documents support for hundreds of trillions of daily records. For teams looking to evaluate it, the project is available at the BitSail GitHub repository.

Challenges & how they solved them

Kafka could not scale cost-effectively at ByteDance's volume

Storage and compute are coupled in standard Kafka; adding storage requires adding brokers, and vice versa. At ByteDance's volume, this relationship meant resource costs scaled proportionally with throughput in a way that became unsustainable. ByteDance built ByteMQ with storage-compute separation, using its internal Federated DFS as the storage layer, and maintained full Kafka API compatibility to allow migration without application changes. The migration reached 99.76% of Kafka clusters with approximately 70% reduction in resource costs.

Flink job failures at tens of thousands of concurrent jobs required extensive manual intervention

Job-level recovery scope meant any failure required restarting the full job, and without a centralised control plane, intervention at that concurrency level was proportionally costly. ByteDance built StreamOps for automated lifecycle management and StreamShield for region-level checkpointing and adaptive shuffle. Checkpoint success rate improved from 53.9% to 93.5%, job startup overhead fell from approximately 500 seconds to approximately 200 seconds, and thousands of daily recovery events are now handled automatically.

Recommendation model freshness lagged user behaviour

Batch retraining cycles introduced hours of delay between a user action and the corresponding model weight update. Monolith's two-queue Kafka architecture feeds a Flink online joining step that produces training examples directly from live event streams, allowing model weights to update continuously throughout the day.

Fixed-size embedding tables caused hash collisions and degraded recommendation quality

With a fixed-size table, feature IDs are hashed into a limited number of slots, so multiple distinct entities can map to the same embedding slot, and collisions become more frequent as the number of entities grows. Monolith's collisionless hash-map embedding table assigns each entity its own slot, with expirable embeddings and frequency filtering to keep the memory footprint bounded. This eliminates hash collisions without trading off model quality for sparse features.

Full tech stack

Category | Tools | Notes
Message broker (original) | Apache Kafka | Event and log collection; inter-service messaging; ML pipeline transport
Message broker (current) | ByteMQ (BMQ), internal | Kafka API-compatible; storage-compute separated; backed by ByteDance's Federated DFS
Persistent storage (BMQ) | ByteDance Federated DFS, internal | Distributed file system backing BMQ's storage layer
Stream processing | Apache Flink | 70,000+ concurrent jobs; real-time feature joining, analytics, online model training
Online ML training | Monolith, internal (open source) | Flink-based online joiner reads two Kafka queues to produce training examples for continuous recommendation model training
Data integration | BitSail, open source | Kafka source and sink; hundreds of trillions of records/day
Streaming runtime management | StreamOps, internal | Auto-scaler, straggler detector, job doctor for Flink jobs at scale
Streaming resiliency | StreamShield, internal | Region checkpointing, adaptive shuffle, hybrid replication, chaos-tested release pipeline

Key contributors

Yancan Mao: Lead author on both ByteMQ (ACM SoCC 2024) and StreamOps (VLDB 2023). Sources: ByteMQ paper, StreamOps paper
Zhanghao Chen: Co-author, StreamOps. Source: StreamOps paper
Ruohang Yin, Liyuan Lei, Peng Ye, Shengfu Zou, Shizheng Tang, Yunzhe Guo, Ye Yuan, Xiaochen Yu, Bo Wan, Yunfei Gong, Changli Gao, Guanghui Zhang, Jian Shen, Rui Shi: Co-authors, ByteMQ (ByteDance Inc.). Source: ByteMQ paper
Yong Fang, Yuxing Han, Meng Wang, Yifan Zhang, Yue Ma, Chi Zhang: Authors, StreamShield. Source: StreamShield paper

Key takeaways for your own Kafka implementation

Storage-compute coupling becomes expensive at high throughput. ByteDance's decision to build a Kafka-compatible replacement rather than continue scaling standard Kafka was driven by the proportional growth in resource cost. If your Kafka storage and compute are scaling together but you primarily need one or the other, evaluating storage-offload options (tiered storage, object storage backends) earlier may avoid a more significant architectural change later.
Kafka API compatibility is a practical migration constraint. ByteDance's decision to keep ByteMQ fully compatible with Kafka SDKs allowed it to migrate 99.76% of clusters without touching application code. If you are designing internal messaging infrastructure, preserving the Kafka protocol interface keeps your migration options open.
Real-time training pipelines benefit from separated queues with a streaming joiner. Monolith's two-queue approach keeps raw events and feature data on separate Kafka topics, joined in real time by a Flink job. This pattern avoids embedding feature state into the event at write time, which would require re-ingesting events whenever features change.
Job-level recovery scope does not work at tens of thousands of concurrent jobs. ByteDance's region checkpointing approach in StreamShield reduced the blast radius of individual failures and improved checkpoint success rates from 53.9% to 93.5%. If you are scaling a Flink deployment significantly, the granularity of your checkpointing and recovery scope is worth reviewing before it becomes a bottleneck.
A centralised control plane for streaming jobs reduces per-incident manual work. StreamOps manages auto-scaling, straggler detection, and automated recovery across ByteDance's Flink cluster. At smaller concurrency levels, per-job monitoring is manageable; at tens of thousands of jobs, automation at the control-plane level is necessary.

Sources & further reading

Yancan Mao et al. (ByteDance) — ByteMQ: A Cloud-native Streaming Data Layer in ByteDance — ACM SoCC 2024: https://dl.acm.org/doi/10.1145/3698038.3698536
Yancan Mao, Zhanghao Chen et al. (ByteDance) — StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance — VLDB 2023: https://dl.acm.org/doi/abs/10.14778/3611540.3611543
Yong Fang et al. (ByteDance) — StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance — arXiv, February 2026: https://arxiv.org/abs/2602.03189
ByteDance AI Lab — Monolith: Real Time Recommendation System With Collisionless Embedding Table — arXiv, September 2022: https://arxiv.org/abs/2209.07663
ByteDance — BitSail: Distributed high-performance data integration engine — GitHub: https://github.com/bytedance/bitsail
Apache Software Foundation — Powered By Apache Kafka — ByteDance entry: https://kafka.apache.org/powered-by/

If you want visibility into the consumer groups, lag metrics, and topic throughput in your own Kafka cluster, give Kpow a try with a free 30-day trial. It connects to any Kafka cluster in minutes and deploys via Docker, Helm, or JAR.

