Apache Kafka architecture: a complete guide to internals, components, and deployment

Factor House
May 30th, 2026

How Kafka works: a mental model

Kafka is a distributed commit log. That framing matters: it is not a message queue, not a database, and not a pub-sub system, though it borrows ideas from all three. Understanding how it works starts with understanding what the log is and why that design choice dictates everything else.

At its core, Kafka stores records in topics. Each topic is divided into one or more partitions, and each partition is an ordered, immutable sequence of records written to disk. Producers append records to the end of a partition; consumers read from any position in the log by specifying an offset. Nothing is deleted when a consumer reads a record - the record stays in the log until a retention policy removes it. Multiple consumers reading the same partition at the same time each read independently, at their own position.
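The log-and-offset model can be sketched in a few lines of Python. This is a toy in-memory illustration, not Kafka's storage code; the PartitionLog class and the consumer names are hypothetical.

```python
class PartitionLog:
    """Toy append-only log: reading never removes a record."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; the log is unchanged."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Each consumer tracks its own position; neither affects the other.
positions = {"analytics": 0, "billing": 2}
analytics_batch = log.read(positions["analytics"])
billing_batch = log.read(positions["billing"])
```

Here analytics_batch holds all three records while billing_batch holds only the last one, and the log itself is identical after both reads.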

This design gives Kafka two properties that are difficult to achieve together in traditional message brokers: high throughput and replay. Because writes are sequential disk appends, they are fast. Because data persists, any consumer can re-read historical data, replay a stream from a specific point, or bootstrap a new consumer group without disturbing others.

Kafka separates producers from consumers entirely. A producer writes records to Kafka; it has no knowledge of which consumers exist or where they are reading. A consumer reads from Kafka; it has no interaction with producers. The log in between is the only coupling point. This decoupling is why Kafka scales: adding a new consumer does not affect producers or existing consumers, and a producer can continue writing whether consumers are online or not.

The cluster is a set of brokers. Each broker stores a subset of partitions. Some partitions are replicated across multiple brokers for durability. The cluster is managed by a controller, which tracks partition leadership and handles broker failures. As of Kafka 4.0, that controller runs on an internal consensus protocol called KRaft, with no external coordination dependency.

The Kafka log

A Kafka partition is a log. Physically, it is a directory on the broker's filesystem. Inside that directory are one or more log segments, each of which is a pair of files: a .log file containing the raw record data, and an .index file that maps record offsets to byte positions in the .log file. A third file, the .timeindex, maps timestamps to offsets, enabling time-based lookups.
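As a rough sketch of how an offset index is used: Kafka's offset index is sparse, so a lookup finds the nearest indexed entry at or below the target offset and scans the .log file forward from that byte position. The index entries and the scan_start helper below are hypothetical, and real index files store offsets relative to the segment's base offset.

```python
import bisect

# Hypothetical sparse index for one segment: (offset, byte_position), sorted by offset.
SEGMENT_INDEX = [(0, 0), (40, 4096), (85, 8192), (130, 12288)]


def scan_start(index, target_offset):
    """Return the byte position from which to scan forward for target_offset."""
    offsets = [offset for offset, _ in index]
    # Rightmost entry whose offset is <= target_offset.
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} precedes this segment")
    return index[i][1]


start = scan_start(SEGMENT_INDEX, 100)  # nearest entry is (85, 8192)
```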

Segments are rolled when they exceed a size threshold (controlled by log.segment.bytes, defaulting to 1 GB) or when they age past a configured interval (log.roll.ms). Only the most recent segment in a partition is active and writable; older segments are sealed and can be cleaned or deleted by background threads.

Kafka uses two separate retention mechanisms:

Delete retention: Records older than a configured age (log.retention.ms) or beyond a size cap (log.retention.bytes) are deleted. The broker removes entire segments - it does not delete individual records from within a segment. This is the default for most topics.

Compacted retention: For topics with cleanup.policy=compact, Kafka retains only the most recent record for each key. A background thread called the log cleaner compares dirty segments (containing duplicate keys) against clean segments and writes a compacted output. A null-valued record (a tombstone) marks a key for deletion. The cleaner thread is governed by log.cleaner.min.cleanable.ratio (min.cleanable.dirty.ratio at the topic level), which sets the minimum fraction of dirty log that must exist before compaction runs. Compacted topics are appropriate when you care about the latest state of a key rather than the full history: a user profile topic, for example, or a Kafka Streams changelog topic.

You can apply both policies to the same topic using cleanup.policy=compact,delete. Compaction removes redundant keys; deletion caps total retention by time or size.
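The effect of compaction can be illustrated with a small function. This is a simplified sketch, not the log cleaner's algorithm: it compacts the whole input in one pass and drops tombstones immediately, whereas a real broker works segment by segment and keeps tombstones for a grace period. The record tuples are hypothetical.

```python
def compact(records):
    """Keep the latest record per key; drop keys whose latest value is a tombstone.

    records is a list of (offset, key, value) in offset order; value None is a tombstone.
    Surviving records keep their original offsets.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors)


records = [
    (0, "u1", "a"),
    (1, "u2", "b"),
    (2, "u1", "c"),   # supersedes offset 0
    (3, "u2", None),  # tombstone: deletes u2
    (4, "u3", "d"),
]
compacted = compact(records)  # [(2, "u1", "c"), (4, "u3", "d")]
```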

The log is also the unit of replication. Followers replicate the partition leader's log by issuing fetch requests. The high watermark marks the point up to which all in-sync replicas have confirmed they have written. Consumers only read up to the high watermark, so they never see records that might be lost if a leader failure occurred before full replication.
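A minimal sketch of the high watermark calculation, assuming hypothetical broker IDs and log end offsets: the high watermark is the smallest log end offset among the in-sync replicas, and consumers may read only offsets below it.

```python
def high_watermark(log_end_offsets, isr):
    """Smallest log end offset among the in-sync replicas."""
    return min(log_end_offsets[broker] for broker in isr)


# Broker 3 has fallen behind and is not in the ISR, so it does not hold the watermark back.
log_end_offsets = {1: 120, 2: 118, 3: 90}
hw = high_watermark(log_end_offsets, isr={1, 2})  # 118
```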

Core components of Kafka

| Component | Role |
| --- | --- |
| Broker | Stores and serves partitions; handles producer and consumer connections |
| Topic | Logical category for records; divided into partitions |
| Partition | Ordered, append-only log; unit of parallelism and replication |
| Producer | Writes records to topics; controls partitioning and batching |
| Consumer | Reads records from partitions; tracks position via offsets |
| Consumer group | Set of consumers that share the work of reading a topic |
| KRaft controller | Manages cluster metadata and leader election (replaces ZooKeeper) |
| Schema Registry | Enforces schema contracts for record serialization |
| Kafka Connect | Integrates external systems via source and sink connectors |
| Kafka Streams | Client library for stateful stream processing inside your application |
| ksqlDB | SQL interface for stream processing |
| Kafka UI | Operational visibility layer |

These components divide into three layers: the core data plane (brokers, topics, partitions, producers, consumers), the control plane (KRaft controller), and the ecosystem (Schema Registry, Connect, Streams, ksqlDB, and operational tooling). The diagram below shows how they relate.

Topics and partitions

A topic is a named stream of records. It is the unit of logical organization in Kafka: producers target a topic, consumers subscribe to one or more topics. Physically, a topic is implemented as one or more partitions spread across brokers.

Partition count and parallelism: The number of partitions in a topic determines the maximum read parallelism within a consumer group. A partition can only be assigned to one consumer in a group at a time, so a consumer group with more consumers than partitions will have idle consumers. Partition count is a durable decision: you can increase it after creation, but you cannot decrease it, and increasing it can break ordering guarantees for keyed records (because keys are hashed to partitions, and adding partitions changes the mapping).

A reasonable starting point is one partition per expected peak parallel consumer, with headroom for growth. For high-throughput topics, also account for per-partition memory overhead on brokers and per-partition file descriptor usage.

Partition assignment: When a producer sends a record without a key, Kafka assigns it to partitions using a sticky batching strategy: the producer fills a batch for one partition before moving to the next, which improves compression and throughput. When a record has a key, the producer's built-in partitioning logic hashes the serialized key with murmur2 and takes the result modulo the partition count (older clients expose this as the DefaultPartitioner class). Records with the same key always go to the same partition as long as the partition count does not change, preserving per-key ordering. You can supply a custom Partitioner implementation if you need different routing logic.
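The hash-then-modulo routing, and the reason adding partitions remaps keys, can be demonstrated with a stand-in hash. Note the assumption: this sketch uses CRC32 from Python's standard library purely for illustration, so the partition numbers it produces will not match a real Kafka producer, which uses murmur2. The key names are hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Route a key to a partition by hashing it and taking the result modulo the count.

    Stand-in hash: Kafka uses murmur2, not CRC32.
    """
    return zlib.crc32(key) % num_partitions


keys = [f"user-{i}".encode() for i in range(1000)]

# The same key always maps to the same partition while the count is fixed.
stable = all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# Growing the topic from 6 to 8 partitions sends many keys somewhere new.
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 8))
```

Any key counted in moved would have its new records land on a different partition from its old ones, which is how per-key ordering is lost after a partition increase.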

Compacted versus delete topics: Topics use either delete or compact cleanup policy, or both. Delete topics are appropriate for event streams where age determines relevance: logs, metrics, activity events. Compacted topics are appropriate for changelog semantics where the latest value per key is what matters: materialized state, lookup tables, Kafka Streams state stores.

Producers

The producer lifecycle for a single record moves through four stages: serialization, partitioning, batching, and network send.

Serialization and partitioning: The producer serializes the key and value using configured serializers (byte array, string, Avro, Protobuf, and so on) before anything else. After serialization, the partitioner assigns the record to a partition.

Batching: Records destined for the same partition and broker accumulate in a RecordBatch in the producer's send buffer. Two configurations govern batch behavior:

  • batch.size: the maximum size in bytes of a single batch. The producer sends when the batch is full or when linger.ms elapses.
  • linger.ms: the maximum time the producer waits to fill a batch before sending. At linger.ms=0, the producer sends immediately. Increasing this trades latency for throughput by allowing larger batches.
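The interaction between the two settings can be approximated with a small simulation. This is a deliberately simplified model, not the producer's accumulator: the real producer sends on a background thread and its linger timer fires on its own, while this sketch only checks for expiry when the next record arrives. The plan_batches function and the sample arrival times are hypothetical.

```python
def plan_batches(records, batch_size, linger_ms):
    """Group records into batches under a size cap and a linger window.

    records is a list of (arrival_ms, size_bytes) sorted by arrival time.
    Returns a list of batches, each a list of record indexes.
    """
    batches = []
    current, current_bytes, opened_at = [], 0, None
    for i, (arrival_ms, size) in enumerate(records):
        if current:
            expired = arrival_ms - opened_at >= linger_ms
            full = current_bytes + size > batch_size
            if expired or full:
                batches.append(current)  # send the open batch
                current, current_bytes = [], 0
        if not current:
            opened_at = arrival_ms  # this record opens a new batch
        current.append(i)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


arrivals = [(0, 400), (1, 400), (2, 400), (50, 400)]
with_linger = plan_batches(arrivals, batch_size=1000, linger_ms=10)
no_linger = plan_batches(arrivals, batch_size=1000, linger_ms=0)
```

With a 10 ms linger the first two records share a batch, the third is split off by the size cap, and the fourth arrives after the window has closed. With linger_ms=0 every record is sent on its own.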

Compression: Setting compression.type (gzip, snappy, lz4, zstd) compresses batches before sending. Compression reduces network I/O and broker storage at the cost of CPU. LZ4 and Zstandard offer the best throughput-to-compression-ratio tradeoff for most workloads.

Delivery guarantees and acks: The acks configuration controls how many acknowledgments the producer requires before it considers a write complete:

  • acks=0: no acknowledgment; fire and forget.
  • acks=1: the leader acknowledges after writing to its local log, before followers replicate. A leader failure before replication means data loss.
  • acks=all (or -1): the leader waits for all in-sync replicas to acknowledge. Since Kafka 3.0 this is the default. Combined with min.insync.replicas=2 and replication.factor=3, this provides strong durability guarantees.

Idempotent producers and exactly-once semantics: Setting enable.idempotence=true assigns the producer a persistent producer ID and sequences each record batch. The broker deduplicates retried batches using the producer ID and sequence number, preventing duplicate records during transient network failures. Idempotent producers require acks=all and retries > 0; Kafka enforces these automatically when idempotence is enabled.
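The deduplication idea can be sketched as follows. This is a simplified model under stated assumptions: it tracks one sequence number per batch and remembers only the last one per producer, whereas the broker's real bookkeeping is more involved. The PartitionState class and its return values are hypothetical.

```python
class PartitionState:
    """Toy broker-side state for one partition."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last sequence number appended

    def append(self, producer_id, seq, batch):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # a retry of a batch already written
        if seq != last + 1:
            return "out_of_order"   # a gap: an earlier batch is missing
        self.log.append(batch)
        self.last_seq[producer_id] = seq
        return "appended"


partition = PartitionState()
first = partition.append(producer_id=7, seq=0, batch="b0")
retry = partition.append(producer_id=7, seq=0, batch="b0")  # network retry
gap = partition.append(producer_id=7, seq=2, batch="b2")
```

The retried batch is recognized and not written a second time, so the log contains b0 exactly once.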

For end-to-end exactly-once semantics across multiple topics - including writes paired with consumer offset commits - use the transactional API: call initTransactions(), begin a transaction, produce and commit consumer offsets within the transaction, then commit or abort.

A common misconception is that exactly-once requires max.in.flight.requests.per.connection=1. This was true only when idempotent producers first shipped in Kafka 0.11. Later releases lifted the restriction: idempotent producers safely handle up to five in-flight requests per connection while maintaining ordering and deduplication.

Brokers and the cluster

A Kafka cluster is a set of broker processes, each running on a separate machine or VM. Each broker is responsible for the partitions it hosts: writing records to local disk as producers send them, serving fetch requests from followers and consumers, and enforcing producer quotas, consumer quotas, and topic-level access controls.

Partition leadership: Every partition has a leader and zero or more followers. The leader handles all produce and fetch requests for that partition. Followers issue FetchRequests to the leader, replicating the log independently. When a broker fails, the controller elects a new leader from the in-sync replicas of each partition that broker led.

Controller responsibilities: In a KRaft cluster, a subset of nodes forms the controller quorum. The active controller is the single node responsible for metadata decisions: partition leader elections, ISR updates, broker registrations, topic creation, and ACL changes. The other nodes in the quorum maintain up-to-date replicas of the metadata log and can assume the active role in milliseconds if the active controller fails. The controller does not handle data reads or writes.

Rack awareness: Setting broker.rack on each broker causes the controller to spread partition replicas across distinct failure domains. In a three-AZ deployment with replication factor 3, each AZ gets one replica. A single AZ failure does not take the partition offline.

KRaft: Kafka without ZooKeeper

KRaft (Kafka Raft) is the internal consensus protocol that manages Kafka cluster metadata. It replaced Apache ZooKeeper as the coordination layer, removing a major operational dependency and resolving fundamental scaling limits in the original architecture.

Why ZooKeeper was removed: The ZooKeeper-based architecture stored cluster metadata externally. Whenever partition state changed, the active controller wrote to ZooKeeper, waited for ZooKeeper consensus, received a watch notification, updated its local memory, and then pushed update RPCs to every affected broker. This push-based propagation created two compounding problems. First, it imposed a hard ceiling on cluster scale: as partition counts grew past roughly 200,000, ZooKeeper's metadata handling degraded, with coordination overhead growing substantially at scale. Second, controller failover was expensive: a newly elected controller had to pull and reconstruct its entire metadata cache from ZooKeeper before it could make any decisions, causing prolonged unavailability during failover on large clusters.

What KRaft does instead: KRaft treats cluster metadata as an append-only log stored in an internal topic called __cluster_metadata. The active KRaft controller is the leader of this log. Other controllers in the quorum act as followers, maintaining hot in-memory copies of the metadata. When a controller fails, a standby takes over in milliseconds because it already has the full metadata state in memory - no rebuild from an external store.

Rather than pushing metadata updates to brokers, KRaft uses a pull model. Brokers issue incremental FetchRequests to the active controller, pulling only the changes they have not yet seen. Because metadata changes are consumed from a single ordered log, every broker processes events in the same sequence, ensuring eventual consistency and preventing the metadata divergence that was possible with the legacy push model.

The controller quorum: The KRaft quorum uses a modified Raft consensus protocol (defined in KIP-595). To tolerate f controller failures, you need 2f+1 voters: a 3-node quorum tolerates 1 failure; a 5-node quorum tolerates 2. For production, run an odd number of dedicated controller nodes - not combined broker-controller nodes - to insulate the consensus layer from broker-side I/O pressure and garbage collection pauses. Co-locating controllers with data brokers exposes the control plane to the same resource contention it is trying to isolate.
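The quorum arithmetic is easy to check directly. The helper names below are hypothetical; the formulas are the standard majority-quorum relationships stated above.

```python
def voters_needed(failures_to_tolerate):
    """Voters required to survive the given number of controller failures: 2f + 1."""
    return 2 * failures_to_tolerate + 1


def failures_tolerated(voters):
    """Failures a quorum survives while a majority of voters remains."""
    return (voters - 1) // 2


sizes = {n: failures_tolerated(n) for n in (3, 4, 5)}  # {3: 1, 4: 1, 5: 2}
```

A 4-node quorum tolerates no more failures than a 3-node one, which is why odd sizes are the norm.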

Broker state machine: Each broker in a KRaft cluster transitions through four states:

| State | Description |
| --- | --- |
| Offline | The broker process is stopped or is completing initial startup |
| Fenced | The broker is running but excluded from client RPCs - entered during startup or when it loses contact with the controller |
| Online | The broker is fully operational, registered with the active controller, and authorized to serve traffic |
| Stopping | The broker is in graceful shutdown; the controller migrates its partition leadership before taking it offline |

If a broker loses contact with the controller and cannot send heartbeats, the controller fences it in the metadata log. Other brokers, consuming this event, redirect client traffic away from the fenced broker immediately.

Snapshotting (KIP-630): Because __cluster_metadata is an append-only log, it would grow indefinitely without compaction. KRaft uses a snapshotting mechanism: when enough records have accumulated since the last snapshot, the active controller serializes its in-memory metadata state to a checkpoint file. Followers download snapshots chunk-by-chunk via FetchSnapshot RPCs, validate the CRC, and load the new state. A newly joined or lagging broker can bootstrap quickly by fetching the latest snapshot and replaying only the log entries after the snapshot's end offset.
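The bootstrap path (load a snapshot, then replay only the newer log entries) can be sketched like this. The sketch assumes a simple key-value view of metadata and treats the snapshot end offset as the last offset the snapshot already covers; real metadata records are typed, and the bootstrap function and sample records are hypothetical.

```python
def bootstrap(snapshot, snapshot_end_offset, metadata_log):
    """Rebuild state from a snapshot plus the log entries written after it.

    metadata_log is a list of (offset, key, value); value None removes the key.
    """
    state = dict(snapshot)
    for offset, key, value in metadata_log:
        if offset <= snapshot_end_offset:
            continue  # already reflected in the snapshot
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot = {"topic-a": "leader=1", "topic-b": "leader=2"}
metadata_log = [
    (98, "topic-a", "leader=1"),
    (99, "topic-b", "leader=2"),
    (100, "topic-b", "leader=3"),  # leadership change after the snapshot
    (101, "topic-c", "leader=1"),  # topic created after the snapshot
]
state = bootstrap(snapshot, snapshot_end_offset=99, metadata_log=metadata_log)
```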

Dynamic quorum membership (KIP-853): Previously, the controller quorum was statically configured via controller.quorum.voters. KIP-853 added support for adding or removing controller voters without a cluster restart, using AddVoter and RemoveVoter RPCs. Only one membership change is permitted at a time to prevent split-brain scenarios. This makes controller hardware replacement (after a disk failure, for example) an online operation.

Version timeline:

| Kafka version | Date | Status |
| --- | --- | --- |
| 2.8 | Early 2021 | KRaft introduced as experimental |
| 3.3 | October 2022 | KRaft marked production-ready (KIP-833); ZooKeeper deprecated |
| 3.6 | October 2023 | ZK-to-KRaft live migration promoted to GA |
| 3.9 | November 2024 | Final 3.x release; dynamic KRaft quorums (KIP-853) added |
| 4.0 | March 18, 2025 | ZooKeeper support completely removed |

ZooKeeper mode is not available in Kafka 4.0 and later. If you are running a ZooKeeper-based cluster on 3.x, migration to KRaft is required before upgrading to 4.x. The live migration process (KIP-866) runs in four phases - metadata copy, dual-write hybrid runtime, rolling broker reconfiguration, and finalization - and supports rollback through Phase 3.

Replication and fault tolerance

Kafka replication distributes partition data across multiple brokers to survive broker failures without data loss or service interruption.

In-sync replicas (ISR): Each partition maintains an ISR set: the replicas that are considered current with the leader. A follower stays in the ISR as long as it has fetched from the leader within replica.lag.time.max.ms (default: 30 seconds). If a follower stops fetching or falls behind, the leader removes it from the ISR by sending an AlterPartitionRequest to the controller. The controller commits the shrunk ISR to the metadata log. Crucially, only the persisted ISR matters for leader election: if a leader fails before its latest ISR shrink request reaches the controller, the controller elects from the last committed ISR state.

What happens when a broker fails: In KRaft mode, if a broker misses its heartbeat window with the active controller, the controller fences the broker by appending a FenceBrokerRecord to __cluster_metadata. Brokers consuming the metadata log discover the fence and redirect clients away. The controller then initiates leader elections for all partitions the failed broker led.

The controller selects a new leader from a priority hierarchy:

| Priority | Candidate group | Behavior |
| --- | --- | --- |
| 1 | In-sync replicas (ISR) | Selects the first active, unfenced ISR member. No data loss. |
| 2 | Eligible leader replicas (ELR) | If ISR is empty, selects from the ELR set (KIP-966). |
| 3 | Last known leader | If ELR is also empty, attempts the last known unfenced leader. |
| 4 | Out-of-sync replicas | If `unclean.leader.election.enable=true`, elects any surviving replica. Data loss occurs. |
| 5 | None | If unclean elections are disabled and no clean replica exists, the partition stays offline. |
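The priority order in the table can be expressed as a single function. This is a sketch that follows the table row by row, not the controller's implementation; the elect_leader function, its parameters, and the broker IDs are hypothetical.

```python
def elect_leader(replicas, isr, elr, last_known_leader, live, unclean_allowed):
    """Pick a leader following the priority order; live is the set of active, unfenced brokers.

    Returns (broker_id, source), or (None, "offline") if no candidate qualifies.
    """
    for broker in isr:                      # priority 1: in-sync replicas
        if broker in live:
            return broker, "isr"
    for broker in elr:                      # priority 2: eligible leader replicas
        if broker in live:
            return broker, "elr"
    if last_known_leader in live:           # priority 3: last known leader
        return last_known_leader, "last_known_leader"
    if unclean_allowed:                     # priority 4: any survivor, data loss possible
        for broker in replicas:
            if broker in live:
                return broker, "unclean"
    return None, "offline"                  # priority 5: partition stays offline


clean = elect_leader([1, 2, 3], isr=[1, 2], elr=[], last_known_leader=1,
                     live={2, 3}, unclean_allowed=False)
stuck = elect_leader([1, 2, 3], isr=[], elr=[], last_known_leader=1,
                     live={2}, unclean_allowed=False)
```

In the first call broker 1 is down, so broker 2 is chosen from the ISR. In the second call no clean candidate is alive and unclean election is off, so the partition stays offline.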

Unclean leader election: With unclean.leader.election.enable=false (the default), a partition stays offline rather than elect an out-of-sync replica. This protects data integrity at the cost of availability. With the setting enabled, the out-of-sync replica becomes leader, the ISR is reset to contain only that new leader, and any records acknowledged by the old leader but not replicated to the new leader are permanently lost. The controller marks the partition recovery state as RECOVERING. This setting should remain false for any topic where losing acknowledged writes is unacceptable. If you need to override it, do so at the topic level, not cluster-wide.

The default configuration durability gap: There is a trap worth naming directly. Since Kafka 3.0, acks defaults to all. However, min.insync.replicas still defaults to 1, and default.replication.factor defaults to 1. With min.insync.replicas=1, acks=all offers no more protection than acks=1 whenever the ISR shrinks to the leader alone: the leader acknowledges a write as soon as its own log is written. For example, if both followers of a three-replica partition are temporarily removed from the ISR (due to network issues), the leader accepts writes with acks=all and immediately acknowledges them. If that leader then crashes before the followers rejoin, those acknowledged records are gone.

For any production topic where data loss is unacceptable, set:

replication.factor=3
min.insync.replicas=2

With this configuration and acks=all, a write is only acknowledged after two brokers have it on disk. If two brokers fail simultaneously and the ISR drops to 1, producers receive a NotEnoughReplicasException rather than a silent acknowledgment that could be lost.
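The interplay between acks, the current ISR size, and min.insync.replicas can be captured in a small decision function. This is an illustrative model only; the function name and the NotEnoughReplicas exception class are hypothetical stand-ins for the broker's behavior described above.

```python
class NotEnoughReplicas(Exception):
    """Stand-in for the NotEnoughReplicasException a producer would receive."""


def replicas_required_for_ack(acks, isr_size, min_insync_replicas):
    """Return how many replicas must hold a write before it is acknowledged."""
    if acks == "0":
        return 0
    if acks == "1":
        return 1
    # acks=all: reject the write if the ISR is below the configured floor.
    if isr_size < min_insync_replicas:
        raise NotEnoughReplicas(
            f"ISR size {isr_size} is below min.insync.replicas={min_insync_replicas}"
        )
    return isr_size


# The trap: with the default floor of 1, a leader-only ISR still acknowledges.
trap = replicas_required_for_ack("all", isr_size=1, min_insync_replicas=1)      # 1
healthy = replicas_required_for_ack("all", isr_size=3, min_insync_replicas=2)   # 3
```

Calling the function with isr_size=1 and min_insync_replicas=2 raises instead of acknowledging, which is the behavior the recommended configuration buys.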

Leader epoch and truncation (KIP-101): Kafka uses a leader epoch - a monotonically increasing counter incremented on each leadership change - to prevent log divergence during follower recovery. When a follower restarts, it sends an OffsetForLeaderEpochRequest to the current leader, which responds with the log end offset for the follower's epoch. The follower truncates only to that offset. Before KIP-101, followers truncated to their last local high watermark, which could cause data loss under a double failover scenario where a follower restarted and truncated before a second broker failure.

Consumers and consumer groups

Kafka consumers read records from partitions by subscribing to topics and issuing FetchRequests to the broker that leads each assigned partition.

Offsets and commits: A consumer's position in a partition is its offset. Kafka stores committed offsets in the internal __consumer_offsets topic, keyed by (group.id, topic, partition). When a consumer restarts, it resumes from its last committed offset. Two configurations govern offset behavior:

  • auto.offset.reset: what to do when no committed offset exists. earliest reads from the beginning; latest reads from the current end; none throws an exception.
  • enable.auto.commit: whether the consumer automatically commits offsets on a schedule. Auto-commit is convenient but can commit before processing is complete, causing duplicate processing or data loss on restart. For reliable at-least-once guarantees, set enable.auto.commit=false and commit offsets manually after successful processing.

Consumer groups: A consumer group is a set of consumers identified by a shared group.id. Each partition in a subscribed topic is assigned to exactly one consumer in the group at a time. A group with 10 consumers and a topic with 10 partitions gets one partition per consumer. A group with 10 consumers and 4 partitions has 6 idle consumers: partition count is the ceiling on consumer-level parallelism within a group.
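The parallelism ceiling is visible in even the simplest assignment scheme. The sketch below uses plain round-robin for illustration; it is not one of Kafka's built-in assignors, and the consumer and partition names are hypothetical.

```python
def assign(partitions, consumers):
    """Round-robin partitions across consumers; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


consumers = [f"consumer-{i}" for i in range(10)]
assignment = assign(partitions=[0, 1, 2, 3], consumers=consumers)
idle = [c for c, owned in assignment.items() if not owned]  # 6 consumers own nothing
```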

Rebalancing: When a consumer joins or leaves a group, or when partition count changes, Kafka reassigns partitions across group members. The protocol used for rebalancing determines how disruptive this is:

  • Eager rebalancing (legacy): all consumers stop consuming and release all their partitions. The group coordinator assigns partitions from scratch. This is a stop-the-world pause for the entire group.
  • Cooperative-sticky rebalancing (recommended): consumers retain as many existing partition assignments as possible. Only partitions that need to move change hands, and they do so incrementally. Consumers that keep their partitions continue processing throughout the rebalance.

To enable cooperative-sticky rebalancing, set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor. This is not the default as of Kafka 3.x. Switching strategies in a running group requires a one-time migration: first add CooperativeStickyAssignor alongside the existing strategy, perform a rolling application restart, then remove the old strategy and restart again.

Consumer lag: Consumer lag is the difference between the log end offset of a partition and the consumer group's last committed offset for that partition. Lag represents the amount of unprocessed data. A persistently growing lag means the consumer cannot keep up with the producer and is a leading indicator of downstream problems. Monitoring lag per consumer group and per partition is one of the most operationally important signals in any Kafka-backed system.
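Lag per partition is a simple subtraction, sketched below. The topic name and offsets are hypothetical, and treating a partition with no committed offset as having committed offset 0 is an assumption of this sketch rather than something Kafka defines.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        tp: end - committed_offsets.get(tp, 0)  # assumption: no commit counts as 0
        for tp, end in log_end_offsets.items()
    }


log_end_offsets = {("orders", 0): 1500, ("orders", 1): 1200}
committed_offsets = {("orders", 0): 1500, ("orders", 1): 900}
lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())  # 300, all of it on partition 1
```

Keeping the per-partition breakdown matters: a healthy total can hide one partition that is falling steadily behind.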

Cluster architecture and management

Adding brokers: Adding a broker to a running cluster is non-disruptive, but new brokers are not assigned existing partitions automatically. Run a partition reassignment using kafka-reassign-partitions.sh to generate and submit a reassignment plan. The plan is executed by the controller while the cluster continues serving traffic. Throttle the reassignment using kafka-configs.sh to limit the impact on produce and consume latency during the move.

Rolling upgrades: Kafka supports rolling upgrades without downtime. Upgrade one broker at a time, allowing it to re-join and reach full replication before moving to the next. Keep inter.broker.protocol.version (IBP) at the version of the oldest broker until all brokers are upgraded, then advance the IBP. In KRaft clusters, upgrade controllers before brokers.

Log retention and disk management: Retention is managed per topic via retention.ms, retention.bytes, and cleanup.policy; the broker-level log.retention.ms and log.retention.bytes settings supply the defaults. Monitor per-broker disk usage; uneven partition distribution (common when some partitions are significantly larger than others) can cause a single broker to reach capacity before others. Consider tiered storage (see "What's changing in Kafka architecture") to offload older segments to object storage.

Key JMX metrics:

| Metric | Why it matters |
| --- | --- |
| `UnderReplicatedPartitions` | Non-zero means at least one follower is lagging or offline |
| `OfflinePartitionsCount` | Non-zero means partitions with no available leader; immediate action required |
| `ActiveControllerCount` | Should be exactly 1 cluster-wide |
| `UncleanLeaderElectionsPerSec` | Should be 0; non-zero means data loss is occurring |
| `UnderMinIsrPartitionCount` | Non-zero means writes with `acks=all` are being rejected |
| Consumer lag | Per consumer group, per partition; growing lag indicates a consumer falling behind |

Throughput versus latency tuning: The primary levers are on the producer side (linger.ms and batch.size) and the consumer side (fetch.min.bytes and fetch.max.wait.ms). Increasing linger.ms and batch.size improves throughput by sending larger batches at the cost of higher produce latency. Increasing fetch.min.bytes and fetch.max.wait.ms reduces the number of fetch requests at the cost of higher read latency. Neither set of parameters should be tuned in isolation from the other side of the pipeline.

Security

Kafka's security model covers authentication, authorization, and encryption. All three require explicit configuration; Kafka ships with no authentication enabled by default.

Authentication (SASL): Kafka supports four SASL mechanisms:

  • PLAIN: Username and password transmitted in plaintext. Acceptable only when TLS is also enabled to encrypt the connection. Simple to configure; not suitable for environments where credentials might be intercepted at the network layer.
  • SCRAM-SHA-256 / SCRAM-SHA-512: Salted challenge-response. Credentials are stored as salted hashes and managed via the Kafka admin API (in KRaft clusters). Well-suited for most deployments without existing Kerberos infrastructure.
  • GSSAPI (Kerberos): Integrates with Active Directory or MIT Kerberos. Required in environments with existing Kerberos infrastructure. Operationally complex to configure correctly.
  • OAUTHBEARER: Delegates authentication to an OAuth 2.0 identity provider. Well-suited for cloud-native environments where OIDC is already in use. Requires a custom or third-party token validator implementation.

For new deployments without existing Kerberos infrastructure, SCRAM-SHA-512 over TLS is the most practical starting point. For cloud-native environments with identity provider integration, OAUTHBEARER is increasingly common.

Authorization (ACLs): Kafka ACLs grant or deny operations (Read, Write, Create, Delete, Describe, and others) on resources (topics, consumer groups, clusters) to principals (users, service accounts). ACLs are managed via kafka-acls.sh or the admin API. In KRaft clusters, use StandardAuthorizer rather than the legacy AclAuthorizer, which required ZooKeeper.

Encryption in transit: Configure TLS on broker listeners by setting a keystore and truststore. All inter-broker communication should also use TLS (security.inter.broker.protocol=SSL). mTLS (mutual TLS) is used in environments that want the broker to authenticate clients via client certificates, rather than relying on SASL.

Network isolation: Kafka brokers should not be exposed to the public internet. Place them in private subnets, restrict inbound access to known application IP ranges via security groups or firewall rules, and use separate listener configurations if you need to expose Kafka to both internal and external clients with different security settings (listeners and advertised.listeners).

Schemas and schema registry

Without schema management, a Kafka topic is a sequence of untyped bytes. Any producer can write any format; any consumer must be written to match. This implicit contract fails silently: a producer change that modifies a field type or removes a required field can break consumers without warning, and the break is discovered at deserialize time, not at produce time.

A schema registry solves this by storing schemas centrally and embedding a schema ID in each record. Before a producer serializes a record, it registers the schema (if not already registered) and encodes the schema ID as a prefix in the record value, using the Confluent wire format: a magic byte, a 4-byte schema ID, then the serialized payload. Consumers look up the schema by ID before deserializing. Incompatible schema changes are rejected at produce time.
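The framing described above (a magic byte, a 4-byte schema ID, then the payload) can be encoded and decoded with Python's struct module. This is a minimal sketch of the framing only: the schema ID is big-endian, the payload is treated as opaque bytes, and the function names and sample schema ID are hypothetical.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic, 4-byte big-endian schema ID


def encode(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the magic byte and schema ID."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def decode(message: bytes):
    """Split a framed message into (schema_id, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message too short for the 5-byte header")
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


framed = encode(42, b"serialized-record")
schema_id, payload = decode(framed)
```

A consumer would use the recovered schema_id to fetch the writer's schema from the registry before deserializing the payload.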

Supported formats: All major schema registries support Avro, Protobuf, and JSON Schema. Avro is the most widely deployed in Kafka ecosystems; Protobuf is common in organizations already using gRPC; JSON Schema is convenient but produces larger payloads and is less strict about types.

Compatibility modes: The registry enforces schema evolution rules per-subject:

  • BACKWARD: New schema can read data written by the previous schema. Safe for consumers to upgrade before producers.
  • FORWARD: Previous schema can read data written by the new schema. Safe for producers to upgrade before consumers.
  • FULL: Both backward and forward compatible. The safest setting for most production topics.
  • NONE: No compatibility checking.

Registry implementations:

  • Confluent Schema Registry: The most widely deployed implementation. Defines the wire format that most Kafka serializers expect. Available as open source under a mix of Apache 2.0 and Confluent Community License terms, or as a managed service in Confluent Cloud.
  • Apicurio Registry: Fully Apache 2.0 licensed. Supports Avro, Protobuf, and JSON Schema. Compatible with the Confluent wire format, making it a drop-in alternative for clients using standard Confluent serializers. Maintained by Red Hat.
  • AWS Glue Schema Registry: Managed service integrated with MSK and other AWS services. Supports Avro and JSON Schema; Protobuf support was added more recently. Uses a different wire format from Confluent, requiring Glue-specific serializers.

For self-managed clusters or environments where license terms matter, Apicurio is the best-maintained open-source alternative to Confluent Schema Registry.

Kafka Connect

Kafka Connect is a framework for building and running connectors that move data between Kafka and external systems. Rather than writing custom producer or consumer applications for each system integration, you configure a connector with a set of properties, and Connect handles serialization, offset tracking, error handling, and parallelism.

Source connectors read from an external system and write to a Kafka topic. A PostgreSQL CDC source connector using Debezium, for example, tails the write-ahead log and produces records to Kafka as each committed row change occurs. Sink connectors read from a Kafka topic and write to an external system: an S3 sink connector writes records to S3, partitioned by date or hour.

Deployment modes:

  • Standalone mode: A single Connect worker process. No fault tolerance. Suitable for development or simple single-node pipelines.
  • Distributed mode: Multiple Connect workers form a group. Tasks are distributed across the group; if a worker fails, tasks are reassigned to surviving workers. This is the production deployment model.

Exactly-once semantics in Connect: Connect supports exactly-once delivery for source connectors in distributed mode (since Kafka 3.3, via KIP-618), using the transactional producer API combined with offset storage in Kafka. The connector's source offsets and the produced records are written in the same transaction, so a crash mid-write is recovered cleanly on restart. Not all connectors support this: the connector must manage offsets through the Connect framework rather than externally. Exactly-once for sink connectors depends on the target system: it must either participate in transactions or accept idempotent writes.
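As a rough sketch of what opting in looks like, the settings below use the property names defined by KIP-618: exactly.once.source.support on the worker, and exactly.once.support and transaction.boundary on the connector. The connector name and class are hypothetical placeholders, and the check function is only an illustration, not part of Connect.

```python
# Worker-level setting (distributed mode), applied to every worker in the Connect cluster.
worker_overrides = {
    "exactly.once.source.support": "enabled",
}

# Connector-level settings. The name and class are hypothetical placeholders.
source_connector = {
    "name": "orders-source",  # hypothetical
    "config": {
        "connector.class": "com.example.OrdersSourceConnector",  # hypothetical
        "tasks.max": "1",
        # Ask Connect to refuse to start the connector if it cannot provide exactly-once.
        "exactly.once.support": "required",
        # Commit one transaction per batch of records returned by a task's poll().
        "transaction.boundary": "poll",
    },
}


def strict_exactly_once_gaps(worker, connector):
    """Return the gaps between the given settings and the strict setup sketched above."""
    gaps = []
    if worker.get("exactly.once.source.support") != "enabled":
        gaps.append("worker: exactly.once.source.support is not 'enabled'")
    if connector["config"].get("exactly.once.support") != "required":
        gaps.append("connector: exactly.once.support is not 'required', so no preflight check is enforced")
    return gaps


print(strict_exactly_once_gaps(worker_overrides, source_connector))
```

Setting exactly.once.support to required is the conservative choice here: a connector that cannot honor the guarantee fails at startup rather than silently falling back to at-least-once delivery.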

Connector ecosystem: Confluent Hub hosts several hundred connectors. Debezium provides production-grade CDC connectors for PostgreSQL, MySQL, SQL Server, MongoDB, and others, and is widely used for event sourcing and data replication patterns. Most major cloud data stores (S3, GCS, BigQuery, Redshift, Snowflake, Elasticsearch) have maintained sink connectors available.

A concrete example of a common Connect pipeline: a Debezium PostgreSQL source connector captures row changes from a transactional database and writes them to Kafka topics (one topic per table), from where an S3 sink connector archives them to object storage. Both ends are configured and managed in Connect; no custom producer or consumer code is written.
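A minimal sketch of the two connector definitions for this pipeline, written as the JSON bodies that would be sent to the Connect REST API (POST /connectors). The property names come from the Debezium PostgreSQL connector and the Confluent S3 sink connector; the hostname, credentials, database, table, and bucket names are all hypothetical, and a real deployment would supply the password through a config provider rather than inline.

```python
import json
import re

# Debezium PostgreSQL source: one Kafka topic per captured table.
postgres_source = {
    "name": "shop-postgres-source",  # hypothetical
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres.internal.example",  # hypothetical
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<secret>",  # placeholder; use a config provider in practice
        "database.dbname": "shop",
        "topic.prefix": "shop",
        "table.include.list": "public.orders,public.customers",
    },
}

# S3 sink: archives every topic produced by the source, partitioned by hour.
s3_sink = {
    "name": "shop-s3-archive",  # hypothetical
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics.regex": r"shop\.public\..*",
        "s3.bucket.name": "example-kafka-archive",  # hypothetical
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "3600000",
        "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
        "locale": "en-US",
        "timezone": "UTC",
    },
}


def debezium_topics(config):
    """Topic names Debezium derives as <topic.prefix>.<schema>.<table>."""
    prefix = config["topic.prefix"]
    tables = config["table.include.list"].split(",")
    return [f"{prefix}.{table.strip()}" for table in tables]


# Check that the sink's topic pattern covers everything the source will produce.
topics = debezium_topics(postgres_source["config"])
sink_pattern = re.compile(s3_sink["config"]["topics.regex"])
unmatched = [t for t in topics if not sink_pattern.fullmatch(t)]
print("source topics:", topics)
print("topics the sink would miss:", unmatched)

# Request bodies for POST /connectors on the Connect REST API.
for connector in (postgres_source, s3_sink):
    print(json.dumps(connector, indent=2))
```

Using topics.regex on the sink rather than a fixed topic list means newly captured tables are archived without touching the sink configuration, at the cost of less explicit control over what lands in the bucket.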

Kafka Streams

Kafka Streams is a client library for building stateful stream processing applications that read from and write to Kafka. Unlike Apache Flink or Spark Streaming, it runs inside your application process: there is no separate cluster to deploy, no cluster manager to operate, and no external state store required unless you choose to add one.

Core abstractions:

  • KStream: An unbounded stream of records. Each record is an independent event.
  • KTable: A changelog stream interpreted as a materialized table. Each record represents the latest value for a key. Supports joins and aggregations.
  • GlobalKTable: A KTable replicated in full to every application instance. Used for broadcast joins where one side is small relative to the other.
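The difference between the stream and table views can be illustrated without Kafka at all. The sketch below is plain Python rather than the Kafka Streams API (which is Java/Scala): it interprets the same sequence of keyed records first as independent events and then as a changelog, treating a None value as a delete (a tombstone).

```python
# The same keyed records, read two ways. A value of None stands in for a tombstone.
records = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None), ("carol", 7)]


def as_stream(records):
    """KStream-style view: every record is an independent event, so all are kept."""
    return list(records)


def as_table(records):
    """KTable-style view: each record upserts its key; a None value deletes the key."""
    table = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


print(len(as_stream(records)), "events in the stream view")
print(as_table(records), "is the table view")
```

The stream view keeps all five events, while the table view collapses them to the latest value per surviving key.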

State stores: Aggregations and joins require state. Kafka Streams maintains state in local RocksDB stores, backed by changelog topics in Kafka. On restart, an application instance restores its state by replaying the changelog topic. State stores can also be queried externally via the interactive queries API, allowing the application to expose its materialized state as a queryable service.
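The restore-by-replay mechanism can be sketched as a toy model. This is not the Kafka Streams API: a Python list stands in for the changelog topic, a dict stands in for the local RocksDB store, and the class and method names are hypothetical.

```python
class CountingStore:
    """Toy stand-in for a local state store backed by a changelog topic."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for the Kafka changelog topic
        self.state = {}             # dict standing in for the local RocksDB store

    def increment(self, key):
        """Update local state and append the new value to the changelog."""
        self.state[key] = self.state.get(key, 0) + 1
        self.changelog.append((key, self.state[key]))

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog; the last write per key wins."""
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store

    def get(self, key):
        """Point lookup, loosely analogous to an interactive query."""
        return self.state.get(key)


changelog = []
store = CountingStore(changelog)
for page in ["home", "cart", "home"]:
    store.increment(page)

# Simulate losing the instance: only the changelog survives.
del store
restored = CountingStore.restore(changelog)
print(restored.get("home"), restored.get("cart"))
```

Because the changelog records the value after each update rather than the update itself, replay is idempotent: restoring twice yields the same state.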

When to use Kafka Streams versus alternatives:

  • Kafka Streams is the right choice when your processing logic is sophisticated and tied to Kafka topics, and when running a separate processing cluster adds more cost and complexity than it is worth. It handles stateful operations (joins, aggregations, windowing) well, and deploying it as a jar alongside your existing application is a significant operational simplification.
  • Apache Flink is preferable when you need to join Kafka streams with non-Kafka sources, when batch and streaming workloads need to share processing logic, or when the scale of state exceeds what RocksDB on a single JVM can comfortably handle.
  • ksqlDB provides a SQL interface over Kafka Streams. See the next section.

ksqlDB

ksqlDB is a SQL-based stream processing system built on top of Kafka Streams. It lets you define streaming transformations, aggregations, and joins using SQL syntax, without writing application code. It supports two query types: push queries, which return a continuous stream of results and are suited to streaming dashboards and alerting, and pull queries, which return point-in-time lookups against materialized state.

Relationship to Kafka Streams: ksqlDB compiles SQL statements into Kafka Streams topologies and runs them on ksqlDB server nodes. The state store model, changelog topics, and processing guarantees are the same as Kafka Streams; ksqlDB adds a query language and a server process.

Development activity: Contributor activity on the ksqlDB GitHub repository (github.com/confluentinc/ksql) declined materially from 2022 onward. Confluent acquired Immerok in early 2023 to accelerate a cloud-native Apache Flink offering, now marketed as Confluent Cloud for Apache Flink. Confluent's product positioning has shifted decisively toward Flink; ksqlDB receives minimal prominence in current Confluent documentation and engineering content. If you are evaluating ksqlDB for a new deployment, the reduced development momentum is a concrete factor in the decision, not an editorial judgement.

Licensing: ksqlDB is distributed under the Confluent Community License, not Apache 2.0. The Community License prohibits using the software to build a competing SaaS product. For internal deployments this is rarely a practical concern, but it is worth understanding before committing.

Choosing a stream processing layer:

Option | What it is | When to choose it
Kafka Streams | Java/Scala client library embedded in your application | Sophisticated stateful processing on Kafka data; no separate cluster needed
ksqlDB | SQL interface over Kafka Streams; runs as a separate server | Lower barrier for SQL-fluent teams; simpler, lower-throughput use cases
Apache Flink | General-purpose stream and batch processor; separate cluster | Mixed Kafka and non-Kafka sources; large-scale state; batch and streaming in one system

Kafka architecture in practice

The following companies use Kafka as a core part of their production infrastructure.

Company | How they use Kafka
Apple | Runs Kafka across dozens of data centres to move billions of events per day through internal analytics and ML pipelines.
Afterpay | Uses Kafka as the central event bus for payment transaction processing and real-time fraud detection across its buy-now-pay-later platform.
Cash App | Streams financial transaction events through Kafka to power real-time payment processing, ledger updates, and fraud signals.
Adidas | Uses Kafka to unify e-commerce order events, inventory updates, and customer activity data across its global digital platform.
Notion | Propagates document change events through Kafka to keep real-time collaboration, search indexing, and audit logs in sync.
Bytedance | Processes trillions of user interaction events per day through Kafka to feed the recommendation engines behind TikTok and Douyin.
Datadog | Ingests metrics, logs, and distributed traces at massive scale through Kafka before routing them to purpose-built storage backends.
Walmart | Streams real-time inventory, order, and supply chain events through Kafka to coordinate fulfilment across thousands of stores and warehouses.
PayPal | Uses Kafka to process billions of financial events per day, connecting payment processing, fraud detection, and risk scoring pipelines.
Tencent | Runs one of the largest Kafka deployments in the world to handle messaging, activity data, and analytics across WeChat and its gaming platforms.
The New York Times | Uses Kafka to decouple its content publishing pipeline, streaming article events from the CMS through to personalisation, search, and reader analytics.
Grab | Moves real-time ride-hailing and food delivery events through Kafka to support driver-passenger matching, dynamic pricing, and operational dashboards.
Wix | Uses Kafka as the event backbone for its website-building platform, streaming user and site activity data into analytics and product features.
DoorDash | Streams order lifecycle events through Kafka to coordinate real-time delivery tracking, driver dispatch, and merchant notifications.
Goldman Sachs | Uses Kafka to distribute market data and trade events across trading systems, risk engines, and regulatory reporting pipelines.
JPMorgan | Processes financial transaction and market data events through Kafka to support real-time risk management and internal data platform services.
New Relic | Uses Kafka as the ingestion backbone for its observability platform, buffering and routing telemetry data before it reaches downstream storage.
PagerDuty | Routes alert and incident events through Kafka to decouple ingestion from routing logic and ensure reliable delivery at high ingest volumes.
Robinhood | Streams trading, market data, and account events through Kafka to power real-time order execution, position updates, and risk controls.
Salesforce | Uses Kafka to propagate CRM change events across its platform, feeding real-time automation, Einstein AI features, and customer-facing integrations.
Pinterest | Moves user activity and engagement events through Kafka to train recommendation models and update real-time content ranking signals.
Reddit | Streams user activity, voting, and moderation events through Kafka to power feed ranking, spam detection, and site-wide analytics.
Spotify | Processes hundreds of billions of user listening events per day through Kafka to feed personalisation, recommendations, and the Discover Weekly pipeline.
Shopify | Uses Kafka to stream order, merchant, and storefront events across its platform, decoupling the checkout flow from downstream fulfilment and analytics.
Netflix | Runs Kafka as the central nervous system for its data pipeline, handling viewing events, operational metrics, and chaos engineering signals at global scale.
Barclays | Uses Kafka to stream banking transaction events and market data across trading, risk, and regulatory reporting systems.
Airbnb | Streams booking, search, and host activity events through Kafka to support real-time pricing, fraud detection, and data warehouse ingestion.
LinkedIn | Created Kafka to solve its own activity stream and operational data pipeline problem; now uses it at trillion-message-per-day scale across all major platform services.
Cloudflare | Processes network security events and DNS query logs at internet scale through Kafka, routing data into threat detection and analytics pipelines.
Uber | Uses Kafka to stream trip, financial, and marketplace events across its global platform, supporting surge pricing, driver-partner payments, and real-time analytics.

Kafka architecture best practices

  1. Set replication factor to 3 and min.insync.replicas to 2. The broker default, default.replication.factor=1, is a single point of failure. For any production topic, replication.factor=3 with min.insync.replicas=2 and producers using acks=all ensures that a write is only acknowledged after it is on at least two brokers, and that the cluster can survive a single broker failure without data loss or partition unavailability.
  2. Keep unclean.leader.election.enable=false. This is the default. For any topic where losing acknowledged writes is unacceptable, never override it cluster-wide. If you need to override it for a specific topic for availability reasons, do so at the topic level and document the trade-off explicitly.
  3. Enable idempotent producers. Set enable.idempotence=true (the default since Kafka 3.0). For exactly-once delivery across multiple topics, use the transactional API. The advice to set max.in.flight.requests.per.connection=1 for idempotent producers is outdated: since Kafka 1.0, idempotent producers preserve ordering with up to five in-flight requests per connection.
  4. Size partition count based on expected peak consumer parallelism, not just throughput. Consumer parallelism within a group is bounded by partition count. Under-partitioned topics create bottlenecks that cannot be resolved without increasing partition count (which risks breaking per-key ordering guarantees for keyed records).
  5. Use cooperative-sticky rebalancing. Set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor on all consumer groups. Eager rebalancing stops all consumers during reassignment; cooperative-sticky is incremental and significantly reduces the disruption of consumer group membership changes.
  6. Commit consumer offsets after processing, not before. Set enable.auto.commit=false and commit offsets manually after records are fully processed. Auto-commit can commit an offset before processing is complete, so records that were fetched but not yet processed are skipped after a consumer restart.
  7. Set rack-aware replica placement. Configure broker.rack on each broker. Without this, multiple replicas of the same partition can end up on brokers in the same availability zone or physical rack, negating the fault tolerance that replication provides.
  8. Monitor the critical JMX metrics. Alert when OfflinePartitionsCount is above 0, UnderReplicatedPartitions is above 0 for more than a few minutes (allow for normal fluctuation during rolling restarts), UncleanLeaderElectionsPerSec is above 0, and consumer lag grows unboundedly.
  9. Enforce schema compatibility. Even if you control both producers and consumers today, schemas drift over time. Registering schemas and enforcing BACKWARD or FULL compatibility at the registry level catches breaking changes before they reach production consumers.
  10. Use a Kafka UI for operational visibility. CLI tooling is sufficient for debugging in isolation, but for running Kafka in production you need a way to observe broker health, consumer lag, topic throughput, and cluster configuration without SSH access. A management interface is a standard part of a well-operated Kafka cluster.
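Several of the settings above can be collected into a baseline, sketched here as plain Python dicts keyed by the real Kafka property names. The two helper functions are hypothetical and only illustrate the arithmetic behind practice 1: with acks=all, the leader rejects writes once the in-sync replica set is smaller than min.insync.replicas.

```python
# Replication factor is chosen when the topic is created; the rest are topic-level configs.
replication_factor = 3
topic_config = {
    "min.insync.replicas": "2",
    "unclean.leader.election.enable": "false",
}

producer_config = {
    "acks": "all",
    "enable.idempotence": "true",
}

consumer_config = {
    "enable.auto.commit": "false",
    "partition.assignment.strategy": "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}


def write_accepted(in_sync_replicas, min_insync_replicas):
    """With acks=all, a write succeeds only if the ISR is at least min.insync.replicas."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


min_isr = int(topic_config["min.insync.replicas"])
print("failures tolerated:", tolerated_broker_failures(replication_factor, min_isr))
print("write accepted with 2 in-sync replicas:", write_accepted(2, min_isr))
print("write accepted with 1 in-sync replica:", write_accepted(1, min_isr))
```

With a replication factor of 3 and min.insync.replicas of 2, losing one broker leaves writes flowing; losing a second makes the partition read-only for acks=all producers rather than risking an unreplicated write.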

Kafka management tools

If you are running Kafka in production, Kpow by Factor House is a Kafka UI built for the operational practices described in this article. It gives you real-time visibility into consumer lag, broker health, topic throughput, and cluster configuration in one place - the signals you need when something goes wrong.

Deployment options

How you run Kafka shapes your operational responsibilities, scaling limits, and cost profile.

Self-managed on VMs or bare metal: Full control over Kafka version, configuration, and hardware. The highest operational overhead: you are responsible for provisioning, upgrades, monitoring, security hardening, and failure recovery. The right choice for organizations with existing infrastructure teams, specific hardware requirements, or compliance constraints that prevent using managed services.

Kubernetes (Strimzi or Confluent Operator): Container-native deployment. Strimzi is open source (Apache 2.0) and provides Kubernetes Operators for Kafka clusters, Connect, MirrorMaker, and Bridge. Confluent Operator (since renamed Confluent for Kubernetes) is a commercial product. Both abstract Kubernetes complexity for Kafka operations and are a reasonable choice for teams already running Kubernetes who want to standardize deployment tooling. The main challenge is that Kafka is a stateful system: getting persistent volume sizing, storage class selection, and pod anti-affinity right requires Kafka-specific knowledge on top of Kubernetes knowledge.

Confluent Cloud: Fully managed Kafka with proprietary add-on services (Confluent Schema Registry, hosted Connect connectors, Confluent Cloud for Apache Flink). Significantly reduces operational overhead. Higher cost per throughput unit than self-managed options. Proprietary extensions create vendor dependency. Available on AWS, Azure, and GCP.

Amazon MSK: AWS-native managed Kafka. Integrates with IAM for authentication and authorization, VPC for network isolation, and other AWS services. Kafka version support lags the Apache release schedule by several months; check the current MSK documentation for the latest supported version before planning an upgrade. MSK does not support all Kafka configurations and restricts some broker-level plugin use (relevant for some Connect connectors).

Other managed options:

  • Aiven for Apache Kafka: Multi-cloud managed service across AWS, GCP, Azure, and DigitalOcean. Tracks the Apache Kafka release schedule closely. A good option when cloud portability matters.
  • Redpanda Cloud: Redpanda is Kafka API-compatible but is a different broker implementation (written in C++, not Java). Compatible with most Kafka clients and tools. There are known API gaps relative to Apache Kafka; evaluate compatibility against your specific use case - particularly around transactions, exactly-once semantics, and any Kafka Streams or Connect usage - before committing.

Option | Control | Operational overhead | Cost | Vendor lock-in
Self-managed (VMs/bare metal) | Full | High | Low to medium | None
Kubernetes (Strimzi) | Full | Medium-high | Low to medium | None
Confluent Cloud | Limited | Low | High | High
Amazon MSK | Limited | Low | Medium | Medium (AWS)
Aiven | Limited | Low | Medium | Low
Redpanda Cloud | Limited | Low | Medium | Medium

What's changing in Kafka architecture

ZooKeeper removal (complete): ZooKeeper support was removed in Kafka 4.0, released March 18, 2025. All clusters running Kafka 4.x operate on KRaft exclusively. If you are on Kafka 3.x with ZooKeeper, migration to KRaft is required before upgrading to 4.x.

Tiered storage (KIP-405): Tiered storage allows Kafka to offload older log segments to object storage (S3, GCS, Azure Blob Storage) while keeping recent segments on local broker disk. This decouples data retention from broker disk capacity: you can retain data for months without sizing broker storage to match. Tiered storage shipped as early access in Kafka 3.6 and was declared production-ready in Kafka 3.9 for self-managed clusters, implemented via a pluggable RemoteStorageManager interface. Confluent Cloud has its own implementation; MSK added tiered storage support separately. Check current documentation for feature flags and known limitations in your deployment environment.
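To make the decoupling concrete, here is a back-of-the-envelope sketch. The property names (remote.log.storage.system.enable on the broker; remote.storage.enable, retention.ms, and local.retention.ms on the topic) are Kafka's own; the ingest rate, retention periods, and sizing helper are illustrative assumptions. The estimate ignores compression, the active segment, and index overhead, and the RemoteStorageManager plugin settings are omitted.

```python
DAY_MS = 24 * 60 * 60 * 1000
HOUR_MS = 60 * 60 * 1000

# Broker-level switch (the RemoteStorageManager plugin settings are omitted here).
broker_config = {"remote.log.storage.system.enable": "true"}

# Topic-level settings: keep 30 days in total, but only 6 hours on local broker disk.
topic_config = {
    "remote.storage.enable": "true",
    "retention.ms": str(30 * DAY_MS),
    "local.retention.ms": str(6 * HOUR_MS),
}


def local_disk_gb(ingest_mb_per_sec, retention_ms, replication_factor):
    """Approximate cluster-wide broker disk for one topic, in GB (1 GB = 1000 MB)."""
    seconds = retention_ms / 1000
    return ingest_mb_per_sec * seconds * replication_factor / 1000


# Assumed workload: 10 MB/s of ingest, replication factor 3.
without_tiering = local_disk_gb(10, int(topic_config["retention.ms"]), 3)
with_tiering = local_disk_gb(10, int(topic_config["local.retention.ms"]), 3)
print(f"broker disk without tiering: {without_tiering:,.0f} GB")
print(f"broker disk with tiering:    {with_tiering:,.0f} GB")
```

Under these assumptions the broker disk requirement depends only on local.retention.ms, so extending total retention changes the object storage bill rather than the broker sizing.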

Queue semantics (KIP-932): Kafka's current partition assignment model gives each partition to exactly one consumer in a group, which preserves per-partition ordering but caps the number of active consumers at the partition count. KIP-932 introduces a "share group" model where multiple consumers can compete for records within the same partition, enabling queue-like behavior for workloads where ordering is not required and maximum consumer throughput matters. Share groups first shipped as early access in Kafka 4.0, with a preview following in Kafka 4.1; later releases may have changed that status. Check the Apache Kafka KIP wiki for current status before building on it.
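The contrast between the two models can be sketched with a toy simulation. This is a deliberate simplification and not the share group protocol: real share groups involve record acquisition locks, acknowledgements, and redelivery, none of which are modelled here, and the round-robin hand-out is an assumption made for illustration.

```python
from collections import defaultdict


def classic_assignment(partitions, consumers):
    """Consumer-group model: each partition is owned by exactly one consumer."""
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}


def share_delivery(records, consumers):
    """Share-group-style model (simplified): records from any partition are handed
    to consumers in turn, regardless of which partition they came from."""
    delivered = defaultdict(list)
    for i, record in enumerate(records):
        delivered[consumers[i % len(consumers)]].append(record)
    return dict(delivered)


partitions = [0, 1]
consumers = ["c1", "c2", "c3", "c4"]
# Eight records as (partition, offset) pairs, four per partition.
records = [(p, offset) for offset in range(4) for p in partitions]

owners = classic_assignment(partitions, consumers)
busy_classic = set(owners.values())
shared = share_delivery(records, consumers)

print("classic group, consumers doing work:", sorted(busy_classic))
print("share-style, records per consumer:", {c: len(r) for c, r in shared.items()})
```

With two partitions and four consumers, the classic model leaves two consumers idle, while the share-style model spreads the eight records across all four at the cost of per-partition ordering.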

Eligible leader replicas (KIP-966): Previously, if the ISR for a partition became empty, the only options were waiting for a replica to recover or enabling unclean elections. KIP-966 introduced the Eligible Leader Replica (ELR) set: replicas not currently in the ISR but with a sufficient log offset to be safe candidates for election. This reduces partition unavailability in scenarios where the ISR shrinks to zero due to rolling restarts or multi-broker failure, without the data loss risk of unclean elections.

For a complete list of active proposals affecting Kafka's core architecture, the Apache Kafka KIP wiki is the authoritative reference.