Kafka cluster management: A practical guide for engineers

Factor House
May 9th, 2026

Key takeaways

  • The decisions you make at cluster creation time (partition counts, replication factor, broker sizing, and KRaft vs ZooKeeper) determine whether you scale cleanly or accumulate operational debt that is difficult to unwind.
  • Apache Kafka 4.0 (released March 2025) removed ZooKeeper entirely. KRaft is now the only supported mode. If you are still running ZooKeeper-based Kafka 3.x, your migration path is mandatory.
  • The most impactful monitoring signals are under-replicated partitions (URPs), ActiveControllerCount, OfflinePartitionsCount, and consumer lag trend. Alert on these; log everything else.
  • Producer compression combined with sensible batching is the highest-leverage performance tuning available without architectural changes.
  • Tools like Kpow can provide visibility into partition health and URP status, and allow you to manage partition reassignments and leader elections without dropping to the CLI.

Understanding Kafka at a conceptual level is one thing. Operating it in production is something else entirely. At a handful of topics and a few brokers, the configuration feels manageable. Once you are running dozens of topics across multiple teams, with different retention requirements, latency SLOs, and replication topologies, the operational surface area expands considerably. Partition count decisions become difficult to reverse. Monitoring gaps surface during incidents rather than during planning. Rebalancing a cluster without causing ISR shrinkage requires care. None of this is obvious from the documentation alone.

This guide covers the practical decisions and operational patterns that matter most when running Kafka clusters at scale: how to size and structure clusters from day one, what to watch for in production, how to tune producers, consumers, and brokers, and how the shift to KRaft has changed the management picture.

What is Kafka cluster management?

Kafka cluster management covers the end-to-end operational work of keeping one or more Kafka clusters running correctly: provisioning and sizing brokers, choosing partition strategies and replication settings, managing consumer groups and offsets, handling rolling operations and partition reassignments, monitoring for cluster health, tuning configuration for performance, enforcing access control, and planning for the architectural shifts like the ZooKeeper-to-KRaft migration.

It is not a single task but a continuous discipline. The configuration choices you make at topic creation time constrain what you can do later. The monitoring infrastructure you build before incidents determines how quickly you can diagnose them. And the operational practices you establish early, things like how you handle rolling restarts, how you reassign partitions, and how you gate cluster mutations, determine whether your cluster remains manageable as it grows.

How architecture and sizing decisions affect cluster management

Start with three brokers across three availability zones

The baseline production configuration is three brokers, one per availability zone, with replication.factor=3 and min.insync.replicas=2. This is the smallest topology that survives a single broker failure while still allowing producers with acks=all to make progress. With RF=2, taking one broker down for maintenance leaves zero failure tolerance, which is not a reasonable production posture. Note that increasing the replication factor on a live topic adds meaningful network and disk pressure. Get it right at topic creation.
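
As a sketch, a topic created with these baseline settings (the topic name and partition count are placeholders; size partitions per the next section):

# Create a topic with the baseline durability settings
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic payments.events \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2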

For storage, the choice is straightforward for latency-sensitive workloads: NVMe/SSD or fast provisioned EBS (gp3 or io2 on AWS). HDD is acceptable only for cold archive workloads where tiered storage is in use. A single historical consumer reading from disk on a shared HDD broker can cause a 43% drop in producer throughput, based on Stanislav Kozlovski's analysis of KIP-405 storage trade-offs. The memory guidance is consistent across most large operators: 4-8 GB JVM heap, with the remainder of RAM left to the OS page cache.

Partition count is the most consequential and least reversible decision

For keyed topics, you cannot increase the partition count later without breaking hash(key) % num_partitions ordering. Confluent's official guidance states up to approximately 4,000 partitions per broker and 200,000 per cluster on ZooKeeper-based deployments. On KRaft, lab tests demonstrated stable operation at 2 million partitions per cluster, though real-world production clusters typically run in the hundreds of thousands.

For sizing, Confluent's formula is a reasonable starting point: minimum partitions = max(target_throughput / per_partition_producer_throughput, target_throughput / per_partition_consumer_throughput). On modern hardware with LZ4 compression and acks=all, single-partition producer throughput is typically 10-50 MB/s; consumer throughput is 50-100+ MB/s. Add 2-3x headroom for hot-key skew.
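
A worked example, with assumed throughput numbers:

# Assumed: 300 MB/s target, 25 MB/s per-partition producer, 60 MB/s per-partition consumer
target=300; prod=25; cons=60
p=$(( target / prod )); c=$(( target / cons ))
echo $(( p > c ? p : c ))   # max(12, 5) = 12 partitions minimum; 2-3x headroom gives 24-36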

For small clusters under six brokers, starting at approximately 3x the broker count in partitions is a reasonable rule. For larger clusters above twelve brokers, 2x the broker count tends to work. Do not under-partition either: a single-partition topic caps consumption at one consumer instance per group, eliminating topic-level parallelism.

Cluster isolation by workload type

LinkedIn, Netflix, Pinterest, and Cloudflare all operate many smaller clusters rather than a single large one. Netflix explicitly maintains a rule of a maximum of 200 brokers per cluster, preferring more clusters of moderate size. The operational argument is compelling: a single large cluster creates a blast radius that affects every consumer and producer on it. Isolating by traffic class (tracking, logging, metrics, CDC) limits the impact of a misconfigured producer or a runaway topic.

Multi-cluster topology comes before tool selection

If you are planning for geo-replication or disaster recovery, choose your topology first: active-passive, active-active, or aggregation. Then select the tool. MirrorMaker 2 (MM2) is the open-source option; it runs on Kafka Connect and handles topic replication via MirrorSourceConnector, offset translation via MirrorCheckpointConnector, and heartbeats via MirrorHeartbeatConnector. The main operational catch is that offset sync is periodic (default 60 seconds), so consumers may reprocess up to that interval's messages on failover. Make your consumers idempotent regardless of which replication tool you use, and test failover at least quarterly. At LinkedIn or Uber scale, MM2 becomes an operational burden; purpose-built tools like Brooklin (LinkedIn) or uReplicator (Uber) were built to handle per-partition error isolation that MM2 cannot provide.
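
As a minimal sketch of an active-passive MM2 setup (the cluster aliases primary and dr and the bootstrap addresses are placeholders), started with connect-mirror-maker.sh mm2.properties:

# mm2.properties: replicate all topics from primary to dr
clusters = primary, dr
primary.bootstrap.servers = primary-kafka:9092
dr.bootstrap.servers = dr-kafka:9092
primary->dr.enabled = true
primary->dr.topics = .*
replication.factor = 3
# Checkpoints drive offset translation; the sync is periodic, so failover may still reprocess
emit.checkpoints.interval.seconds = 60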

For Confluent Platform or Cloud users, Cluster Linking provides byte-for-byte replication that preserves offsets exactly, at the cost of vendor lock-in.

Day-to-day cluster operations

Rolling restarts

Rolling restarts are the standard approach for applying broker configuration changes or upgrades without downtime. The procedure is straightforward: drain, restart, wait for ISR recovery, repeat. Before restarting a broker, confirm that UnderReplicatedPartitions is zero cluster-wide. If it is not, wait. After restarting, wait for the broker to fully rejoin the ISR before proceeding to the next node. Restarting a broker while another is still recovering from a previous restart risks losing min.insync.replicas headroom, which will cause producers with acks=all to start receiving NotEnoughReplicasException.

# Check cluster-wide under-replicated partitions before each step
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Graceful broker shutdown
kafka-server-stop.sh

Partition reassignment

Adding brokers to a cluster does not automatically rebalance existing partitions. Only new partitions land on new brokers by default. After adding capacity, you need to explicitly reassign partitions using either Cruise Control (LinkedIn's open-source rebalancing tool, the production standard) or kafka-reassign-partitions.sh. Always throttle reassignments to avoid saturating broker network and triggering ISR shrinkage during the move.

# Generate a reassignment plan for specified topics
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2,3" \
  --generate

# Execute with throttle (bytes/sec)
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json \
  --throttle 50000000 \
  --execute

# Verify completion
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json \
  --verify

Cruise Control automates this with goal-driven rebalancing: it accounts for rack awareness, leader replica distribution, bytes-in distribution, and disk capacity targets, reducing the manual overhead significantly on clusters with dozens of brokers.
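
Cruise Control is driven over HTTP; a dry-run rebalance request looks roughly like this (hostname and port depend on your deployment):

# Preview a goal-based rebalance without moving any replicas
curl -X POST "http://cruise-control:9090/kafkacruisecontrol/rebalance?dryrun=true"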

Consumer group and offset management

The most common operational tasks around consumer groups are offset resets (after a deployment rollback, a poison message event, or a schema change) and lag investigation. Disable auto-commit for any consumer doing meaningful processing. The robust pattern is enable.auto.commit=false, process the message, then commit explicitly after the result is durable.

For investigating consumer group state:

# List all consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe a specific group, including lag per partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-consumer-group --describe

# Reset offsets to latest (requires group to be inactive)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-consumer-group \
  --topic my-topic \
  --reset-offsets --to-latest \
  --execute

Static membership (KIP-345) is worth enabling for stateful consumers. Set group.instance.id to a stable per-instance identifier. When the consumer restarts, the broker waits for session.timeout.ms before triggering a rebalance, and when the consumer rejoins with the same instance ID, it receives the same partition assignments back. This eliminates rebalances caused by rolling deployments. When using static membership, increase session.timeout.ms to 10-30 minutes to prevent false-positive rebalances.

For the assignor, switch to CooperativeStickyAssignor (KIP-429, available since Kafka 2.4) to avoid the stop-the-world rebalance behaviour of the eager protocol. On Kafka 4.0+, the next-generation rebalance protocol (KIP-848) is GA and moves rebalance coordination entirely server-side, measured by Instaclustr at up to 20x faster than eager rebalancing.
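
Opting into the new protocol is a single client-side setting, sketched here assuming Kafka 4.0+ brokers:

# Enable the KIP-848 rebalance protocol on the consumer
group.protocol=consumer
# Under the new protocol, partition.assignment.strategy and session.timeout.ms
# are ignored; assignment is computed server-side by the group coordinator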

Observing clusters for health

Effective cluster monitoring distinguishes between three things: what you need to page on, what warrants a ticket, and what you log and review periodically.

Alert-worthy signals

These are the metrics that indicate active cluster failure or imminent unavailability:

  • kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions: Sustained non-zero across many brokers indicates a broker is offline or severely degraded. This is the single most critical cluster-level metric.
  • kafka.controller:type=KafkaController,name=ActiveControllerCount: Must be exactly 1 cluster-wide. A value of 0 or greater than 1 requires immediate investigation.
  • kafka.controller:type=KafkaController,name=OfflinePartitionsCount: Must be 0. Any offline partitions mean producers and consumers cannot access that data.
  • kafka.cluster:type=Partition,name=UnderMinIsr: Partitions below min.insync.replicas. Producers with acks=all cannot write to these partitions.
  • Consumer lag trending upward: A consumer that is falling further behind over time, not just experiencing a momentary spike, is an application-level incident.

For consumer lag monitoring, LinkedIn's Burrow is the most robust approach in the open-source space. Rather than alerting on raw offset distance (which produces false positives during batch catch-up), Burrow evaluates lag over a sliding window of commits and classifies consumer groups as OK, WARNING, or ERROR based on whether the lag trend is decreasing, stable, or growing. This significantly reduces alert fatigue. For more detail on consumer lag monitoring strategies and tooling, see how to monitor Kafka consumer lag.
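
Burrow exposes its evaluations over HTTP; a status check for one group looks like this (the cluster name my-cluster comes from your Burrow configuration and is a placeholder):

# Query Burrow's sliding-window lag evaluation for a consumer group
curl http://burrow:8000/v3/kafka/my-cluster/consumer/my-consumer-group/status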

Ticket-worthy signals

These indicate degradation worth investigating before it becomes a page:

  • kafka.server:type=ReplicaManager,name=IsrShrinksPerSec / IsrExpandsPerSec: Frequent ISR shrink/expand events on a single broker usually mean that broker has a disk or network issue. Sustained shrinks across all brokers usually mean aggregate load or a network problem.
  • kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent: Below 0.7 idle (more than 30% of request handler capacity in use) means performance is degrading. Below 0.5 means the broker is approaching saturation.

Monitoring tooling

The standard production stack is JMX exported to Prometheus via the JMX Exporter, visualised in Grafana. OpenTelemetry Collector with the JMX receiver is increasingly common as an alternative. For companies already running Datadog, the Kafka integration covers the core broker and consumer metrics. Confluent Control Center is the natural choice on Confluent Platform.

For visibility into cluster health and Kafka monitoring more broadly, the key principle from engineering teams at Cloudflare and Pinterest is to avoid alert fatigue: broad threshold alerts on per-topic metrics fire too frequently to be actionable. Focus alerts on the signals listed above, and treat per-topic byte rates and partition commit latencies as log-level observability.

Tuning producers, consumers, and brokers for performance

Producer configuration

Compression combined with larger batch sizes is the highest-leverage producer tuning available. Cloudflare's production data is instructive: Snappy plus approximately 1-second batching reduced a metrics topic from 800 Mbps to 170 Mbps. At 100 TB/month, a 3x compression ratio represents a 67% cost reduction on network egress, replication, storage, and consumer transfer.

Key producer settings:

# Compression: lz4 for speed, zstd for ratio
compression.type=lz4

# Larger batch size (default 16 KB is too small for high throughput)
batch.size=131072

# Wait for batches to fill before sending
linger.ms=10

# Durability
acks=all
enable.idempotence=true

# Compatible with idempotence
max.in.flight.requests.per.connection=5

One critical gotcha: if your producer compression.type does not match the topic's compression.type, the broker decompresses and recompresses every batch. Set the topic-level configuration to producer to pass through the producer's codec, or ensure they match explicitly.
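
Setting the topic-level codec to producer avoids the recompression cycle:

# Pass the producer's codec through unchanged at the topic level
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-topic \
  --add-config compression.type=producer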

For schema format, Avro and Protobuf compress dramatically better than JSON and have cheaper deserialisation. Shopify and Cloudflare both use Protobuf as their standard wire format.

Consumer configuration

# Process explicitly, commit explicitly
enable.auto.commit=false

# Tune for your processing capacity
max.poll.records=500

# Batch reading for throughput
fetch.min.bytes=1024

# Cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Static membership (for stateful consumers)
group.instance.id=consumer-instance-1
session.timeout.ms=600000

Set max.poll.records to a value you can reliably process within max.poll.interval.ms. If processing a batch takes longer than max.poll.interval.ms, the consumer is removed from the group and a rebalance triggers.

Broker configuration

# Increase for high-connection environments
num.network.threads=8

# Scale with disk count
num.io.threads=16

# Larger socket buffers for high-bandwidth networks
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576

Leave log.flush.interval.messages and log.flush.interval.ms at their defaults. Kafka relies on OS page cache and replication for durability. Forcing fsync eliminates that benefit and significantly reduces throughput.

At the OS level, set vm.swappiness=1 to prevent the kernel from consuming page cache for swap, vm.max_map_count=262144 or higher (each log segment requires two memory-mapped areas, and the default of 65536 is quickly exhausted on brokers with thousands of partitions), and ensure file descriptor limits are set to 100,000 or above. Mount filesystems with noatime and prefer XFS for large disks.
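
Applied with sysctl (persist them in /etc/sysctl.d/ to survive reboots):

# Kernel settings for Kafka brokers
sysctl -w vm.swappiness=1
sysctl -w vm.max_map_count=262144
# File descriptors: set LimitNOFILE=100000 or higher in the broker's systemd unit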

JVM heap should be 6-8 GB with G1GC. Never exceed 32 GB or you lose compressed ordinary object pointers, which has a meaningful impact on GC overhead.

Security and multi-tenancy

ACLs

Kafka ACLs bind principals (users, service accounts) to resources (topics, consumer groups, clusters) with specific operations (Read, Write, Create, Describe, Alter, Delete). They work well at moderate scale. As the number of teams and topics grows, the number of ACL rules grows correspondingly, and ACL management can become a significant operational overhead.

# Grant a service account read access to a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
  --add \
  --allow-principal User:my-service \
  --operation Read \
  --topic my-topic

# List ACLs for a topic
kafka-acls.sh --bootstrap-server localhost:9092 \
  --list --topic my-topic

Multi-tenancy patterns

At the cluster level, multi-tenancy in Kafka typically means some combination of namespace conventions for topics (for example, team-name.topic-name), ACLs restricting access to namespaced resources, and quota configurations to prevent any single tenant from saturating broker network or request handler threads.

Kafka quotas allow you to set byte-rate limits and request-rate limits per client ID or user principal:

# Set a producer byte-rate quota for a client
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter \
  --add-config 'producer_byte_rate=10485760' \
  --entity-type clients \
  --entity-name my-producer-client

For larger organisations, dedicated clusters per business domain or per environment (development, staging, production) provide better blast radius containment than a multi-tenant shared cluster.

KRaft and what it changes for cluster management

Why ZooKeeper was removed

ZooKeeper was a separately operated dependency: a second cluster to deploy, secure, monitor, and patch. It also became a scaling bottleneck. Every metadata change (broker registrations, partition leader elections, ACL updates, topic configuration updates) flowed through ZooKeeper. For a 10,000-partition cluster, controller failover required reading all partition metadata from ZooKeeper during initialisation, adding approximately 20 seconds to the unavailability window during an unclean failure.

How KRaft works

KRaft replaces ZooKeeper with a Raft-based consensus protocol implemented inside Kafka itself. A dedicated controller quorum, typically three or five nodes, maintains an internal __cluster_metadata topic where each metadata change is written as an event. Brokers consume this metadata stream like a regular Kafka topic. Controller failover is near-instantaneous because the incoming active controller already has all committed metadata in memory. Confluent migrated all Confluent Cloud clusters to KRaft without customer-visible downtime. Aiven migrated 15,000 servers over three months with zero downtime.
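
A sketch of one dedicated controller's configuration (node IDs, hostnames, and paths are placeholders):

# controller.properties for one node of a three-node KRaft quorum
process.roles=controller
node.id=1
controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
listeners=CONTROLLER://controller-1:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/metadata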

The operational benefits are concrete: Confluent reports approximately 40% reduction in cluster setup time and the ability to scale to millions of partitions in a single cluster (demonstrated at 2 million in lab conditions).

Migration path if you are still on ZooKeeper

Kafka 4.0 (March 2025) removed ZooKeeper support entirely. If you are running Kafka 3.x with ZooKeeper, your path is:

  1. Upgrade to Kafka 3.9 (the bridge release with the most complete migration tooling).
  2. Provision a dedicated KRaft controller quorum with zookeeper.metadata.migration.enable=true.
  3. Roll restart brokers in migration mode. The active KRaft controller copies all metadata from ZooKeeper to __cluster_metadata.
  4. Complete the dual-write phase where KRaft is authoritative and ZooKeeper is a safety net. You can still roll back at this point.
  5. Finalize: reconfigure brokers for KRaft only. No rollback is possible after this step.

Use kafka-migration-check for preflight and status checks, and monitor progress via the JMX MBean kafka.controller:type=KafkaController,name=ZkMigrationState. The final state should be MIGRATION_COMPLETED.
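
Quorum health can also be checked from the CLI:

# Describe the KRaft controller quorum status
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status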

Budget two to four weeks of dry-run testing in a non-production environment and one to three months for a full production rollout depending on cluster count.

Key constraints: migration only supports isolated mode (dedicated controllers), not combined mode. JBOD support arrived in Kafka 3.7, so older KRaft versions did not support multiple log directories. Kafka 4.0 dropped protocol versions older than 2.1 on both sides: 4.0 clients need brokers at 2.1 or higher, and 4.0 brokers need clients at 2.1 or higher, so upgrade all clients before moving brokers to 4.x.

Managing topic partitions with Kpow

Kpow provides a UI layer for the partition management operations that are otherwise handled entirely through the CLI. This is particularly useful in environments where you want to limit direct CLI access for most operators, or where you need an audit trail of partition mutations.

Topic and partition creation

Kpow's topic creation form lets you specify partition count, replication factor, and any additional configuration values. If you are not sure what configuration values are appropriate, the interface surfaces documentation inline, including the top five most common values currently set across your cluster for each parameter. This can be useful when creating topics that should match the configuration patterns already established for similar topics.

If you prefer not to grant Kpow mutation permissions but still want the UI for planning, the form generates an equivalent kafka-topics.sh command that updates reactively as you fill in the form.

Partition reassignment and leader election

From the Topics page, Kpow supports partition reassignment at the individual partition level: you select a topic partition, click the reassignment action, and choose target replicas. In-progress reassignments are visible in a dedicated Reassignment tab, where you can also cancel any in-flight reassignment. Full cluster and full topic reassignment support is on the roadmap.

For leader election, Kpow supports both preferred and unclean election types. Preferred election is the standard operation for rebalancing leader distribution after a rolling restart. Unclean election is a last-resort option that allows a replica not in the ISR to become leader, with the trade-off of potential data loss.

URP detection

Kpow provides enhanced URP detection that correctly identifies partitions where the in-sync replica count is below the configured replication factor, even when a broker is offline and not visible to the AdminClient. URP totals are displayed on both the Brokers and Topics pages, and if the count is greater than zero, a detailed table expands to list every affected topic and partition. This makes it practical to identify URP scope quickly during an incident without running kafka-topics.sh --under-replicated-partitions manually across multiple brokers.

Consumer group management

Kpow's Groups UI handles consumer group offset management across the full range of dimensions: whole-group, host-level, topic-level, and partition-level. Offset resets are scheduled and held until the consumer group reaches an EMPTY state, which prevents modifications to an active group from causing consistency issues. For static consumer groups, you can also remove individual members to trigger a rebalance without waiting for session.timeout.ms to expire.

Moving from reactive to proactive cluster management

Most Kafka problems do not arrive suddenly. ISR shrink rates creep up before brokers start failing. Consumer lag trends upward before an application team reports an issue. Request handler idle percentages decline gradually before a broker becomes saturated. The pattern is almost always visible in the metrics before it becomes an incident, provided you are looking in the right place with the right resolution.

The practical shift toward proactive cluster management involves three things. First, instrumenting the signals that matter at the cluster level, the ones listed in the monitoring section above, and ensuring they flow into an alerting layer before they become visible to end users. Second, running rebalancing and maintenance operations on a schedule rather than in response to imbalance. Tools like Cruise Control are designed to run continuously and maintain goal-based cluster health rather than waiting for manual intervention. Third, maintaining an observation layer that gives you a complete picture of partition health, consumer group state, and broker configuration across all clusters in one place, rather than assembling that picture from multiple CLI commands during an incident.

If you want to explore what a unified observation layer looks like in practice, Kpow offers a free 30-day trial that you can connect to any Kafka cluster in minutes and deploy via Docker, Helm, or JAR.

Checklist: greenfield cluster in 2026

For reference, here are the baseline configuration choices validated against the practices of teams at LinkedIn, Pinterest, Netflix, Cloudflare, and Stripe:

  • Kafka 4.0+, KRaft only, isolated mode: three dedicated controllers plus three or more brokers across three AZs.
  • replication.factor=3, min.insync.replicas=2, acks=all, enable.idempotence=true.
  • NVMe/SSD or provisioned EBS; 6-8 GB JVM heap; rest to OS page cache.
  • LZ4 producer compression, batch.size=131072, linger.ms=10-20.
  • CooperativeStickyAssignor and group.instance.id on every stateful consumer.
  • Cruise Control (or equivalent) for rebalancing from day one.
  • JMX to Prometheus to Grafana, plus Burrow for consumer lag trend analysis.
  • Tiered storage enabled at topic creation for analytics or long-retention topics.
  • enable.auto.commit=false; commit offsets explicitly after processing is durable.
  • Avro or Protobuf with a Schema Registry.

Scale up brokers when sustained BytesInPerSec plus ReplicationBytesInPerSec exceeds 60% of NIC capacity, or when RequestHandlerAvgIdlePercent drops below 0.5. Consider adding a second cluster when approaching 200 brokers in one cluster, when a single noisy tenant repeatedly causes cross-tenant impact, or when regulatory data residency requirements make cluster isolation necessary.