This post explores a breaking change to Apache Kafka producer behaviour, introduced in Kafka 3.2.0.
Update: We raised a PR to the Apache Kafka site and this information is now included in the upgrade-notes for Kafka 3.2.0.
Apache Kafka 3.2.0 implements KIP-679, which changes the default producer configuration to enable idempotence.
This change can cause message production to fail after you update to the 3.2.0 kafka-client libraries. It briefly impacted Kpow v88.6 (fixed in v88.7).
The change was originally released in Kafka 3.0.0 via KAFKA-10619, but a bug in config validation meant it was not fully implemented until KAFKA-13598 in Kafka 3.2.0.
It appears that issues were identified with the idea of changing default behaviour as KAFKA-13673 disables default idempotency when certain configuration is set on the producer, and KAFKA-13759 disables this change entirely for Kafka Connect. The cause of the issue that briefly impacted Kpow can be found in that last ticket:
> for brokers older than version 2.8 the IDEMPOTENT_WRITE ACL is required to be granted to the principal
In this post we explore:
- Identifying this issue in Kpow
- Details of the breaking change
- Potential scope of impact
- Issue remediation
- Further implications
What is Kpow?
Kpow provides enterprise-grade monitoring, management, and control of Apache Kafka Clusters, Schema Registries, and Connect installations.
Uniquely for a product in use since 2018, Kpow is built for and from Kafka. We use Kafka Streams and internal topics for system state and long-term metrics computation (e.g. topic last-write and group last-read telemetry). We use Kpow to monitor and build Kpow; it's turtles all the way down.
The combination of being widely used and well integrated with Kafka means we often have user-reports of issues before they become commonly known, as can be seen from our recent blog post on memory issues with Amazon Corretto 11. See the Kpow Changelog for full release notes.
Kpow Data Inspect UI
Impact on Kpow
Within one day of releasing Kpow v88.6 we received a user report that Kpow was showing the following error in the application logs:
10:06:29.278 ERROR [OperatrScheduler_Worker-2] operatr.compute.v3.materialization -- Error creating simple metrics for a7ca88d8-e1a7-4dec-9ae0-7248f5624c1d :cluster lkc-xxxx
org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state
    at org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:1125)
    at org.apache.kafka.clients.producer.internals.TransactionManager.maybeAddPartition(TransactionManager.java:442)
    at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:998)
    <...snip...>
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.apache.kafka.common.errors.ClusterAuthorizationException: Cluster authorization failed.
Unfortunately the error log contained this red herring:
Caused by: org.apache.kafka.common.errors.ClusterAuthorizationException: Cluster authorization failed.
Kpow runs with every flavour of Kafka from v1.0.0 onwards. Yes, there are teams in the wild still using Kafka v1.0.0.
Our product monitors and manages Kafka clusters in organizations from publishing to payment networks. Where a cluster is protected by ACLs the Kpow user must have the minimum set of ACL permissions required for our product to work. When you don't have the correct permissions configured, you see that error.
We asked the user to check their ACL permissions, as this normally resolves the issue quickly. They insisted their permissions were correct. We ran up a local 3-node Kafka cluster with ACLs configured (see our docker-compose configuration for a local SASL authenticated 3-node Kafka cluster that allows ACL testing) and ran Kpow without encountering any issue. This was a proper head scratcher. Then two things happened at roughly the same time.
Firstly we noticed this part of the error log:
org.apache.kafka.common.KafkaException: Cannot execute transactional method because we are in an error state
We don't use transactional producers. We're aware of what they are and the role they play, but we have no need for them and have not configured them.
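The pairing of those two messages is the useful signal here. As a rough illustration only, the following Python sketch flags a stack trace that contains both the transactional error and the cluster authorization error when the application has not configured transactions. The function name and the `uses_transactions` flag are hypothetical, and matching on message substrings is an assumption that will not hold for every log format.

```python
def looks_like_idempotence_acl_failure(stack_trace: str, uses_transactions: bool) -> bool:
    """Heuristic: both errors together, without transactions, hint at an
    idempotent producer that lacks the IDEMPOTENT_WRITE ACL."""
    has_transactional_error = "Cannot execute transactional method" in stack_trace
    has_cluster_auth_error = "ClusterAuthorizationException" in stack_trace
    # A genuinely transactional producer can raise the same pair of errors
    # for other reasons, so only flag the non-transactional case.
    return has_transactional_error and has_cluster_auth_error and not uses_transactions
```

A match is only a prompt to check whether the producer has become idempotent after a client library upgrade. It is not proof of the cause.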
Then our tenacious user came back to us with an update:
> I think there's a bug in the 88.6 version.
> I rolled back to the 88.5 with the same set of permissions listed in this email and I'm able to see the metrics
We closed a record 49 minor issue tickets in Kpow v88.6. The majority were either old and redundant, or tweaks to our UI/UX. There were zero changes related to message production other than this seemingly innocuous library version bump:
<< [org.apache.kafka/kafka-streams "3.1.0"] >> [org.apache.kafka/kafka-streams "3.2.0"]
We favour keeping close to latest of major libraries like Apache Kafka since they offer great quality and reliability.
That said we are fastidious about our dependency management. We read release notes and upgrade guides where available, changelogs if they exist, and if required we'll look at the commit diff between different versions of a library to determine if it is safe to proceed. We don't read every KIP though.
In this case if there was a note on a breaking change, we missed it. Reverting the Kafka-Streams library back to 3.1.0 fixed our issue and we released Kpow v88.7.
Breaking Change Details
The switch to default idempotent Producers can cause production to fail where:
- The Kafka Cluster has brokers running version < 2.8.0, and
- The Kafka Cluster has ACLs configured, but not IDEMPOTENT_WRITE, and
- Producer configuration is default, or is capable of being defaulted to idempotent, and
- The producing application is using Kafka-Clients version >= 3.2.0
This is because in Kafka prior to v2.8.0 there was an ACL specifically for idempotent production named IDEMPOTENT_WRITE.
If you are not concerned with idempotency and have ACLs configured, it is likely that IDEMPOTENT_WRITE is not granted, and TOPIC_WRITE is set instead.
If your producers have default configuration, or are not explicitly idempotent but fall within the bounds of KAFKA-13673, and you update the client libraries to 3.2.0 or later, you will now have idempotent producers and they will fail to write with the ACL error that we encountered.
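The conditions above can be combined into a single check. The following is a minimal Python sketch, not part of Kpow or the Kafka client. The function names and input shapes are hypothetical, and the rules for when the default is switched off reflect our reading of KAFKA-13673 (a user-supplied acks other than all, retries of 0, or more than 5 in-flight requests), so treat them as an assumption.

```python
from typing import Dict, Set, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.8' or '2.7.1' into a 3-part tuple so versions compare numerically."""
    parts = [int(part) for part in version.split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def resolves_to_idempotent(producer_config: Dict[str, str]) -> bool:
    """Approximate how a 3.2.0+ client resolves enable.idempotence.

    Assumption: an explicit setting wins; otherwise conflicting values for
    acks, retries or max.in.flight.requests.per.connection disable the default.
    """
    if "enable.idempotence" in producer_config:
        return producer_config["enable.idempotence"].lower() == "true"
    if producer_config.get("acks", "all") not in ("all", "-1"):
        return False
    if int(producer_config.get("retries", "1")) == 0:
        return False
    if int(producer_config.get("max.in.flight.requests.per.connection", "5")) > 5:
        return False
    return True


def at_risk(
    broker_version: str,
    client_version: str,
    acls_enabled: bool,
    cluster_operations: Set[str],
    producer_config: Dict[str, str],
) -> bool:
    """True when all four conditions for the breaking change hold."""
    return (
        parse_version(broker_version) < (2, 8, 0)
        and acls_enabled
        and "IDEMPOTENT_WRITE" not in cluster_operations
        and parse_version(client_version) >= (3, 2, 0)
        and resolves_to_idempotent(producer_config)
    )
```

For example, `at_risk("2.7.0", "3.2.0", True, {"WRITE"}, {})` returns `True`, while the same call with `{"acks": "1"}` as the producer configuration returns `False`. The sketch deliberately ignores wildcard or ALL-operation ACLs, which a real check would need to account for.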
Potential Scope of Impact
Perhaps Kpow is an uncommon application, being widely used against nearly every sort of Kafka cluster.
However if we take another look at the conditions for the breaking change to occur:
- The Kafka Cluster has brokers running version < 2.8.0. This is fairly common.
- The Kafka Cluster has ACLs configured but not IDEMPOTENT_WRITE. This is fairly common.
- Producer configuration is default, or is capable of being defaulted to idempotent. This is fairly common.
- The producing application is using Kafka-Clients version >= 3.2.0. This is very easy to do.
The first three circumstances are more common than you might expect, and bumping a Kafka client library in your application dependencies is really easy to do. Much easier than upgrading the version of your Kafka brokers, for instance.
Issue Remediation
If you encounter this issue it is fairly easy to remediate. You have several options:
- Roll back your kafka-client library to < 3.2.0, or
- Configure producer idempotency to false, or
- Configure IDEMPOTENT_WRITE ACLs, or
- Upgrade your broker version to >= 2.8.0
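Of these options, explicitly disabling idempotence is usually the smallest change. As a sketch only, the Python snippet below adds the producer setting `enable.idempotence=false` to an existing configuration and renders it in properties format. The helper names and the `localhost:9092` bootstrap address are hypothetical placeholders, and how the resulting properties reach your producer depends on your application.

```python
from typing import Dict


def with_idempotence_disabled(producer_config: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the config with idempotence explicitly switched off."""
    merged = dict(producer_config)
    merged["enable.idempotence"] = "false"
    return merged


def to_properties(config: Dict[str, str]) -> str:
    """Render a config mapping as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


# Hypothetical base configuration, for illustration only.
base_config = {"bootstrap.servers": "localhost:9092"}
print(to_properties(with_idempotence_disabled(base_config)))
```

Setting the value explicitly, rather than relying on a conflicting setting such as acks to switch the default off, keeps the intent visible to whoever next bumps the client library.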
Further Considerations
This change to Kafka leaves us in a slightly strange state. Is your producer idempotent by default or not? The answer is maybe. If you haven't configured anything contrary to idempotency then yes, it should be idempotent. That's a weird default position to hold, and it requires more knowledge from a user than I would expect.
If enterprise-grade Apache Kafka tooling with a focus on performance and reliability interests you, sign up for a free 30-day trial today.