
How PagerDuty uses Apache Kafka in production


Factor House
May 16th, 2026

PagerDuty operates one of the more instructive Kafka deployments in the incident management space: Apache Kafka underpins at least four distinct production systems, each with its own partitioning strategy, language runtime, and failure mode. The engineering blog documents these systems in unusual detail, including a candid post-incident report on the August 2025 outages, in which a single API quirk in the pekko-connectors-kafka library generated 4.2 million new Kafka producers per hour, 84 times the normal rate, exhausting JVM heap across the cluster and causing roughly 95% of incoming events to be rejected for 38 minutes.

Kafka's role at PagerDuty is to decouple a high-throughput alerting platform from the downstream systems responsible for processing events, scheduling notifications, and delivering them to on-call engineers, often under strict latency requirements.

Company overview

PagerDuty provides incident management and on-call scheduling software used by operations teams to detect, triage, and respond to production incidents. The platform handles hundreds of millions of API requests daily from customer monitoring systems; downstream, notifications must reliably reach the right on-call engineer through the right contact method, at the right time.

PagerDuty's adoption of Kafka was not a single architectural decision but a gradual replacement of earlier queuing approaches. The original notification system, called Artemis, used Apache Cassandra as a polling-based message queue. By 2017, PagerDuty engineers were publishing blog posts describing a Kafka-backed distributed task scheduler. By 2018, a senior engineer presented a Kafka Summit talk describing the migration of the core event ingestion pipeline from Cassandra-based queuing to Kafka. By January 2019, the Notification Scheduling Service (NSS) had replaced Artemis entirely and was live in production.

  • April 2017: David Van Geest publishes the first post in the "Distributed Task Scheduling with Akka, Kafka, Cassandra" series, describing the Scheduler architecture
  • August 2017: Part 2 of the Scheduler series is published, covering dynamic load handling
  • October 2018: Chris Morris presents "From Propeller to Jet" at Kafka Summit San Francisco, describing the migration from Cassandra queuing to Kafka for event ingestion
  • December 2018: PagerDuty/scheduler is open-sourced on GitHub
  • January 2019: The Notification Scheduling Service (NSS), the Elixir/Kafka replacement for Artemis, goes live
  • August 2019: Jon Grieman presents the Elixir + CQRS incident timeline service at ElixirConf, noting it has been in production for three years
  • May 2020: Tanvir Pathan publishes "Writing Intelligent Health Checks for Kafka Services"
  • June 2020: Elora Burns publishes "Elixir at PagerDuty: Faster Processing with Stateful Services", documenting the NSS architecture and performance results
  • May 2021: Tamim Khan publishes a post-mortem of the dentry cache investigation affecting Kafka hosts in staging
  • 28 August 2025: Two Kafka outages (03:53 UTC and 16:38 UTC) disrupt event ingestion for US customers due to a pekko-connectors-kafka producer proliferation bug
  • 4 September 2025: PagerDuty publishes a full post-incident report for the August 28 outages

PagerDuty's Kafka use cases

PagerDuty uses Kafka across four production systems. Each serves a distinct function and was built by a different engineering team, which explains the variation in language runtimes, partitioning strategies, and architectural patterns across the deployment.

Event ingestion pipeline

Kafka is described by PagerDuty engineers as "the backbone of our decoupled, async architecture." Customer monitoring systems publish alerts through PagerDuty's Events API; those events are written to Kafka topics and consumed by downstream services that handle incident creation, escalation routing, and notification dispatch.

The Event Ingestion Admin (EIA) service sits in this pipeline. It consumes from multiple Kafka topics, including incoming_events and failed_events, and persists event state to ElastiCache for observability. The EIA service is written in Elixir; its GenServer-based health check polls partition offsets every 10 seconds to detect consumer stalls.

Notification Scheduling Service

The Notification Scheduling Service (NSS) replaced the earlier Artemis system, which used Cassandra as a polling queue. Artemis required all processing to run in a single US West Coast datacenter to keep Cassandra query latency acceptable. NSS, built in Elixir on the BEAM runtime, uses Kafka to distribute notification traffic while decoupling message distribution from datacenter placement.

NSS publishes three categories of message across three parallel topic sets: updates to incident assignments and state, updates to user contact methods and notification rules, and feedback from downstream systems such as telephony confirmations. Users are consistently routed to the same partition set across all three topic categories, which allows a single Elixir process to hold all state for a given user in memory without cross-process synchronisation.

Elora Burns documented the results when NSS went live in January 2019: 10x throughput, half the lag time, and one-tenth the compute and storage footprint compared to Artemis.

Distributed task scheduler

The Scheduler library, open-sourced by PagerDuty in December 2018, uses Kafka as its queuing and partitioning layer for scheduled task execution. Services publish serialised task metadata to a Kafka topic; the Scheduler library consumes those messages and executes each task at its scheduled time. Use cases include emailing an engineer two days before an on-call shift and push-notifying a responder one minute after an incident opens.

Cassandra provides durable task persistence alongside Kafka: tasks are written to Cassandra at publication time and removed on completion. The Kafka message carries enough information to locate the persisted task record; Cassandra is the source of truth for durability.
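
The persist-then-publish flow is straightforward to sketch. The Scala sketch below assumes the DataStax Java driver and a hypothetical tasks table; the schema, topic name, and TaskPointer shape are illustrative, not the Scheduler library's actual API.

```scala
import java.util.UUID
import com.datastax.oss.driver.api.core.CqlSession
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Pointer published to Kafka: just enough to find the Cassandra row.
final case class TaskPointer(taskId: UUID, orderingId: String)

final class TaskPublisher(session: CqlSession, producer: KafkaProducer[String, String]) {
  // Persist the full task to Cassandra first (the source of truth),
  // then publish a lightweight pointer to the scheduler topic.
  def publish(orderingId: String, payload: String): TaskPointer = {
    val taskId = UUID.randomUUID()
    session.execute(
      session.prepare( // in production, prepare once and reuse
        "INSERT INTO tasks (task_id, ordering_id, payload) VALUES (?, ?, ?)")
        .bind(taskId, orderingId, payload))
    producer.send(new ProducerRecord("scheduler-tasks", orderingId, taskId.toString))
    TaskPointer(taskId, orderingId)
  }

  // On successful execution, the consumer removes the persisted record.
  def complete(pointer: TaskPointer): Unit =
    session.execute(
      session.prepare("DELETE FROM tasks WHERE task_id = ?").bind(pointer.taskId))
}
```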

Incident timeline CQRS service

An Elixir service powering PagerDuty's incident timeline entries uses Kafka with a CQRS pattern. It grew out of refactoring a component of the Rails monolith into an Elixir umbrella application backed by Kafka for event streaming. Jon Grieman described it at ElixirConf 2019 as one of PagerDuty's most frequently used services at the time, noting it had already been in production for three years.

Scale and throughput

PagerDuty does not publish aggregate cluster-level metrics, but several concrete figures appear across the engineering blog posts and the August 2025 incident report.

  • Partitions per topic (EIA service): 64–100 (Tanvir Pathan, May 2020)
  • Kafka producer creation rate at incident peak: 4.2 million new producers per hour, 84x normal (PagerDuty incident report, September 2025)
  • Event rejection rate at peak: roughly 95% of create-event requests rejected (PagerDuty incident report, September 2025)
  • Sustained error rate: 18.87% of create-event requests returning 5xx (PagerDuty incident report, September 2025)
  • Cluster partition limit, noted as a design constraint: ~50,000 partitions per cluster (Elora Burns, June 2020)
  • Scheduler topic partition count: "low hundreds" at provisioning time (PagerDuty/scheduler GitHub, design.md)
  • NSS improvement over Artemis: 10x throughput, 0.5x lag, 0.1x compute and storage (Elora Burns, June 2020)

No absolute messages-per-second or bytes-per-second figures have been published. No total topic count or cluster count is disclosed publicly.

PagerDuty's Kafka architecture

Deployment model

PagerDuty operates a multi-datacenter deployment. The Scheduler documentation describes runbooks for isolating a degraded datacenter by sequentially shutting down application nodes, Kafka brokers, and Cassandra nodes in the affected site, then allowing the healthy datacenter to recover. The NSS post notes that decoupling message distribution from datacenter placement was a primary motivation for replacing Artemis.

Multiple Kafka clusters are in use. The Scheduler documentation references an internal "bitpipe" Kafka cluster offered as a shared resource, separate from per-service clusters. Topics are partitioned and replicated across brokers in a leader/follower configuration.

Broader data platform

Kafka acts as the async messaging layer between the Events API and downstream processing services, including the EIA service, NSS, and the incident timeline service. Cassandra provides durable persistence alongside Kafka in both the Scheduler and the earlier Artemis system. ElastiCache (Redis) stores ingested event state consumed from EIA Kafka topics. The JVM-side producer code uses Akka (now Pekko), while NSS and the EIA service run on Elixir/BEAM.

Producer architecture

The Events API service uses pekko-connectors-kafka for its Kafka producer abstraction. PagerDuty's August 2025 incident traced directly to how this library handles producer instantiation: passing Kafka settings to the library silently creates a new producer instance rather than reusing an existing one, with no new keyword at the call site to signal the allocation. The incident report characterised this API design as a "visibility gap" that made the bug difficult to catch in code review.
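
To make the failure mode concrete, here is a minimal Scala sketch of the two call patterns, using the pekko-connectors-kafka API as documented upstream (ProducerSettings, Producer.plainSink, createKafkaProducer, withProducer). The topic name and handler functions are hypothetical; PagerDuty has not published its actual Events API code.

```scala
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.kafka.ProducerSettings
import org.apache.pekko.kafka.scaladsl.Producer
import org.apache.pekko.stream.scaladsl.Source

object UsageTracking {
  implicit val system: ActorSystem = ActorSystem("events-api")

  val settings: ProducerSettings[String, String] =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Pitfall: a sink built from bare settings constructs a brand-new
  // KafkaProducer on every stream materialisation -- no `new` in sight.
  def recordUsageLeaky(event: String) =
    Source.single(new ProducerRecord[String, String]("api-usage", event))
      .runWith(Producer.plainSink(settings)) // one producer per call

  // Fix: create one KafkaProducer up front and share it across streams.
  private val shared = settings.createKafkaProducer()
  private val sharedSettings = settings.withProducer(shared)

  def recordUsage(event: String) =
    Source.single(new ProducerRecord[String, String]("api-usage", event))
      .runWith(Producer.plainSink(sharedSettings)) // reuses the producer
}
```

Called once per API request, the first form allocates a producer, with its buffers and broker connections, on every request: the pattern that matches the heap exhaustion described in the incident report.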

The Transactional Outbox pattern is used to guarantee at-least-once delivery. Messages are written to a database table as part of the originating database transaction; a scraper then reads pending messages and publishes them to Kafka. During the August 2025 outage, this ensured no event data was lost even while the Kafka tier was degraded.

Consumer architecture

The Scheduler library implements a single-threaded Kafka consumer poll loop. All partition reassignment callbacks (onPartitionsRevoked, onPartitionsAssigned) execute within this poll thread, enabling synchronous blocking during rebalances. On partition revocation, the actor system shuts down, in-flight tasks complete, and database connections close before the partition is yielded. This design avoids concurrency issues during rebalancing in exchange for a deliberately simple, blocking execution model.

The EIA service uses a GenServer-based consumer in Elixir. The health check polls Kafka partition offsets every 10 seconds and compares the result against a stored snapshot to detect consumer stall, independent of message volume. Consul pings the health endpoint on the same 10-second interval.

Stream processing

ksqlDB is referenced in the August 2025 incident report as part of the Kafka tier, but no further architectural detail is available in published sources.

Special techniques and engineering decisions

Consistent user-to-partition routing in NSS

The NSS design routes each user to a fixed set of Kafka partitions across all three parallel topic sets. The mapping is static: all traffic for a given user arrives on the same partitions regardless of which topic set carries the message. This allows a single Elixir Partition Owner process to hold the complete state for each user in memory, eliminating cross-process synchronisation. The approach accepts the constraint that repartitioning requires redeployment in exchange for simplicity in the stateful consumer design.
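
A minimal sketch of the routing idea, against the plain Kafka producer API. The topic names, the hash-modulo mapping, and the UserRouter class are all hypothetical; NSS itself is written in Elixir and its exact mapping function is unpublished.

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

final class UserRouter(partitionCount: Int, producer: KafkaProducer[String, String]) {
  // Static, deterministic mapping: the same user always lands on the same
  // partition index, whichever of the three topic sets carries the message.
  private def partitionFor(userId: String): Int =
    math.floorMod(userId.hashCode, partitionCount)

  private def send(topic: String, userId: String, payload: String): Unit =
    producer.send(new ProducerRecord(topic, Int.box(partitionFor(userId)), userId, payload))

  def sendIncidentUpdate(userId: String, payload: String): Unit =
    send("nss-incident-updates", userId, payload)

  def sendContactMethodUpdate(userId: String, payload: String): Unit =
    send("nss-contact-updates", userId, payload)

  def sendTelephonyFeedback(userId: String, payload: String): Unit =
    send("nss-telephony-feedback", userId, payload)
}
```

Because all three topic sets share one partition index per user, the consumer side can run a single stateful process per partition that owns every message for its users.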

Partition-locked serial task execution in the Scheduler

The Scheduler library assigns tasks to partitions using a modulo-based Partitioner class keyed on orderingId. All tasks for the same orderingId route to the same partition and are processed serially by a single node. This guarantees ordering within a task group without any distributed locking. The partition count is fixed at provisioning time in the "low hundreds": although Kafka can add partitions to an existing topic, doing so would change the modulo mapping and silently break the per-orderingId ordering guarantee.
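
The modulo idea maps directly onto Kafka's standard Partitioner interface. A hedged sketch, assuming the orderingId is the record's string key; the Scheduler's actual Partitioner class may differ in detail:

```scala
import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

final class OrderingIdPartitioner extends Partitioner {
  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte],
                         cluster: Cluster): Int = {
    val partitionCount = cluster.partitionsForTopic(topic).size
    // Same orderingId -> same partition -> serial execution on one node.
    math.floorMod(key.asInstanceOf[String].hashCode, partitionCount)
  }

  override def close(): Unit = ()
  override def configure(configs: JMap[String, _]): Unit = ()
}
```

Registered via the partitioner.class producer config, every record with the same key lands on the same partition, so a single consumer processes each task group serially.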

Transactional outbox for delivery guarantees

PagerDuty applies the Transactional Outbox pattern as a reliability layer between its application databases and Kafka. Messages are written to a database table within the same transaction as the business operation that generates them. A background scraper reads the outbox table and publishes messages to Kafka. This separates the durability guarantee (the database transaction) from the messaging guarantee (Kafka delivery), and proved effective during the August 2025 outage when Kafka was temporarily unavailable but no events were lost.
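
A minimal sketch of the scraper half of the pattern, assuming a hypothetical outbox table (id, topic, msg_key, payload, published) in PostgreSQL; PagerDuty has not published its outbox schema or scraper code.

```scala
import java.sql.DriverManager
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object OutboxScraper extends App {
  // Rows are inserted into `outbox` inside the business transaction;
  // this process only reads and publishes them.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/events")

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "all") // only mark rows published once Kafka has them
  val producer = new KafkaProducer[String, String](props)

  val mark = conn.prepareStatement("UPDATE outbox SET published = true WHERE id = ?")

  while (true) {
    // Rows committed by business transactions but not yet on Kafka.
    val rs = conn.createStatement().executeQuery(
      "SELECT id, topic, msg_key, payload FROM outbox WHERE published = false ORDER BY id LIMIT 100")
    while (rs.next()) {
      producer.send(new ProducerRecord(
        rs.getString("topic"), rs.getString("msg_key"), rs.getString("payload")
      )).get() // block until acked; a crash here just replays the row
      mark.setLong(1, rs.getLong("id"))
      mark.executeUpdate()
    }
    Thread.sleep(1000) // poll interval; delivery is at-least-once
  }
}
```

A crash between the send and the UPDATE replays the row on the next pass, so the scraper is at-least-once by construction, which is exactly the guarantee the pattern promises.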

Single-threaded consumer poll loop for controlled rebalancing

Rather than handling partition assignment callbacks on a separate thread, the Scheduler's consumer design runs all Kafka consumer interaction, including reassignment callbacks, within a single poll thread. This simplifies shutdown during rebalancing: the callback can block, wait for in-flight work to drain, and close resources in a predictable order before yielding the partition. The trade-off is that the poll thread cannot process new messages during a rebalance, which the design accepts as a reasonable cost for operational simplicity.
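
The same structure expressed against the raw Kafka consumer API; the drain and process hooks below are placeholders for the Scheduler's actor shutdown and task execution, not PagerDuty's actual code.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

object SchedulerPollLoop extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "scheduler")
  props.put("enable.auto.commit", "false")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)

  def drainInFlightWork(): Unit = () // placeholder: stop actors, close DB connections
  def process(record: ConsumerRecord[String, String]): Unit = () // placeholder: run the task

  consumer.subscribe(java.util.List.of("scheduler-tasks"), new ConsumerRebalanceListener {
    // Both callbacks run inside poll(), on this same thread, so blocking
    // here is safe and shutdown happens in a predictable order.
    override def onPartitionsRevoked(partitions: java.util.Collection[TopicPartition]): Unit = {
      drainInFlightWork()   // let in-flight tasks finish
      consumer.commitSync() // commit positions before yielding partitions
    }
    override def onPartitionsAssigned(partitions: java.util.Collection[TopicPartition]): Unit = ()
  })

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach(process)
    consumer.commitSync()
  }
}
```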

Operating Kafka at scale

Monitoring and observability

The Scheduler service uses Datadog dashboards with configurable tags to track instances across applications and environments. Published metrics include tasks enqueued to Kafka, task batch persistence to Cassandra, task execution counts, task queue depth in the SchedulerActor and ExecutorActor, and stale task counts via raw Cassandra queries.

The EIA service uses a GenServer-based health check that polls Kafka partition offsets every 10 seconds and compares them against a stored snapshot. Two earlier implementations were discarded: a time-based threshold approach that produced false positives on low-throughput topics with natural traffic gaps, and an offset-tracking approach that updated state on every message and crashed the GenServer under production load.

The August 2025 incident report identified gaps in monitoring that allowed the producer proliferation bug to go undetected until customer impact was severe: no alerting on Kafka producer count, no JVM heap alerting on the Kafka tier, no Kafka producer or consumer telemetry anomaly detection, and no automated service dependency map visibility.

Incident response and failover

For the Scheduler service, the documented runbook for datacenter degradation is: shut down all Scheduler application nodes, then shut down all Kafka brokers, then shut down all Cassandra nodes in the affected datacenter. This sequence allows the healthy datacenter to recover without split-brain state.

The August 2025 outage was mitigated in both instances by doubling JVM heap size and performing rolling restarts. Root cause identification required correlating JVM heap exhaustion with Kafka producer count metrics, a connection that was not immediately obvious from the monitoring available at the time.

Planned improvements post-August 2025

Following the incident, PagerDuty committed to: expanding JVM and Kafka-level monitoring, strengthening service dependency mapping, automating customer impact metrics into incident workflows, adding stricter change management guardrails with rollout readiness checklists, running monthly chaos engineering drills for incident procedures, and automating status page update workflows.

Challenges and how they solved them

Producer proliferation from a library API quirk (August 2025)

Problem: A new API usage-tracking feature caused 4.2 million new Kafka producers to be created per hour at 75% feature rollout, 84 times the normal rate. The resulting JVM heap exhaustion across the cluster caused roughly 95% of incoming events to be rejected over a 38-minute window.

Root cause: The pekko-connectors-kafka library instantiates a new Kafka producer whenever Kafka settings are passed to it directly; it reuses an existing producer only when a pre-created instance is passed instead. Because the API requires no new keyword, the instantiation is implicit and easy to miss in code review. The new feature triggered this path on every API request.

Solution: Mitigation in both incidents involved doubling JVM heap size and performing rolling restarts. After the second incident, engineers identified the root cause and rolled back the feature.

Outcome: No event data was lost, thanks to the Transactional Outbox pattern. PagerDuty committed to a range of monitoring, change management, and chaos engineering improvements to prevent similar gaps. The incident affected US customers across two 38-minute windows roughly 13 hours apart on August 28, 2025.

Dentry cache exhaustion causing Kafka host unresponsiveness (2020-2021)

Problem: Kafka hosts in PagerDuty's staging environment became intermittently unresponsive for tens of seconds, causing client connectivity failures, under-replicated partitions, and leader elections. Newly provisioned machines showed identical symptoms after several days, ruling out hardware degradation.

Root cause: A third-party vendor application generated filesystem lookups for non-existent file paths on the root directory, creating massive negative dentry cache entries in the Linux kernel. perf top showed the kernel spending 55% of CPU time in __fsnotify_update_child_dentry_flags. The negative dentry entries accumulated over time until they triggered the lockup behaviour.

Solution: Dropping the dentry cache (on Linux, typically via echo 2 > /proc/sys/vm/drop_caches, which frees dentry and inode caches) immediately resolved the lockup. The vendor issued a patched build that fixed the root cause.

Outcome: The investigation, documented by Tamim Khan, illustrates how Kafka host instability can originate entirely outside Kafka itself, and highlights the value of kernel-level profiling tools when broker behaviour is inexplicable at the application layer.

Health check false positives for low-throughput topics (2020)

Problem: The EIA service's first health check approach flagged a consumer as unhealthy if no message had been processed in the previous 10 seconds. This produced false positives on topics with naturally sparse traffic.

Root cause: The threshold-based design assumed continuous message flow. A second approach that tracked offsets on every event crashed the GenServer under production message volumes.

Solution: The final design polls Kafka partition offsets every 10 seconds and compares the result against a stored snapshot. If offsets have not advanced across two consecutive polls on an active topic, the consumer is flagged as stalled. This approach is independent of message volume and stable under production load.
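
The published implementation is an Elixir GenServer, but the snapshot-comparison logic is language-agnostic. Below is a sketch in Scala using the Kafka AdminClient, with hypothetical names; it flags a stall only when messages exist beyond the committed position and that position has not moved since the previous check.

```scala
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, OffsetSpec}
import org.apache.kafka.common.TopicPartition

final class StallDetector(admin: AdminClient, groupId: String) {
  // Committed offsets observed on the previous check.
  private var lastCommitted: Map[TopicPartition, Long] = Map.empty

  /** Call on the 10-second health interval. Healthy when the topic is
    * quiet or the group's committed offsets are still advancing. */
  def healthy(): Boolean = {
    val committed = admin.listConsumerGroupOffsets(groupId)
      .partitionsToOffsetAndMetadata().get()
      .asScala.map { case (tp, om) => tp -> om.offset() }.toMap

    val latest = admin
      .listOffsets(committed.keys.map(tp => tp -> OffsetSpec.latest()).toMap.asJava)
      .all().get()
      .asScala.map { case (tp, info) => tp -> info.offset() }.toMap

    // Stalled: messages exist beyond the committed position, yet that
    // position has not moved since the previous snapshot.
    val stalled = committed.exists { case (tp, offset) =>
      latest.getOrElse(tp, offset) > offset && lastCommitted.get(tp).contains(offset)
    }
    lastCommitted = committed
    !stalled
  }
}
```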

Scaling notification delivery beyond Cassandra polling (pre-2019)

Problem: The Artemis notification system used Cassandra polling as its queuing mechanism. The polling approach tied all processing to a single US West Coast datacenter to keep query latency acceptable, limiting geographic redundancy and imposing high infrastructure costs.

Solution: NSS replaced Artemis with an Elixir/Kafka design that routes users to fixed partition sets across topic categories. Kafka handles message distribution; Elixir processes on BEAM hold user state in memory without database polling.

Outcome: 10x throughput, half the lag, and one-tenth the compute and storage footprint at launch. The design also removed the datacenter placement constraint that had limited Artemis.

Full tech stack

  • Apache Kafka (message broker): core async messaging layer for event ingestion, notification delivery, task scheduling, and incident timeline event streaming
  • ksqlDB (stream processing): referenced as part of the Kafka tier in the August 2025 incident report; no further architectural detail published
  • Apache Cassandra (storage): durable task persistence in the Scheduler; primary datastore in the prior Artemis notification system
  • ElastiCache / Redis (storage): stores ingested event state consumed from EIA Kafka topics
  • Scala with Akka / Pekko (JVM runtime): implementation language for the Scheduler library; Pekko (formerly Akka) provides the actor framework and the pekko-connectors-kafka client abstraction
  • Elixir (BEAM runtime): runtime for NSS, the EIA health check service, and the incident timeline CQRS service
  • pekko-connectors-kafka (Kafka client): producer and consumer abstraction used by the Events API JVM service; source of the August 2025 producer proliferation bug
  • Consul (service discovery): service discovery and health endpoint polling at 10-second intervals for the EIA service
  • Datadog (monitoring): dashboards and metrics for the Scheduler service, tracking task enqueue rates, execution counts, and queue depth
  • Transactional Outbox (architectural pattern): guarantees at-least-once Kafka delivery by persisting messages to a database table before publishing
  • CQRS (architectural pattern): applied to the incident timeline Elixir/Kafka service to separate write and read models

Key contributors

  • David Van Geest (Engineer, PagerDuty): authored the "Distributed Task Scheduling with Akka, Kafka, Cassandra" blog series (April and August 2017); primary designer of the open-source Scheduler library
  • Chris Morris (Senior Software Engineer, PagerDuty): presented "From Propeller to Jet: Upgrading Your Engines Mid-Flight" at Kafka Summit San Francisco 2018, describing the migration from Cassandra queuing to Kafka for event ingestion
  • Elora Burns (Engineer, PagerDuty): authored "Elixir at PagerDuty: Faster Processing with Stateful Services" (June 2020), documenting the NSS Kafka partitioning architecture and performance results
  • Tanvir Pathan (Engineer, PagerDuty): authored "Writing Intelligent Health Checks for Kafka Services" (May 2020), detailing the EIA Kafka consumer health check design and iteration
  • Tamim Khan (Engineer, PagerDuty): authored "How Lookups for Non-Existent Files Can Lead to Host Unresponsiveness" (May 2021), diagnosing the dentry cache issue affecting Kafka hosts in staging
  • Jon Grieman (Senior Software Developer, PagerDuty): presented "Elixir + CQRS: Architecting for Availability, Operability, and Maintainability at PagerDuty" at ElixirConf 2019, covering the Kafka-backed incident timeline service

Key takeaways for your own Kafka implementation

  • Consistent partition routing enables stateful in-memory consumers. NSS's fixed user-to-partition mapping means each Elixir process owns a predictable slice of user state. If you're building a stateful Kafka consumer, design your partition key and partition count before you go live; both are difficult to change once consumers depend on them.
  • Library API design can conceal instantiation behaviour. The pekko-connectors-kafka producer proliferation bug originated from an API where passing a settings object to a method created a new producer silently, without any indication to the caller. When adopting a Kafka client library, verify how it handles producer and consumer lifecycle, particularly whether convenience methods reuse or recreate underlying resources.
  • The Transactional Outbox pattern decouples durability from availability. PagerDuty lost no event data during a 38-minute outage because the business transaction and the Kafka publish were separated by an outbox table. If you cannot tolerate event loss during Kafka unavailability, the pattern is worth the additional complexity of the scraper component.
  • Kafka host instability can originate outside Kafka. The dentry cache exhaustion case shows that Kafka broker unresponsiveness was caused entirely by a third-party application generating filesystem lookups on the same host. Kernel-level profiling (perf top) was required to identify the root cause. If you experience intermittent broker lockups that don't correlate with Kafka metrics, look at what else is running on the broker hosts.
  • Partition count is a provisioning-time decision. The Scheduler documentation recommends setting the count in the "low hundreds" upfront, since it cannot safely be changed once keyed ordering depends on it, and PagerDuty cites an approximately 50,000-partition-per-cluster limit as a design constraint in the NSS work. Plan your partition count before your first production deployment, not as a response to scale pressure.

Sources and further reading

  1. August 28 Kafka outages: what happened and how we're improving — PagerDuty Engineering Blog, September 2025
  2. Writing intelligent health checks for Kafka services — Tanvir Pathan, PagerDuty Engineering Blog, May 2020
  3. Elixir at PagerDuty: faster processing with stateful services — Elora Burns, PagerDuty Engineering Blog, June 2020
  4. Distributed task scheduling with Akka, Kafka, and Cassandra (Part 1) — David Van Geest, PagerDuty Engineering Blog, April 2017
  5. Distributed task scheduling with Akka, Kafka, and Cassandra (Part 2) — David Van Geest, PagerDuty Engineering Blog, August 2017
  6. How lookups for non-existent files can lead to host unresponsiveness — Tamim Khan, PagerDuty Engineering Blog, May 2021
  7. From Propeller to Jet: Upgrading Your Engines Mid-Flight — Chris Morris, Kafka Summit San Francisco 2018
  8. Elixir + CQRS: Architecting for Availability at PagerDuty — Jon Grieman, ElixirConf 2019
  9. PagerDuty/scheduler — GitHub, PagerDuty open-source (design.md, operations.md, user.md)
  10. PagerDuty Kafka Outage (InfoQ coverage) — InfoQ, September 2025

If you're running Kafka in production, Kpow gives you a unified interface for monitoring consumer lag, inspecting topic data, and managing cluster configuration across multiple clusters. You can connect it to any Kafka cluster in minutes and try it free for 30 days.