
How New Relic uses Apache Kafka in production

Factor House
May 16th, 2026

New Relic processes more than 15 million messages per second at an aggregate data rate approaching 1 Tbps, routing every telemetry data point through Apache Kafka from the moment it arrives to the moment it becomes queryable. Across more than 100 independent clusters, Kafka carries the full pipeline: ingestion, transformation, aggregation, and storage.

The engineering challenge New Relic solved was not just throughput. It was building a data pipeline that could absorb billions of data points per minute, isolate failures to a small portion of customers, and scale horizontally without rebuilding from scratch.

Company overview

New Relic is an observability platform used by engineers to monitor application performance, infrastructure, and distributed systems. The platform ingests data from agents running in customer environments and stores it in NRDB (New Relic Database), a proprietary time-series store capable of scanning trillions of events with a median query response time of around 60 milliseconds.

New Relic has been building on Kafka since at least 2018, when principal software engineer Amy Boyle published the company's first detailed account of how Kafka connected the platform's microservices and event processing pipelines. At that stage, the platform ran a single Kafka cluster of more than 275 brokers in its own data centre. By 2020, that cluster had reached the limits of horizontal scaling, which triggered a migration to Amazon Web Services and a shift to a cell-based architecture that now underpins the platform's fault isolation model.

| Date | Event |
| --- | --- |
| 2018 | Amy Boyle publishes the first public engineering account of New Relic's Kafka architecture, covering event processing, the changelog pattern, and the durable cache pattern |
| 2020 | New Relic begins migrating from a single on-premises Kafka cluster (275+ brokers) to a cell-based architecture on AWS using Amazon MSK |
| 2021 | Migration completes; 95% of data ingestion moved to AWS, with more than 100 independent Kafka clusters now in operation |
| 2021-07-29 | Major incident: a Kafka broker failure, conflicting manual and automated recovery, and a retention change propagated across all cells cause a multi-hour data collection and alerting outage for US-region customers |
| 2022-04 | Anton Rodriguez presents the company's eBPF-based Kafka monitoring approach at Kafka Summit London |
| 2024-01 | Tony Mancill's Kafka best practices article is updated, citing 15 million messages/sec and an aggregate data rate approaching 1 Tbps |

New Relic's Kafka use cases

Telemetry data bus for NRDB

Kafka serves as the primary data bus for New Relic's Telemetry Data Platform. As described by Nic Benders, GVP and Chief Architect, Kafka "carries all customer data through each stage of our processing, evaluation, and storage pipeline." Multiple downstream services aggregate, normalise, decorate, and transform telemetry data as it moves through Kafka before reaching NRDB. In normal operation, Kafka also absorbs backpressure: if a downstream stage slows down, Kafka buffers incoming data without dropping it.

Event data processing - Events Pipeline team

The Events Pipeline team handles fine-grained monitoring records: application errors, page views, and e-commerce transactions. Each record flows through a chain of containerised processing services connected via Kafka topics, passing through parsing, query matching, and aggregation stages in sequence. The team owns the partitioning strategy, consumer assignment configuration, and aggregation pipeline design for this data.

Microservice coordination across 40+ engineering teams

New Relic's platform is built by more than 40 engineering teams. Kafka decouples these teams through asynchronous messaging, allowing services to produce and consume data independently without tight synchronous dependencies between producers and consumers.

"Time to glass" pipeline

Engineering teams on the Telemetry Data Platform track "time to glass": how long it takes for a data point ingested by a customer agent to become queryable. Kafka connects each processing stage in this chain, and the pipeline is instrumented with OpenTelemetry distributed tracing to measure latency at each hop.

Scale & throughput

| Metric | Figure | Source |
| --- | --- | --- |
| Messages per second | More than 15 million | Tony Mancill, New Relic engineering blog |
| Aggregate data rate | Approaching 1 Tbps | Tony Mancill, New Relic engineering blog |
| Data points per minute ingested into NRDB | Over 3 billion | Nic Benders; Wendy Shepperd, New Relic engineering blog |
| Data volume per month | 150 PB (at migration, 2021); 200 PB (later post) | Wendy Shepperd; Daniel Kim, New Relic engineering blog |
| Kafka clusters in operation | More than 100 | Anton Rodriguez, Confluent podcast / Kafka Summit London 2022 |
| Brokers in original single cluster | Over 275 | Anton Rodriguez, Confluent podcast / Kafka Summit London 2022 |

Since the AWS migration, clusters are scoped by cell. Each cell contains one Kafka cluster alongside the ingestion service, pipeline services, and NRDB cluster it serves. Customer event data is partitioned by account to support efficient per-customer data locality within each cluster.

New Relic's Kafka architecture

From monolith to cells

Before 2020, New Relic ran its Kafka workload in a single cluster in its own data centre. As described by Wendy Shepperd, GVP of Engineering, the company had "a massive single Kafka cluster in our data center with thousands of nodes processing data." That cluster could not scale further horizontally.

The architectural response was to decompose the platform into independent cells, each a self-contained unit: one Kafka cluster, one ingestion service, one set of data pipeline services, and one NRDB cluster. With approximately 10 cells in the US region, a failure in one cell affects roughly 10% of customers rather than the full region. The 2021 incident (described below) revealed that infrastructure automation had not fully respected cell boundaries, resulting in a configuration change propagating across all cells simultaneously.

Amazon MSK

New Relic migrated to Amazon Managed Streaming for Apache Kafka (MSK) for its cell clusters as part of the AWS transition. Containerised consumer services run on Amazon EKS within each cell. The migration moved 95% of data ingestion to AWS in under one year.

Partition model

For permanent event storage, data is "partitioned by which customer account the data belongs to," supporting efficient per-account retrieval. For the CPU-intensive match service - which must hold all registered queries in memory - random partitioning is used to distribute load evenly across consumer instances without per-key routing logic.
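The contrast between the two strategies can be sketched in a few lines. This is an illustrative model, not New Relic's code: the 12-partition topic and the account IDs are hypothetical, and md5 stands in for Kafka's default murmur2 hash (the locality property only requires that the hash be deterministic).

```python
import hashlib
import random

NUM_PARTITIONS = 12  # hypothetical partition count

def partition_for_account(account_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Keyed partitioning: hash the account id onto a partition, so every
    # event for an account lands in the same place and a per-account read
    # touches one partition instead of all twelve.
    digest = hashlib.md5(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def partition_random(num_partitions: int = NUM_PARTITIONS) -> int:
    # Random partitioning: no per-key routing, just even load spreading --
    # the right choice when consumers (like the match service) hold the
    # same state everywhere and only CPU balance matters.
    return random.randrange(num_partitions)

# The keyed scheme is deterministic per account:
assert partition_for_account("account-42") == partition_for_account("account-42")
assert 0 <= partition_random() < NUM_PARTITIONS
```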

Consumer architecture

The Events Pipeline team manages partition assignment configuration to minimise the cost of rebalances on stateful consumers. StickyAssignor reduces state shuffling during rebalances. CooperativeStickyAssignor (available from Kafka 2.4) allows consumers to continue processing during a rebalance rather than stopping entirely. Static Membership (available from Kafka 2.3) avoids triggering rebalances when clients consistently identify themselves across restarts.
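The knobs involved are standard Kafka consumer properties. The sketch below shows where each feature is configured; the group name, instance id, and timeout values are hypothetical, not New Relic's settings.

```python
# Illustrative Kafka consumer configuration (standard property names).
consumer_config = {
    "group.id": "events-aggregator",  # hypothetical group name
    # Cooperative rebalancing (Kafka 2.4+): only revoked partitions stop;
    # consumers keep processing the partitions they retain mid-rebalance.
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
    # Static membership (Kafka 2.3+): a stable instance id lets a restarted
    # consumer rejoin without triggering a rebalance at all, provided it
    # returns within session.timeout.ms.
    "group.instance.id": "events-pipeline-worker-3",  # hypothetical id
    "session.timeout.ms": 120_000,  # illustrative value
}
```

For stateful consumers, the payoff is that a routine deploy or pod restart no longer forces every instance to rebuild or shuffle its state.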

Special techniques & engineering innovations

Changelog pattern for stateful service startup

Services that maintain in-memory state consume their Kafka topic from the earliest offset on startup, replaying history to reconstruct state. The Queries topic uses a short TTL-based retention window of one hour to keep the replay time bounded. An early misconfiguration - segment.ms not aligned with retention.ms - caused the topic to take minutes to read on startup. Aligning the two values resolved the issue immediately.
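A minimal sketch of the replay step, assuming hypothetical query-registration records where a value of None stands in for a delete marker:

```python
def rebuild_state(changelog):
    # Changelog pattern: on startup, replay the topic from the earliest
    # retained offset, applying each (key, value) record in order to
    # reconstruct the in-memory state the service held before restart.
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone: the key was removed
        else:
            state[key] = value
    return state

# With a one-hour retention window, the replayed history -- and therefore
# the startup time -- stays bounded no matter how long the topic has existed.
changelog = [("query-1", "registered"), ("query-2", "registered"), ("query-1", None)]
assert rebuild_state(changelog) == {"query-2": "registered"}
```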

"Durable cache" pattern

Replaying large volumes of ingest data on every service restart was not practical at New Relic's scale. The solution was to introduce a "durable cache" pattern: stateful services periodically snapshot their in-memory state to a dedicated Kafka topic (a "snapshots" topic with the same partition count as the primary topic). On restart, a service reads the snapshot, then resumes consuming from the offset stored in the snapshot metadata, bypassing the need to replay the primary topic from the beginning.
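A toy version of the pattern, using JSON strings and an in-memory list where the real system uses a snapshots topic and the primary topic:

```python
import json

def write_snapshot(state: dict, next_offset: int) -> str:
    # Periodically persist the in-memory state together with the
    # primary-topic offset it reflects. New Relic writes these to a
    # dedicated "snapshots" topic with the same partition count as the
    # primary topic; JSON here is just a stand-in serialisation.
    return json.dumps({"next_offset": next_offset, "state": state})

def restore(snapshot: str, primary_log: list) -> dict:
    # On restart: load the snapshot, then replay only the records produced
    # after it, instead of the whole primary topic from offset zero.
    snap = json.loads(snapshot)
    state = dict(snap["state"])
    for key, value in primary_log[snap["next_offset"]:]:
        state[key] = value
    return state

primary_log = [("a", 1), ("b", 2), ("c", 3)]
snapshot = write_snapshot({"a": 1, "b": 2}, next_offset=2)  # taken after offset 1
assert restore(snapshot, primary_log) == {"a": 1, "b": 2, "c": 3}
```

Matching the snapshot topic's partition count to the primary topic's means each consumer instance can find the snapshot for exactly the partitions it owns.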

LMAX Disruptor for ingest concurrency

New Relic uses the LMAX Disruptor - "an asynchronous blocking queue that distributes objects to worker threads" - in its ingest consumers to parallelise decompression, deserialisation, and business logic across threads. Each worker thread handles one partition, preserving single-threaded ordering guarantees per partition while maximising CPU utilisation across cores.
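The threading model can be sketched with plain Python queues standing in for the Disruptor's ring buffer (the Disruptor itself is a lock-free Java library, so this only illustrates the partition-to-thread mapping, not its performance characteristics):

```python
import queue
import threading
from collections import defaultdict

# One queue and one worker per partition: records within a partition are
# processed in order by a single thread, while partitions run in parallel.
results = defaultdict(list)
queues = {p: queue.Queue() for p in range(3)}

def worker(partition: int) -> None:
    while True:
        record = queues[partition].get()
        if record is None:  # shutdown sentinel
            return
        # Stand-in for decompression, deserialisation, and business logic.
        results[partition].append(record.upper())

threads = [threading.Thread(target=worker, args=(p,)) for p in queues]
for t in threads:
    t.start()

# A dispatcher (the Disruptor's role) routes each record to its
# partition's dedicated worker.
for partition, record in [(0, "a"), (1, "b"), (0, "c")]:
    queues[partition].put(record)
for q in queues.values():
    q.put(None)
for t in threads:
    t.join()
```

Because partition 0's records only ever pass through partition 0's thread, per-partition ordering survives even though the CPU-heavy work is spread across cores.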

Two-stage aggregation for hotspot resolution

When the Events Pipeline team discovered that the top 1.5% of customer queries accounted for approximately 90% of processed events, it became clear that partitioning the aggregation topic by query ID was creating severe load imbalance. The solution was to split the aggregation service into two stages. Stage one uses random partitioning, distributing all events evenly across consumer instances for parallel partial aggregation. Stage two partitions by query ID, merging the partial results into final outputs. This arrangement significantly condenses stream volume before the final keyed step, eliminating the hotspot problem.
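A toy version of the two stages, with simple event counting as the aggregation and a hypothetical hot query dominating the stream:

```python
import random
from collections import Counter

def stage_one(events, num_workers=4):
    # Stage one: randomly partitioned, so every worker gets an even share
    # of events regardless of which query they belong to. Each worker
    # emits partial aggregates keyed by query id.
    partials = [Counter() for _ in range(num_workers)]
    for query_id in events:
        partials[random.randrange(num_workers)][query_id] += 1
    return partials

def stage_two(partials):
    # Stage two: partitioned by query id, merging the much smaller partial
    # aggregates into final per-query results.
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return totals

# One hot query produces 90% of events, mirroring the imbalance described
# above -- but stage one spreads that load evenly across all workers.
events = ["hot-query"] * 90 + ["q2"] * 5 + ["q3"] * 5
totals = stage_two(stage_one(events))
assert totals["hot-query"] == 90
```

The keyed stage now receives at most one partial per worker per query, rather than every raw event for the hot query, so no single partition is overwhelmed.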

eBPF-based Kafka monitoring via Pixie

With more than 100 Kafka clusters and services written in multiple languages, deploying per-service Kafka instrumentation was impractical. Anton Rodriguez, Principal Software Engineer, presented New Relic's solution at Kafka Summit London 2022: Pixie, a CNCF open-source project that uses eBPF (Extended Berkeley Packet Filter) to observe Kafka traffic at the Linux kernel network layer without any application-level code changes. Pixie measures consumer lag in milliseconds rather than offset counts, identifies rebalancing events through control message analysis, and works across consumer languages on Kubernetes. The approach gives New Relic uniform observability across its full cluster estate without per-service instrumentation overhead.
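The metric itself is simple once timestamps are available; Pixie derives them from kernel-observed traffic rather than application code, but the shape of the calculation is just a timestamp difference:

```python
def lag_ms(latest_produced_ts_ms: int, last_consumed_ts_ms: int) -> int:
    # Offset-based lag says how many records behind a consumer is;
    # time-based lag says how stale its output is, which is what
    # alerting on pipeline health actually cares about.
    return max(0, latest_produced_ts_ms - last_consumed_ts_ms)

# Hypothetical timestamps: the newest record on the partition was produced
# 3.5 seconds after the newest record the consumer has processed.
assert lag_ms(1_700_000_005_000, 1_700_000_001_500) == 3_500
```

A consumer 10,000 offsets behind on a quiet topic may be minutes stale, while one 10,000 offsets behind on a hot partition may be milliseconds stale; the time-based form makes the two distinguishable.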

Operating Kafka at scale

Deployment

New Relic runs Kafka on Amazon MSK, with one MSK cluster per cell. Each cell is an isolated unit: its own MSK cluster, ingestion frontend, processing services, and NRDB cluster. Consumer services run as containerised workloads on Amazon EKS within the same cell.

Monitoring and observability

Consumer lag is monitored in milliseconds via Pixie, providing a time-based view of pipeline health rather than a raw offset count. The full Telemetry Data Platform pipeline is instrumented with OpenTelemetry distributed tracing to track data from agent ingestion through each Kafka-connected transformation stage to the point it becomes queryable in NRDB. This "time to glass" metric gives engineers a latency view across the entire processing chain.

Incident response

The July 2021 incident led to a public retrospective in which New Relic identified five compounding failure points in its incident response tooling and automation: no safety guardrails in emergency tools to prevent dangerous configuration changes; human error under high-pressure conditions; false confidence built up from past tool use; alert noise masking critical disk-space warnings; and automation scope that allowed configuration changes to propagate beyond the affected cell. Post-incident commitments included adding guardrails to tooling and restricting the blast radius of automation to individual cells.

Challenges & how they solved them

Horizontal scaling limit on the monolithic cluster

Problem: The original single on-premises Kafka cluster, with more than 275 brokers, could not scale further horizontally.

Root cause: A monolithic cluster design with no workload isolation between data types, teams, or customers.

Solution: Migration to AWS using Amazon MSK, replacing the single cluster with more than 100 cell-level clusters.

Outcome: 95% of data ingestion moved to AWS in under one year. Each new cell adds capacity independently, and failures are isolated to approximately 10% of customers.

July 2021: broker failure cascades across cells

Problem: A Kafka broker in one US-region cell became unresponsive. Engineers manually initiated a restart while automated remediation triggered simultaneously, extending the degradation window. To preserve data during recovery, engineers extended Kafka retention settings using internal tooling. The retention change was silently applied to all cells by the infrastructure automation, exhausting disk space across the platform. A disk-space alert fired but was missed amid simultaneous unrelated alerts.

Root cause: Five compounding failures: no safety guardrails in emergency tooling; human error under stress; false confidence from prior tool use; alert noise masking a critical signal; and automation scope that bypassed cell isolation.

Outcome: Multi-hour data collection and alerting outage for US-region customers. New Relic published a detailed retrospective committing to tooling and automation improvements. Source: Nic Benders, newrelic.com/blog/best-practices/new-relic-reliability-incident-learning.

Partition hotspots in the aggregation pipeline

Problem: The top 1.5% of customer queries accounted for approximately 90% of processed events. Partitioning the aggregation topic by query ID concentrated this workload on a small number of partitions, causing load imbalance.

Root cause: Uneven distribution of query weight in the customer base.

Solution: Two-stage aggregation pipeline - random partitioning for partial aggregation, then query-ID partitioning for final merge.

Outcome: Hot partitions eliminated; stream volume significantly reduced before the final keyed aggregation stage.

Consumer lag measurement across 100+ clusters

Problem: Instrumenting consumer lag across more than 100 clusters, served by services in multiple languages, was not scalable.

Root cause: Scale and heterogeneity of the cluster estate and application stack.

Solution: eBPF via Pixie, observing Kafka traffic at the kernel network layer without application changes.

Outcome: Consumer lag reported in milliseconds across all clusters and languages; rebalancing events detected from control messages; no per-service instrumentation required.

Log segment misconfiguration causing slow topic consumption

Problem: Applications consuming the Queries topic took minutes to read it on startup, creating unacceptably long initialisation times.

Root cause: segment.ms was not aligned with retention.ms, causing Kafka to create large log segments that contained mostly expired data but still had to be scanned.

Solution: Aligned segment.ms to match retention.ms.

Outcome: Consumption time reduced from minutes to a manageable window.
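A sketch of the corrected topic configuration, assuming the one-hour retention described in the changelog pattern section (the values are illustrative). Retention deletion only applies to closed log segments, so if segment.ms is much larger than retention.ms, the active segment accumulates hours of expired data that startup replays still have to scan.

```python
# Illustrative configuration for a short-retention changelog topic.
# Aligning segment.ms with retention.ms rolls segments on the same
# cadence that data expires, keeping both the log and replays small.
queries_topic_config = {
    "retention.ms": 3_600_000,   # one hour, matching the replay window
    "segment.ms": 3_600_000,     # roll segments hourly so expired
                                 # segments become eligible for deletion
    "cleanup.policy": "delete",
}
assert queries_topic_config["segment.ms"] == queries_topic_config["retention.ms"]
```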

Full tech stack

| Category | Tools | Notes |
| --- | --- | --- |
| Message broker | Apache Kafka | Core data bus for NRDB and all inter-service telemetry routing; one cluster per cell |
| Managed Kafka | Amazon MSK | Kafka hosting for all cell clusters on AWS, adopted during the 2020 migration |
| Container orchestration | Amazon EKS | Kubernetes platform for containerised Kafka consumers and data pipeline services within each cell |
| Kafka monitoring | Pixie (CNCF) | eBPF-based consumer lag and rebalance monitoring across 100+ clusters without per-service instrumentation |
| Distributed tracing | OpenTelemetry | Traces data from agent ingestion through all Kafka-connected processing stages to NRDB for "time to glass" measurement |
| Concurrency library | LMAX Disruptor | Asynchronous blocking queue used by ingest consumers to parallelise decompression, deserialisation, and business logic |
| Database | NRDB | Proprietary time-series database; downstream destination for all Kafka-processed telemetry; stores metrics, events, logs, and traces |

Key contributors

| Name | Title / Team | Contribution |
| --- | --- | --- |
| Amy Boyle | Principal Software Engineer, New Relic | Authored the primary engineering posts on event processing architecture (changelog pattern, durable cache, LMAX Disruptor) and partitioning strategies (two-stage aggregation, hotspot resolution, assignor selection) |
| Tony Mancill | Lead Software Engineer, data ingest team, New Relic | Authored "20 best practices for Apache Kafka at scale," drawing on production experience at 15 million messages/sec; primary source for throughput figures |
| Anton Rodriguez | Principal Software Engineer, New Relic | Presented "Monitoring Kafka Without Instrumentation Using eBPF" at Kafka Summit London 2022; primary source for cluster count (100+), broker count (275+), and the Pixie monitoring approach |
| Nic Benders | GVP and Chief Architect, New Relic | Authored the July 2021 incident retrospective; primary source for cell architecture design and Kafka's role as the platform data bus |
| Wendy Shepperd | GVP of Engineering, New Relic | Primary source for the AWS migration: original monolithic cluster constraints, cell architecture shift, Amazon MSK adoption, and the 150 PB/month scale figure |
| Daniel Kim | Principal Developer Relations Engineer, New Relic | Authored the distributed tracing with OpenTelemetry post; source for the "time to glass" metric and the 200 PB/month data volume figure |

Key takeaways for your own Kafka implementation

  • Cell architecture changes the failure model. New Relic's shift from a single cluster to 100+ cell-scoped clusters reduced the blast radius of any single failure from 100% of customers to around 10%. If you are running a monolithic cluster and approaching scaling limits, decomposing by workload, region, or customer segment can give you both operational leverage and a clearer capacity model.
  • State management at ingest scale requires an explicit design. New Relic developed two patterns - the changelog pattern for small, bounded state and the durable cache pattern for large, unbounded state - because the naive approach of replaying the full topic backlog on restart became impractical. If your Kafka consumers maintain in-memory state, these are worth evaluating before you hit scale limits.
  • Partition hotspots are a distribution problem, not a Kafka problem. The Events Pipeline team's discovery that 1.5% of queries drove 90% of events required a two-stage aggregation design rather than a configuration change. When you see uneven partition load, the partition key is usually the lever to adjust, and two-stage approaches (random partitioning for distribution, then keyed partitioning for correctness) are a repeatable solution.
  • Instrumentation at scale may require kernel-level approaches. Once New Relic crossed 100 clusters across heterogeneous service languages, per-service Kafka instrumentation was no longer a viable path. eBPF via Pixie gave uniform coverage without application changes. If you are managing Kafka at a similar scale, it is worth evaluating whether observability approaches that bypass the application layer are better suited to your environment.
  • Retention changes during incidents carry hidden risk. New Relic's 2021 post-mortem is a useful reference for any team that uses emergency tooling to modify Kafka configuration under pressure. Extending retention to preserve data is a reasonable instinct, but the interaction between retention settings and available disk space, compounded by automation that did not respect cell boundaries, turned a contained failure into a region-wide outage. Guardrails in operational tooling and well-scoped automation blast radius are worth building before you need them.

Sources & further reading

| # | Source | Author | Date |
| --- | --- | --- | --- |
| 1 | Using Apache Kafka for Real-Time Event Processing at New Relic | Amy Boyle, New Relic | 2018 |
| 2 | Effective strategies for Kafka topic partitioning | Amy Boyle, New Relic | Updated 2024 |
| 3 | 20 best practices for Apache Kafka at scale | Tony Mancill, New Relic | Updated January 2024 |
| 4 | Monitoring extreme-scale Apache Kafka using eBPF at New Relic | Anton Rodriguez, New Relic | Kafka Summit London, April 2022 |
| 5 | Our commitment to reliability and incident learning | Nic Benders, New Relic | August 2021 |
| 6 | Transitioning to the cloud: New Relic's journey to AWS | Wendy Shepperd, New Relic | 2021 |
| 7 | Distributed tracing for Kafka with OpenTelemetry | Daniel Kim, New Relic | 2022 |

If you are running Kafka in production, Kpow gives your team a single interface for monitoring consumer lag, inspecting topics, and managing cluster configuration. You can connect it to any Kafka cluster in minutes and try it free for 30 days.