
How New Relic uses Apache Kafka in production
New Relic processes more than 15 million messages per second at an aggregate data rate approaching 1 Tbps, routing every telemetry data point through Apache Kafka from the moment it arrives to the moment it becomes queryable. Across more than 100 independent clusters, Kafka carries the full pipeline: ingestion, transformation, aggregation, and storage.
The engineering challenge New Relic solved was not just throughput. It was building a data pipeline that could absorb billions of data points per minute, isolate failures to a small portion of customers, and scale horizontally without rebuilding from scratch.
Company overview
New Relic is an observability platform used by engineers to monitor application performance, infrastructure, and distributed systems. The platform ingests data from agents running in customer environments and stores it in NRDB (New Relic Database), a proprietary time-series store capable of scanning trillions of events with a median query response time of around 60 milliseconds.
New Relic has been building on Kafka since at least 2018, when principal software engineer Amy Boyle published the company's first detailed account of how Kafka connected the platform's microservices and event processing pipelines. At that stage, the platform ran a single Kafka cluster in its own data centre, with thousands of nodes processing data. By 2020, that cluster had reached the limits of horizontal scaling, which triggered a migration to Amazon Web Services and a shift to a cell-based architecture that now underpins the platform's fault isolation model.
New Relic's Kafka use cases
Telemetry data bus for NRDB
Kafka serves as the primary data bus for New Relic's Telemetry Data Platform. As described by Nic Benders, GVP and Chief Architect, Kafka "carries all customer data through each stage of our processing, evaluation, and storage pipeline." Multiple downstream services aggregate, normalise, decorate, and transform telemetry data as it moves through Kafka before reaching NRDB. In normal operation, Kafka also absorbs backpressure: if a downstream stage slows down, Kafka buffers incoming data without dropping it.
Event data processing - Events Pipeline team
The Events Pipeline team handles fine-grained monitoring records: application errors, page views, and e-commerce transactions. Each record flows through a chain of containerised processing services connected via Kafka topics, passing through parsing, query matching, and aggregation stages in sequence. The team owns the partitioning strategy, consumer assignment configuration, and aggregation pipeline design for this data.
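To illustrate the shape of one link in that chain, here is a minimal consume-transform-produce stage in Java. The topic names, parsing logic, and configuration values are hypothetical; this is only a sketch of a containerised stage that reads from one Kafka topic, transforms each record, and forwards the result to the next topic.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public final class ParsingStage {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "parsing-stage");
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("raw-events"));           // hypothetical input topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String parsed = parse(record.value());        // stage-specific transformation
                    // forward to the next stage in the chain, keyed as received
                    producer.send(new ProducerRecord<>("parsed-events", record.key(), parsed));
                }
                consumer.commitSync();                            // commit only after the batch is forwarded
            }
        }
    }

    static String parse(String raw) { return raw.trim(); }        // placeholder parsing logic
    private ParsingStage() {}
}
```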
Microservice coordination across 40+ engineering teams
New Relic's platform is built by more than 40 engineering teams. Kafka decouples these teams through asynchronous messaging, allowing services to produce and consume data independently without tight synchronous dependencies between producers and consumers.
"Time to glass" pipeline
Engineering teams on the Telemetry Data Platform track "time to glass": how long it takes for a data point ingested by a customer agent to become queryable. Kafka connects each processing stage in this chain, and the pipeline is instrumented with OpenTelemetry distributed tracing to measure latency at each hop.
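New Relic's pipeline measures this with OpenTelemetry traces; as a much simpler stand-in, a consumer can approximate the latency of a single hop from the timestamp attached to each record (assuming the default CreateTime timestamps set by the producer). A minimal sketch:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Approximate one hop of "time to glass": the delta between the timestamp the
// producer attached when the record entered the previous stage and the moment
// this stage processes it. New Relic's actual pipeline uses OpenTelemetry
// distributed tracing; this is only a minimal, timestamp-based stand-in.
public final class HopLatency {
    public static long millisSinceProduced(ConsumerRecord<?, ?> record) {
        return System.currentTimeMillis() - record.timestamp();
    }
    private HopLatency() {}
}
```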
Scale & throughput
Since the AWS migration, clusters are scoped by cell. Each cell contains one Kafka cluster alongside the ingestion service, pipeline services, and NRDB cluster it serves. Customer event data is partitioned by account to support efficient per-customer data locality within each cluster.
New Relic's Kafka architecture
From monolith to cells
Before 2020, New Relic ran its Kafka workload in a single cluster in its own data centre. As described by Wendy Shepperd, GVP of Engineering, the company had "a massive single Kafka cluster in our data center with thousands of nodes processing data." That cluster could not scale further horizontally.
The architectural response was to decompose the platform into independent cells, each a self-contained unit: one Kafka cluster, one ingestion service, one set of data pipeline services, and one NRDB cluster. With approximately 10 cells in the US region, a failure in one cell affects roughly 10% of customers rather than the full region. The 2021 incident (described below) revealed that infrastructure automation had not fully respected cell boundaries, resulting in a configuration change propagating across all cells simultaneously.
Amazon MSK
New Relic migrated to Amazon Managed Streaming for Apache Kafka (MSK) for its cell clusters as part of the AWS transition. Containerised consumer services run on Amazon EKS within each cell. The migration moved 95% of data ingestion to AWS in under one year.
Partition model
For permanent event storage, data is "partitioned by which customer account the data belongs to," supporting efficient per-account retrieval. For the CPU-intensive match service - which must hold all registered queries in memory - random partitioning is used to distribute load evenly across consumer instances without per-key routing logic.
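A minimal sketch of the two partitioning styles, with hypothetical topic names: keying by account ID for the storage path, and sending with a null key so the producer's default partitioner spreads match-service traffic evenly.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class PartitioningExamples {
    // Keyed by account: all events for one account hash to the same partition,
    // keeping per-account data together for the storage path.
    static void sendForStorage(KafkaProducer<String, byte[]> producer, String accountId, byte[] event) {
        producer.send(new ProducerRecord<>("events-by-account", accountId, event));
    }

    // Null key: the default partitioner spreads records across partitions
    // (sticky/round-robin), so the CPU-heavy match service sees an even load.
    static void sendForMatching(KafkaProducer<String, byte[]> producer, byte[] event) {
        producer.send(new ProducerRecord<>("events-for-matching", null, event));
    }

    private PartitioningExamples() {}
}
```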
Consumer architecture
The Events Pipeline team manages partition assignment configuration to minimise the cost of rebalances on stateful consumers. StickyAssignor reduces state shuffling by keeping partitions on their existing consumers across rebalances where possible. CooperativeStickyAssignor (available from Kafka 2.4) lets consumers keep processing the partitions they retain during a rebalance rather than stopping entirely. Static Membership (available from Kafka 2.3) avoids triggering a rebalance at all when a restarted consumer rejoins with the same group.instance.id.
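A sketch of the relevant consumer settings (deserialisers and other configuration omitted). The group and instance names are hypothetical; the assignor and static-membership options are the standard Kafka client settings described above.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import java.util.Properties;

public final class RebalanceFriendlyConfig {
    public static Properties build(String podName) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-aggregator");   // hypothetical group
        // Cooperative rebalancing (Kafka 2.4+): consumers keep working the
        // partitions they retain while only reassigned partitions move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());
        // Static membership (Kafka 2.3+): a stable group.instance.id means a
        // quick restart of the same pod does not trigger a rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, podName);
        return props;
    }
    private RebalanceFriendlyConfig() {}
}
```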
Special techniques & engineering innovations
Changelog pattern for stateful service startup
Services that maintain in-memory state consume their Kafka topic from the earliest offset on startup, replaying history to reconstruct state. The Queries topic uses a short TTL-based retention window of one hour to keep the replay time bounded. An early misconfiguration - segment.ms not aligned with retention.ms - caused the topic to take minutes to read on startup: retention can only delete closed log segments, so with a much longer segment.ms the active segment kept accumulating expired records that still had to be replayed. Aligning the two values resolved the issue immediately.
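As an illustration, a topic with retention.ms and segment.ms aligned to the same one-hour window might be created like this with the Kafka admin client; the topic name, partition count, and replication factor are hypothetical.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public final class QueriesTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            long oneHourMs = 60L * 60L * 1000L;
            NewTopic queries = new NewTopic("queries", 12, (short) 3)   // hypothetical sizing
                .configs(Map.of(
                    // Keep only one hour of history so startup replay stays bounded.
                    TopicConfig.RETENTION_MS_CONFIG, Long.toString(oneHourMs),
                    // Roll segments on the same cadence: retention can only delete
                    // closed segments, so a long segment.ms leaves expired records
                    // in the active segment and slows replay.
                    TopicConfig.SEGMENT_MS_CONFIG, Long.toString(oneHourMs)));
            admin.createTopics(List.of(queries)).all().get();
        }
    }
    private QueriesTopicSetup() {}
}
```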
"Durable cache" pattern
Replaying large volumes of ingest data on every service restart was not practical at New Relic's scale. The solution was to introduce a "durable cache" pattern: stateful services periodically snapshot their in-memory state to a dedicated Kafka topic (a "snapshots" topic with the same partition count as the primary topic). On restart, a service reads the snapshot, then resumes consuming from the offset stored in the snapshot metadata, bypassing the need to replay the primary topic from the beginning.
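A minimal sketch of the restart path under this pattern, assuming hypothetical topic names and a hypothetical Snapshot type that carries the in-memory state together with the primary-topic offset it was taken at:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.List;
import java.util.Map;

public final class DurableCacheRestore {
    // Hypothetical snapshot payload: the cached state plus the primary-topic
    // offset at which it was taken.
    record Snapshot(Map<String, Long> state, long resumeOffset) {}

    static void restore(KafkaConsumer<String, Snapshot> snapshotConsumer,
                        KafkaConsumer<String, byte[]> primaryConsumer,
                        int partition) {
        TopicPartition snapTp = new TopicPartition("events-snapshots", partition);
        TopicPartition primaryTp = new TopicPartition("events", partition);

        // Read the snapshot partition to its end and keep the newest snapshot.
        snapshotConsumer.assign(List.of(snapTp));
        snapshotConsumer.seekToBeginning(List.of(snapTp));
        Snapshot latest = null;
        long end = snapshotConsumer.endOffsets(List.of(snapTp)).get(snapTp);
        while (snapshotConsumer.position(snapTp) < end) {
            ConsumerRecords<String, Snapshot> records = snapshotConsumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, Snapshot> r : records) {
                latest = r.value();
            }
        }

        // Rebuild in-memory state from the snapshot, then resume the primary topic
        // from the stored offset instead of replaying it from the beginning.
        Map<String, Long> state = latest == null ? Map.of() : latest.state();
        primaryConsumer.assign(List.of(primaryTp));
        primaryConsumer.seek(primaryTp, latest == null ? 0L : latest.resumeOffset());
    }
    private DurableCacheRestore() {}
}
```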
LMAX Disruptor for ingest concurrency
New Relic uses the LMAX Disruptor - "an asynchronous blocking queue that distributes objects to worker threads" - in its ingest consumers to parallelise decompression, deserialisation, and business logic across threads. Each worker thread handles one partition, preserving single-threaded ordering guarantees per partition while maximising CPU utilisation across cores.
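The same per-partition threading idea can be sketched without the Disruptor itself, using one single-threaded executor per partition. This is not New Relic's implementation, only an illustration of how pinning each partition to a single worker preserves ordering while spreading work across cores.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class PerPartitionWorkers {
    // One single-threaded executor per partition: records from a given partition
    // are always handled by the same thread, preserving per-partition ordering.
    private final Map<Integer, ExecutorService> workers = new ConcurrentHashMap<>();

    public void run(KafkaConsumer<String, byte[]> consumer) {
        while (true) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, byte[]> record : records) {
                workers.computeIfAbsent(record.partition(),
                                        p -> Executors.newSingleThreadExecutor())
                       .submit(() -> process(record));   // decompress, deserialise, apply logic
            }
            // Offset management omitted: in practice, commit only after the
            // submitted work for those offsets has completed.
        }
    }

    private void process(ConsumerRecord<String, byte[]> record) {
        // stage-specific work goes here
    }
}
```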
Two-stage aggregation for hotspot resolution
When the Events Pipeline team discovered that the top 1.5% of customer queries accounted for approximately 90% of processed events, it became clear that partitioning the aggregation topic by query ID was creating severe load imbalance. The solution was to split the aggregation service into two stages. Stage one uses random partitioning, distributing all events evenly across consumer instances for parallel partial aggregation. Stage two partitions by query ID, merging the partial results into final outputs. This arrangement sharply reduces the volume of data reaching the final keyed step, eliminating the hotspot problem.
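A sketch of stage one under this design, with hypothetical topic names and a simple count standing in for the real partial aggregate:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public final class PartialAggregator {
    // Input arrives on a randomly partitioned topic, so every consumer instance
    // sees a fair share of even the hottest queries. Partials are emitted keyed
    // by query ID so stage two merges far fewer, much smaller records.
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, Long> producer) {
        Map<String, Long> partials = new HashMap<>();   // queryId -> count in this window
        long windowEnd = System.currentTimeMillis() + 10_000;

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                String queryId = record.value();        // assume the payload names the matched query
                partials.merge(queryId, 1L, Long::sum);
            }
            if (System.currentTimeMillis() >= windowEnd) {
                // Flush: one small record per query ID instead of every raw event.
                partials.forEach((queryId, count) ->
                    producer.send(new ProducerRecord<>("partial-aggregates", queryId, count)));
                partials.clear();
                windowEnd = System.currentTimeMillis() + 10_000;
            }
        }
    }
    private PartialAggregator() {}
}
```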
eBPF-based Kafka monitoring via Pixie
With more than 100 Kafka clusters and services written in multiple languages, deploying per-service Kafka instrumentation was impractical. Anton Rodriguez, Principal Software Engineer, presented New Relic's solution at Kafka Summit London 2022: Pixie, a CNCF open-source project that uses eBPF (Extended Berkeley Packet Filter) to observe Kafka traffic at the Linux kernel network layer without any application-level code changes. Pixie measures consumer lag in milliseconds rather than offset counts, identifies rebalancing events through control message analysis, and works across consumer languages on Kubernetes. The approach gives New Relic uniform observability across its full cluster estate without per-service instrumentation overhead.
Operating Kafka at scale
Deployment
New Relic runs Kafka on Amazon MSK, with one MSK cluster per cell. Each cell is an isolated unit: its own MSK cluster, ingestion frontend, processing services, and NRDB cluster. Consumer services run as containerised workloads on Amazon EKS within the same cell.
Monitoring and observability
Consumer lag is monitored in milliseconds via Pixie, providing a time-based view of pipeline health rather than a raw offset count. The full Telemetry Data Platform pipeline is instrumented with OpenTelemetry distributed tracing to track data from agent ingestion through each Kafka-connected transformation stage to the point it becomes queryable in NRDB. This "time to glass" metric gives engineers a latency view across the entire processing chain.
Incident response
The July 2021 incident led to a public retrospective in which New Relic identified five compounding failure points in its incident response tooling and automation: no safety guardrails in emergency tools to prevent dangerous configuration changes; human error under high-pressure conditions; false confidence built up from past tool use; alert noise masking critical disk-space warnings; and automation scope that allowed configuration changes to propagate beyond the affected cell. Post-incident commitments included adding guardrails to tooling and restricting the blast radius of automation to individual cells.
Challenges & how they solved them
Horizontal scaling limit on the monolithic cluster
Problem: The original single on-premises Kafka cluster, with thousands of nodes processing data, could not scale further horizontally.
Root cause: A monolithic cluster design with no workload isolation between data types, teams, or customers.
Solution: Migration to AWS using Amazon MSK, replacing the single cluster with more than 100 cell-level clusters.
Outcome: 95% of data ingestion moved to AWS in under one year. Each new cell adds capacity independently, and failures are isolated to approximately 10% of customers.
July 2021: broker failure cascades across cells
Problem: A Kafka broker in one US-region cell became unresponsive. Engineers manually initiated a restart while automated remediation triggered simultaneously, extending the degradation window. To preserve data during recovery, engineers extended Kafka retention settings using internal tooling. The retention change was silently applied to all cells by the infrastructure automation, exhausting disk space across the platform. A disk-space alert fired but was missed amid simultaneous unrelated alerts.
Root cause: Five compounding failures: no safety guardrails in emergency tooling; human error under stress; false confidence from prior tool use; alert noise masking a critical signal; and automation scope that bypassed cell isolation.
Outcome: Multi-hour data collection and alerting outage for US-region customers. New Relic published a detailed retrospective committing to tooling and automation improvements. Source: Nic Benders, newrelic.com/blog/best-practices/new-relic-reliability-incident-learning.
Partition hotspots in the aggregation pipeline
Problem: The top 1.5% of customer queries accounted for approximately 90% of processed events. Partitioning the aggregation topic by query ID concentrated this workload on a small number of partitions, causing load imbalance.
Root cause: Uneven distribution of query weight in the customer base.
Solution: Two-stage aggregation pipeline - random partitioning for partial aggregation, then query-ID partitioning for final merge.
Outcome: Hot partitions eliminated; stream volume significantly reduced before the final keyed aggregation stage.
Consumer lag measurement across 100+ clusters
Problem: Instrumenting consumer lag across more than 100 clusters, served by services in multiple languages, was not scalable.
Root cause: Scale and heterogeneity of the cluster estate and application stack.
Solution: eBPF via Pixie, observing Kafka traffic at the kernel network layer without application changes.
Outcome: Consumer lag reported in milliseconds across all clusters and languages; rebalancing events detected from control messages; no per-service instrumentation required.
Log segment misconfiguration causing slow topic consumption
Problem: Applications consuming the Queries topic took minutes to read it on startup, creating unacceptably long initialisation times.
Root cause: segment.ms was not aligned with retention.ms. Because retention can only delete closed log segments, the much longer segment.ms left large segments full of expired data that still had to be read on startup.
Solution: Aligned segment.ms to match retention.ms.
Outcome: Consumption time reduced from minutes to a manageable window.
Full tech stack
- Apache Kafka on Amazon MSK - telemetry data bus, one cluster per cell
- Amazon EKS - containerised ingest and pipeline consumer services
- NRDB - proprietary time-series event store
- OpenTelemetry - distributed tracing across the "time to glass" pipeline
- Pixie (eBPF) - kernel-level Kafka traffic observability across 100+ clusters
- LMAX Disruptor - per-partition work distribution inside ingest consumers
Key contributors
- Amy Boyle - Principal Software Engineer; published the 2018 account of Kafka as the backbone of New Relic's event pipelines
- Wendy Shepperd - GVP of Engineering; described the migration from the single on-premises cluster to cell-based clusters on AWS
- Nic Benders - GVP and Chief Architect; authored the July 2021 incident retrospective
- Anton Rodriguez - Principal Software Engineer; presented the Pixie/eBPF Kafka monitoring work at Kafka Summit London 2022
Key takeaways for your own Kafka implementation
- Cell architecture changes the failure model. New Relic's shift from a single cluster to 100+ cell-scoped clusters reduced the blast radius of any single failure from 100% of customers to around 10%. If you are running a monolithic cluster and approaching scaling limits, decomposing by workload, region, or customer segment can give you both operational leverage and a clearer capacity model.
- State management at ingest scale requires an explicit design. New Relic developed two patterns - the changelog pattern for small, bounded state and the durable cache pattern for large, unbounded state - because the naive approach of replaying the full topic backlog on restart became impractical. If your Kafka consumers maintain in-memory state, these are worth evaluating before you hit scale limits.
- Partition hotspots are a distribution problem, not a Kafka problem. The Events Pipeline team's discovery that 1.5% of queries drove 90% of events required a two-stage aggregation design rather than a configuration change. When you see uneven partition load, the partition key is usually the lever to adjust, and two-stage approaches (random partitioning for distribution, then keyed partitioning for correctness) are a repeatable solution.
- Instrumentation at scale may require kernel-level approaches. Once New Relic crossed 100 clusters across heterogeneous service languages, per-service Kafka instrumentation was no longer a viable path. eBPF via Pixie gave uniform coverage without application changes. If you are managing Kafka at a similar scale, it is worth evaluating whether observability approaches that bypass the application layer are better suited to your environment.
- Retention changes during incidents carry hidden risk. New Relic's 2021 post-mortem is a useful reference for any team that uses emergency tooling to modify Kafka configuration under pressure. Extending retention to preserve data is a reasonable instinct, but the interaction between retention settings and available disk space, compounded by automation that did not respect cell boundaries, turned a contained failure into a region-wide outage. Guardrails in operational tooling and well-scoped automation blast radius are worth building before you need them.
Sources & further reading
- Amy Boyle, New Relic engineering blog (2018) - Kafka at the core of New Relic's microservices and event processing pipelines
- Wendy Shepperd - on decomposing the monolithic Kafka cluster into cells during the AWS migration
- Nic Benders, New Relic reliability incident retrospective - newrelic.com/blog/best-practices/new-relic-reliability-incident-learning
- Anton Rodriguez, Kafka Summit London 2022 - eBPF-based Kafka monitoring with Pixie
If you are running Kafka in production, Kpow gives your team a single interface for monitoring consumer lag, inspecting topics, and managing cluster configuration. You can connect it to any Kafka cluster in minutes and try it free for 30 days.