
How Notion uses Apache Kafka in production
Table of contents
Notion is a productivity and knowledge management platform that lets individuals and teams build wikis, project trackers, databases, and documents in a single workspace. The core data model is block-based: every piece of content, from a paragraph to a table row to an inline image, is a block stored in Postgres.
Notion grew by millions of members in 2021, driven in part by pandemic-era adoption of remote collaboration tools. That growth exposed the limits of its existing data infrastructure: the legacy messaging architecture was not going to scale, and moving large Postgres datasets to analytics systems was taking more than a day end to end.
Notion began building a new data lake infrastructure in spring 2022, completed it by autumn 2022, and has continued extending it since. The same Kafka-based pipeline that solved the analytics latency problem in 2022 now also powers Notion AI search, retrieval-augmented generation (RAG), and real-time product integrations.
Key Kafka milestones:
- Spring 2022: Notion begins building in-house data lake on Debezium, Kafka, Apache Hudi, Spark, and S3
- Autumn 2022: Data lake pipeline completed; ingestion latency drops from more than a day to minutes for most tables
- 2022–2023: Migration to event-driven architecture on Confluent Cloud completed; engineering team reports tripling its productivity
- 2023–2024: Notion AI features (Autofill, AI search, RAG) launched on top of the Kafka-backed data pipeline
- May 2024: Pipeline processing tens of MB/sec of Postgres row changes with minimal operational overhead, delivering net savings of over $1 million per year
- April 2026: Multi-region Kafka architecture for data residency compliance documented publicly
Notion's Kafka use cases
Change data capture for data lake ingestion
The primary use of Kafka at Notion is as the message bus in its CDC pipeline. Debezium captures row-level changes from each Postgres host and publishes them to Kafka. A Spark-based job (Apache Hudi Deltastreamer) consumes those messages and replicates the full state of each Postgres table into S3, where it is available for analytics, AI training, and search indexing.
Before this pipeline existed, batch exports moved data from Postgres to analytics systems in over a day. After the migration, most tables are available within minutes; the largest table (Notion's block table) takes up to two hours. That latency range is acceptable for the analytics and AI workloads, which do not require sub-second freshness.
Event logging for analytics
The logging API endpoint writes analytics events to a regional Apache Kafka cluster when users take actions such as creating a block. Kafka then sinks that data to a region-specific Snowflake table, where it feeds usage analytics and business reporting.
AI and vector database synchronisation
Updates to workspace content are written to a regional Kafka cluster. Spark jobs consume those messages, generate updated embeddings, and write them back to the regional vector database. This keeps Notion AI's search and generation capabilities current as users edit their workspaces.
Adam Hudson, Senior Software Engineer at Notion, described the goal: "Confluent's platform allows us to stream changes as they happen, ensuring AI tools provide relevant information."
Product integrations and real-time notifications
Notion uses Confluent Cloud's managed Kafka to power real-time notifications and data synchronisation with third-party tools including Slack, Jira, and Google Drive. Pre-built Confluent connectors replaced a previous custom connector approach. Ekanth Sethuramalingam, Engineering Lead at Notion, described the old approach as "too expensive and difficult to maintain at scale."
Scale and throughput
As documented by XZ Tie and co-authors (Notion Blog, July 2024) and Ekanth Sethuramalingam and Adam Hudson (Confluent case study):
- Daily active users: 100+ million
- Postgres shards feeding Kafka: 480
- Throughput: Tens of MB/sec of Postgres row changes (as of May 2024)
- Ingestion latency: Minutes for most tables; up to two hours for the block table
- Cost impact: Net savings of over $1 million in 2022, with proportionally higher savings in 2023 and 2024
- Productivity gain: Engineering team tripled its productivity after migrating to Confluent Cloud
Notion's Kafka architecture
Deployment model
Notion runs Confluent Cloud (fully managed Apache Kafka) integrated with AWS. Ekanth Sethuramalingam summarised the decision: "We believe in solving problems unique to Notion. Having our team manage infrastructure was out of the question." Confluent Cloud handles automatic cluster scaling, storage retention adjustments, and broker upgrades without manual intervention.
Debezium CDC connectors are the one component that Notion self-manages, running on AWS EKS (Elastic Kubernetes Service). The team configures one Debezium connector per Postgres host.
Topic design
Notion uses one Kafka topic per Postgres table. All Debezium connectors across all 480 Postgres shards write to the same topic for each table. The alternative, one topic per shard per table, would have created 480 times as many topics and significantly complicated downstream Hudi ingestion. The consolidated approach trades some message-level isolation for substantial reductions in topic count and operational complexity.
Regional Kafka clusters
Notion runs separate Kafka clusters per data-residency region, currently US and EU. Each region has its own Debezium, Kafka, Spark, and Snowflake stack. Customer data is written to its region's Kafka cluster and processed entirely within that region. This design satisfies data residency compliance requirements without requiring message-level filtering or cross-region routing.
Schema management
Confluent Schema Registry is in use for data governance. Notion has also indicated plans to adopt Tableflow, Confluent's feature for materialising Kafka topics into Apache Iceberg or Delta Lake tables.
Data pipeline flows
CDC data lake pipeline:
Postgres (480 shards)
-> Debezium CDC (AWS EKS, one connector per Postgres host)
-> Kafka (one topic per table, Confluent Cloud)
-> Apache Hudi Deltastreamer (Spark, COPY_ON_WRITE + UPSERT)
-> S3 (data lake)
-> Snowflake (Snowpipe or Snowflake Sink Connector)
AI/embeddings pipeline:
Workspace content changes
-> Regional Kafka cluster
-> Spark jobs (embedding generation)
-> Regional vector database
Event logging pipeline:
API endpoint (user action events)
-> Regional Kafka cluster
-> Region-specific Snowflake table
Kafka Connect ecosystem
Debezium (source connector) and Snowflake Sink Connector are the two primary connectors in production. S3 and PostgreSQL connectors are also listed as part of Notion's Confluent Cloud setup.
Special techniques and engineering innovations
Consolidated fan-in from 480 shards into per-table topics
Rather than creating one topic per shard per table, Notion routes CDC events from all 480 Postgres shards into a single topic per table. This reduces the total topic count by a factor of 480 relative to a shard-per-topic design. Consumers, including the Hudi Deltastreamer job, see a unified stream of row changes regardless of which shard originated them.
Apache Hudi with COPY_ON_WRITE and UPSERT for update-heavy workloads
Notion's data is highly update-driven. A collaborative document block can be edited continuously by multiple users. Notion evaluated several open table formats and chose Apache Hudi specifically because of its native integration with Debezium CDC message formats and better performance characteristics on workloads where updates dominate over appends. They use the COPY_ON_WRITE table type with UPSERT operations, which rewrites data files on each batch to maintain a fully queryable S3 table.
Full-stack regional replication for data residency
Notion's approach to data residency is to avoid cross-region data flows entirely rather than filter or redact messages in transit. Each data-residency region runs its own complete Kafka-based pipeline stack. When a user in the EU edits a block, the change is captured locally, written to the EU Kafka cluster, processed by EU Spark jobs, and stored in the EU vector database and Snowflake instance. No portion of that data touches infrastructure outside the EU.
Operating Kafka at scale
Deployment model: Confluent Cloud for the Kafka broker layer; AWS EKS for Debezium connectors. The decision to use a fully managed broker service was explicit: Notion's engineering team is focused on product problems, and managing Kafka broker infrastructure was considered a poor use of that capacity.
Connector stability: Because of the maturity of Debezium and EKS management tooling, Notion has upgraded the EKS and Kafka clusters only a few times over two years of production operation. The authors of the data lake post note this as one of the practical benefits of the chosen stack.
Data freshness monitoring: Notion tracks end-to-end ingestion latency per table. Small tables appear in Snowflake within minutes; the block table, which is the largest and most heavily updated, takes up to two hours. This per-table latency tracking allows the team to detect ingestion slowdowns before they affect downstream analytics or AI features.
Pre-built connector strategy: Rather than building custom connectors to third-party tools, Notion relies on Confluent Cloud's pre-built connector library. This reduces maintenance overhead and allows the data platform team to focus on pipelines that are specific to Notion.
Challenges and how they solved them
Legacy messaging architecture couldn't scale with rapid growth
Notion grew by millions of members in 2021. The existing messaging infrastructure was not built for that scale, and custom connectors to third-party tools required ongoing maintenance investment. Notion migrated to an event-driven architecture on Confluent Cloud. Within approximately one year, the migration was complete, the custom connector approach was retired, and the engineering team reported tripling its productivity.
Batch Postgres exports taking more than a day to reach analytics
Before the data lake migration, moving Postgres data to analytics systems relied on batch exports that took over a day end to end. That lag made it difficult to build analytics and AI features that required reasonably fresh data. Notion built the CDC pipeline with Debezium, Kafka, and Hudi Deltastreamer to replace batch exports with a continuous incremental stream. End-to-end ingestion time dropped to minutes for most tables.
Data residency compliance across regions
Expanding internationally required Notion to guarantee that EU customer data would never leave the EU. A centralised Kafka cluster would require complex per-message routing logic to satisfy this. Notion's solution was to replicate the entire pipeline stack independently in each region. Cross-region data flows are not needed because each region is self-contained.
Full tech stack
Key contributors
- XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao — Co-authors of the data lake architecture post, Notion Blog, 1 July 2024
- Ekanth Sethuramalingam (Engineering Lead, Notion) — Quoted on the migration to Confluent Cloud and the rationale for managed infrastructure, Confluent case study
- Adam Hudson (Senior Software Engineer, Notion) — Quoted on Confluent's role in AI feature delivery; co-authored the multi-region data systems post, Confluent case study and Notion Blog, 9 April 2026
- Justin Lee (Engineering, Notion) — Co-authored the multi-region data systems architecture post, Notion Blog, 9 April 2026
Key takeaways for your own Kafka implementation
- Consolidating multi-shard CDC into per-table topics reduces operational complexity significantly. If you are running CDC from a sharded Postgres setup, routing all shards to a single topic per table rather than a topic per shard per table can reduce your total topic count by an order of magnitude and simplify consumer configuration.
- Choose your table format based on your write pattern, not just ecosystem popularity. Notion evaluated open table formats and selected Apache Hudi specifically because of its update-heavy workload characteristics and native Debezium CDC integration. If your CDC workload is predominantly updates rather than inserts, Hudi's COPY_ON_WRITE UPSERT semantics may suit it better than alternatives.
- Fully managed Kafka shifts the cost from operations to product. Notion's explicit reasoning for choosing Confluent Cloud was that managing Kafka broker infrastructure was not a good use of engineering capacity. Two years in, they had upgraded their clusters only a few times. If your team's core competency is product rather than infrastructure, the operational savings from managed Kafka can be significant.
- Data residency compliance is easier to design in than to bolt on. Notion handles data residency by running entirely independent Kafka stacks per region rather than implementing cross-region filtering. If you are building multi-region infrastructure, structuring each region as a self-contained pipeline from the start avoids the complexity of message-level routing and filtering later.
- Track ingestion latency per table, not just aggregate throughput. Notion monitors end-to-end latency by table rather than relying on a single pipeline-wide metric. This lets the team quickly identify when a specific table is running behind without needing to diagnose the entire pipeline.
Sources and further reading
Primary sources
- XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao, "How Notion build and grew our data lake to keep up with rapid growth" (1 July 2024)
- Ekanth Sethuramalingam, Adam Hudson (quoted), "How Notion Scales its AI and Lowers Operations Costs With Confluent" (Confluent, accessed May 2026)
- Justin Lee, Adam Hudson, "Enabling Multi-Region Data Systems at Notion" (9 April 2026)
Try Kpow with your Kafka cluster
If you are monitoring a Kafka cluster at any scale, you can try Kpow free for 30 days. It connects to any Kafka cluster in minutes and deploys via Docker, Helm, or JAR.