My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works“ in the data engineering field.
Here is a place where I share everything I’ve learned.
Intro
If you've used Notion, you know it lets you do almost everything—note-taking, planning, reading lists, and project management.
Notion isn't rigid; it lets you customize things until they feel right.
Everything in Notion is a block—text, images, lists, database rows, and even pages.
These dynamic units can be transformed into other block types or moved freely within Notion.
Blocks are Notion's LEGOs.
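To make the block model concrete, here is a minimal sketch of what a block record might look like. The field names (id, type, properties, content, parent_id) are illustrative assumptions, not Notion's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Block:
    """Illustrative block record; every piece of content is a unit like this."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "text"                                # e.g. "text", "image", "to_do", "page"
    properties: dict = field(default_factory=dict)    # type-specific payload, e.g. {"title": "Hello"}
    content: list = field(default_factory=list)       # ordered ids of child blocks
    parent_id: Optional[str] = None                   # blocks nest to form pages and databases


# A tiny page made of blocks: the page itself is a block whose children are blocks.
page = Block(type="page", properties={"title": "Reading list"})
todo = Block(type="to_do", properties={"title": "Read the Hudi docs", "checked": False}, parent_id=page.id)
page.content.append(todo.id)

# "Transforming" a block is just changing its type while keeping its identity.
todo.type = "text"
```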
Postgres ruled them all.
Initially, Notion stored all the blocks in the Postgres database.
In 2021, they had more than 20 billion blocks.
Now, the blocks have grown to more than two hundred billion entities.
Before 2021, they put all the blocks in a single Postgres instance.
Now, they shard the database into 480 logical shards and distribute them over 96 Postgres instances, each responsible for 5 shards.
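As a rough illustration of how such a layout can be addressed, here is a hedged sketch of routing a workspace to a logical shard and then to a physical Postgres instance. The hash-based routing and the consecutive shard-to-instance placement are assumptions for illustration, not Notion's actual scheme:

```python
import uuid

NUM_LOGICAL_SHARDS = 480
NUM_PG_INSTANCES = 96
SHARDS_PER_INSTANCE = NUM_LOGICAL_SHARDS // NUM_PG_INSTANCES  # 5 shards per instance


def logical_shard_for(workspace_id: str) -> int:
    """Map a workspace to one of the 480 logical shards (hash-based routing assumed)."""
    return uuid.UUID(workspace_id).int % NUM_LOGICAL_SHARDS


def pg_instance_for(shard: int) -> int:
    """Map a logical shard to one of the 96 Postgres instances, 5 consecutive shards each."""
    return shard // SHARDS_PER_INSTANCE


workspace_id = str(uuid.uuid4())
shard = logical_shard_for(workspace_id)
print(f"workspace {workspace_id} -> shard {shard} -> postgres instance {pg_instance_for(shard)}")
```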
At Notion, Postgres databases handle everything from online user traffic to offline analytics and machine learning.
Recognizing the explosive demands of analytics use cases, especially their recent Notion AI features, they decided to build a dedicated infrastructure for the offline workload.
Fivetran and Snowflake
In 2021, they started the journey with a simple ETL that used Fivetran to ingest data from Postgres to Snowflake, using 480 connectors to write 480 shards to raw Snowflake tables hourly.
Then, Notion would merge these tables into one big table for analytics and machine learning workloads.
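As a hedged sketch of what that merge step could look like, the snippet below rebuilds one unified table from the 480 raw shard tables using the snowflake-connector-python client. The account, database, and table names are made up, and a production pipeline would merge incrementally rather than recreate the table:

```python
# Hypothetical sketch: combine 480 raw shard tables landed by Fivetran into one
# analytics table. Names and credentials are placeholders, not Notion's setup.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ANALYTICS_WH", database="RAW", schema="POSTGRES",
)

shard_tables = [f"RAW.POSTGRES.BLOCK_SHARD_{i:03d}" for i in range(480)]
union_sql = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in shard_tables)

cur = conn.cursor()
# Recreate the merged table; a real pipeline would run incremental MERGE statements instead.
cur.execute(f"CREATE OR REPLACE TABLE ANALYTICS.PUBLIC.BLOCK AS\n{union_sql}")
cur.close()
conn.close()
```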
But this approach had some problems when the Postgres data grew:
Managing 480 Fivetran connectors is a nightmare.
Notion's users update blocks more often than they add new ones. This update-heavy pattern slows Snowflake data ingestion and increases its cost.
Data consumption gets more complex and heavier (e.g., AI workloads).
Notion embarked on building their in-house data lake.
The Lake
They wanted to build a solution that provides the following:
Scalable data repository for storing both raw and processed data.
Fast and cost-efficient data ingestion and computation for any workload, especially their update-heavy block data.
In 2022, they onboarded an in-house data lake architecture that incrementally ingested data from Postgres to Kafka using Debezium, then used Apache Hudi to write from Kafka to S3.
The object storage acts as the endpoint for downstream systems, serving analytics, reporting, and AI workloads.
They used Spark as their primary data processing engine to handle billions of blocks on top of the lake.
Offloading data ingestion and compute workloads from Snowflake helped them reduce costs significantly.
The changes from Postgres are captured by Kafka Debezium Connector and then written to S3 via Apache Hudi.
Notion chose this table format because it performs well with their update-heavy workload and integrates natively with Debezium CDC messages.
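To show why an upsert-capable table format matters for this workload, below is a minimal PySpark sketch that writes CDC-style changes into a Hudi table keyed by block id, so the latest change per key wins instead of piling up duplicates. The S3 path, field names, and table name are assumptions, not Notion's real configuration:

```python
# Minimal Hudi upsert sketch: a block that changes many times is stored once
# with its latest state, which is what makes update-heavy ingestion cheap.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # The Hudi Spark bundle must be on the classpath,
    # e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

changes = spark.createDataFrame(
    [
        ("block-1", "text", '{"title": "v2"}', 1718000000),
        ("block-2", "to_do", '{"checked": true}', 1718000001),
    ],
    ["block_id", "type", "properties", "source_ts_ms"],
)

hudi_options = {
    "hoodie.table.name": "block",
    "hoodie.datasource.write.recordkey.field": "block_id",       # upserts are deduplicated by this key
    "hoodie.datasource.write.precombine.field": "source_ts_ms",  # the latest change per key wins
    "hoodie.datasource.write.operation": "upsert",
}

changes.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-data-lake/block/")
```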
Here is a brief overview of how they operate the solution:
One Debezium CDC connector per Postgres host (a hedged registration sketch follows this list).
Notion deployed these connectors on managed Kubernetes on AWS (EKS).
The connector can handle tens of MB/sec of Postgres row changes.
One Kafka topic per Postgres table.
All connectors consume from all 480 shards and write to the same topic for that table.
They use Apache Hudi Deltastreamer, a Spark-based ingestion job, to consume Kafka messages and write data to S3.
Most data processing jobs were written in PySpark.
They use Scala Spark for more complex jobs. Notion also leverages multi-threading and parallel processing to speed up the processing of the 480 shards.
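Below is a hedged sketch of what registering one such Debezium connector with Kafka Connect could look like, including a routing transform that sends every shard's block table to a single shared topic. Host names, credentials, table names, and topic names are invented, and the exact config keys vary by Debezium version:

```python
# Hypothetical connector registration for one Postgres host; values are placeholders.
import requests

connector = {
    "name": "postgres-host-01-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "pg-host-01.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "***",
        "database.dbname": "notion",
        "topic.prefix": "pg01",                    # "database.server.name" in Debezium 1.x
        "table.include.list": "public.block_.*",   # block table across this host's shards (assumed naming)
        # Route every shard's block table into one shared per-table topic.
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": ".*\\.block_.*",
        "transforms.route.replacement": "postgres.block",
    },
}

resp = requests.post("http://kafka-connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
print(f"Registered connector {connector['name']}: HTTP {resp.status_code}")
```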
The payoff
Offloading data from Snowflake to S3 saved Notion over a million dollars in 2022, with even more significant savings in 2023 and 2024.
The overall ingestion time from Postgres to S3 and Snowflake dropped significantly, from over a day to just a few minutes for small tables and a couple of hours for larger ones.
The new data infrastructure unlocked more advanced analytics use cases and product features, enabling the successful rollout of Notion AI features in 2023 and 2024.
Outro
Thank you for reading to the end. As I delve deeper into how big tech companies build and manage their data analytics infrastructure, I look forward to sharing valuable lessons from my journey. See you in future posts!
If you enjoyed this article, please like and restack it to help more people find it ;)
Now it’s time to consume some cool links I found last week ;)
References
[1] XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao, Building and scaling Notion’s data lake (2024)
📋 The list
✏️ Data products = the future of data engineering
4 minutes, by
In this article, we will talk about: (1) the fine line between machine learning and data engineering, and (2) what a data product is and how data engineers can upskill to deliver these products.
✏️ Can You Even __init__.py?
6 minutes, by Louis Chan
Whenever you try to import your code from a different folder, you throw in an empty __init__.py. But do we really know __init__.py?
✏️ The Top 10 Data Lifecycle Problems that Data Engineering Solves
14 minutes, by Mike Shakhomirov
In this article, I want to tackle some of the biggest challenges data engineers face when working with pipelines throughout the data lifecycle.
✏️ How to implement data quality checks with Great Expectations
8 minutes, by Start Data Engineering
By the end of this post, you will have a mental model of how the Great Expectations library works and be able to quickly set up and run your own data quality checks with Great Expectations.
✏️ Airbnb | Apache Flink® on Kubernetes
10 minutes, by Airbnb Tech Blog
In this blog post, we will delve into the evolution of Flink architecture at Airbnb and compare our prior Hadoop Yarn platform with the current Kubernetes-based architecture.
✏️ Delivering Faster Analytics at Pinterest
6 minutes, by Pinterest Engineering
In this blog post, we’ll discuss and share our experience of launching our Analytics app on StarRocks.
Before you leave
If you want to discuss this further, please leave a comment or contact me via LinkedIn, Email, or Twitter.
It might take you five minutes to read, but it took me more than five days to prepare, so it would greatly motivate me if you considered subscribing to receive my writing.