Sitemap - 2024 - VuTrinh.
Netflix’s Trillion-Scale Real-time Data Infrastructure
The Data Lake, Warehouse and Lakehouse
DoorDash's real-time processing system
We might not fully understand the column store!
How does Vortex, the BigQuery storage engine, work behind the scenes?
I spent 4 hours learning the architecture of BigQuery's storage engine
AutoMQ: Achieving Auto Partition Reassignment In Kafka Without Cruise Control
I spent 4 hours learning how Netflix operates Apache Iceberg at scale.
How does Netflix ensure data quality for thousands of Apache Iceberg tables?
I spent 8 hours relearning the Delta Lake table format
DataHub: The Metadata Platform Developed at LinkedIn
I spent 8 hours learning the ClickHouse MergeTree Table Engine
I spent 3 hours learning the overview of ClickHouse
I spent 3 hours learning how Uber manages data quality.
How AutoMQ Reduces Nearly 100% of Kafka Cross-Zone Data Transfer Cost
I spent 4 hours learning Apache Spark Resource Allocation
I spent 8 hours learning the details of the Apache Spark scheduling process.
I spent 6 hours learning Apache Arrow: Overview
I spent 5 hours exploring the story behind Apache Hudi.
I spent 8 hours researching WarpStream
Why are Apache Spark RDDs immutable?
I spent 8 hours diving deep into Snowflake (again).
I spent 5 hours learning how Google lets us build a Lakehouse.
I spent 5 hours learning how ClickHouse built their internal data warehouse.
I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.
Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond
I spent 6 hours learning how Apache Spark plans the execution for us
A Quick Survey: What Topics Would You Like to Read About?
I spent 7 hours diving deep into Apache Iceberg
How do we run Kafka 100% on object storage?
I spent 8 hours learning Parquet. Here’s what I discovered
How did Discord evolve to handle trillions of data points
How did Facebook design their Real-Time Processing ecosystem
How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?
I spent 4 hours learning Apache Iceberg. Here's what I found.
How does Notion handle 200 billion data entities?
Diving Deep into LinkedIn's Data Infrastructure: My 6-Hour Learning & Key Takeaways
Practical Data Engineering using AWS Cloud Technologies
GroupBy #44: Meta | The Data Stack
Apache Kafka - Important Designs
GroupBy #43: Uber | Kafka - The Tiered Storage
GroupBy #42: PayPal - Scaling Kafka
Procella - The query engine at YouTube
GroupBy #41: Uber’s Batch Data Infrastructure with Google Cloud Platform
How does Uber handle petabytes of Spark shuffle data every day?
GroupBy #40: Data Infrastructure at Airbnb
The Architecture of Apache Druid
GroupBy #39: 2,000+ dbt models in Airflow; Serverless Jupyter Notebooks at Meta
4 Trillion Events Daily at LinkedIn
Everything you need to know about MapReduce
GroupBy #37: Composable data management at Meta; How Uber Accomplishes Job Counting at Scale
How Twitter processes 4 billion events in real-time daily
The Hadoop Distributed File System
All you need to know about the Google File System
I spent 5 hours understanding more about the Delta Lake table format
The stream processing model behind Google Cloud Dataflow
Do we need the Lakehouse architecture?
A Closer Look Into Databricks's Photon Engine
Why did Databricks build the Photon engine?
A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn
GroupBy #28: Tableflow - The Stream/Table, Kafka/Iceberg Duality; Kafka tiered storage deep dive
How does Uber build real-time infrastructure to handle petabytes of data every day?
I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.
If I could travel back 5 years, what would I tell myself about Docker?
I spent 4 hours figuring out how BigQuery executes the SQL query internally. Here's what I found.
I spent another 6 hours understanding the design principles of Snowflake. Here's what I found
GroupBy #21: How to design resilient and large-scale data systems; What Data Modeling is NOT
How Rust and Python manage memory
I spent 6 hours understanding the design principles of BigQuery. Here's what I found
GroupBy #18: Uber - GC Tuning for Improved Presto Reliability; How Meta is advancing GenAI
You don't know this for sure: how does BigQuery store semi-structured data?
Lesson learned after reading the BigQuery academic paper: Shuffle operation
Referral program and things you can expect from this newsletter
GroupBy #16: Uber's Anomaly Detection & Alerting System; the many layers of data lineage
