Sitemap - 2024 - VuTrinh.

Apache Airflow Overview

Netflix’s Trillions Scale Real-time Data Infrastructure

The Data Lake, Warehouse and Lakehouse

ETL and ELT

Apache Flink Overview

DoorDash's real-time processing system

We might not fully understand the column store!

How does Vortex, the BigQuery storage engine work behind the scenes?

I spent 4 hours learning the architecture of BigQuery's storage engine

AutoMQ: Achieving Auto Partition Reassignment In Kafka Without Cruise Control

I spent 4 hours learning how Netflix operates Apache Iceberg at scale.

How does Netflix ensure the data quality for thousands of Apache Iceberg tables?

I spent 8 hours relearning the Delta Lake table format

DataHub: The Metadata Platform Developed at LinkedIn

I spent 8 hours learning the ClickHouse MergeTree Table Engine

I spent 3 hours learning the overview of ClickHouse

I spent 3 hours learning how Uber manages data quality.

How AutoMQ Reduces Nearly 100% of Kafka Cross-Zone Data Transfer Cost

I spent 4 hours learning Apache Spark Resource Allocation

I spent 8 hours learning the details of the Apache Spark scheduling process.

I spent 6 hours learning Apache Arrow: Overview

I spent 5 hours exploring the story behind Apache Hudi.

I spent 8 hours researching WarpStream

Why Apache Spark RDD is immutable?

I spent 8 hours diving deep into Snowflake (again).

I spent 5 hours learning how Google lets us build a Lakehouse.

I spent 5 hours learning how ClickHouse built their internal data warehouse.

I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.

Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond

I spent 6 hours learning how Apache Spark plans the execution for us

A Quick Survey: What Topics Would You Like to Read About?

The Overview Of Apache Spark

Kubernetes for Data Engineers

I spent 7 hours diving deep into Apache Iceberg

How do we run Kafka 100% on the object storage?

I spent 8 hours learning Parquet. Here’s what I discovered

How did Discord evolve to handle trillions of data points

How did Facebook design their Real-Time Processing ecosystem

How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

I spent 4 hours learning Apache Iceberg. Here's what I found.

How does Notion handle 200 billion data entities?

Diving Deep into LinkedIn's Data Infrastructure: My 6-Hour Learning & Key Takeaways

Netflix Data Engineer Stack

Apache Kafka - Consumer

Practical Data Engineering using AWS Cloud Technologies

Apache Kafka - Producer

GroupBy #44: Meta | The Data Stack

Apache Kafka - Important Designs

GroupBy #43: Uber | Kafka - The Tiered Storage

Apache Kafka - Overview

GroupBy #42: Paypal - Scaling Kafka

Procella - The query engine at YouTube

GroupBy #41: Uber’s Batch Data Infrastructure with Google Cloud Platform

How does Uber handle petabytes of Spark shuffle data every day?

GroupBy #40: Data Infrastructure at Airbnb

The Architecture of Apache Druid

GroupBy #39: 2000+ DBT models in airflow; Serverless Jupyter Notebooks at Meta

4 Trillion Events Daily at LinkedIn

GroupBy #38: Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform, Apache Iceberg - What Is It

Everything you need to know about MapReduce

GroupBy #37: Composable data management at Meta, How Uber Accomplishes Job Counting At Scale

How Twitter processes 4 billion events in real-time daily

GroupBy #36: Agoda - How We Solve Load Balancing Challenges in Apache Kafka, How to reduce your Snowflake cost

The Hadoop Distributed File System

GroupBy #35: The Netflix Data Engineering Stack, Atlassian - Evolve the data platform with a Deployment Capability

All you need to know about the Google File System

GroupBy #34: Hybrid Transactional/Analytical Storage, From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey

I spent 5 hours understanding more about the Delta Lake table format

GroupBy #33: Data Gateway - A Platform for Growing and Protecting the Data Tier at Netflix, The Cloud Storage Triad: Latency, Cost, Durability

The stream processing model behind Google Cloud Dataflow

GroupBy #32: Canva - Scaling to Count Billions, Ensuring Precision and Integrity: A Deep Dive into Uber’s Accounting Data Testing Strategies

Do we need the Lakehouse architecture?

GroupBy #31: Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore, Grab Experiment Decision Engine

A Closer Look Into Databricks's Photon Engine

GroupBy #30: Uber - How LedgerStore Supports Trillions of Indexes, Composable Data Systems: Lessons from Apache Calcite Success

Why did Databricks build the Photon engine?

GroupBy #29: Scaling AI/ML Infrastructure at Uber, The Sisyphean struggle and the new era of data infrastructure

A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn

GroupBy #28: Tableflow - The Stream/Table, Kafka/Iceberg Duality, Kafka tiered storage deep dive

How does Uber build real-time infrastructure to handle petabytes of data every day?

GroupBy #27: Balancing HDFS DataNodes in the Uber DataLake, How Figma’s databases team lived to tell the scale

I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.

GroupBy #26: How GitHub uses merge queue to ship hundreds of changes every day, Data governance in the age of generative AI, "Good Enough" Data Models

If I could travel back to 5 years ago, what would I talk to myself about Docker?

GroupBy #25: From Samza to Flink: A Decade of Stream Processing, DoorDash’s In-House Search Engine, Meta's DotSlash, Designing Metrics Trees

I spent 7 hours reading another paper to understand more about Snowflake's internal. Here's what I found.

GroupBy #24: Enabling near real-time data analytics on the data lake at Grab, Aligning Velox and Apache Arrow at Meta.

I spent 4 hours figuring out how BigQuery executes the SQL query internally. Here's what I found.

GroupBy #23: Meta loves Python, How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache

I spent 3 hours figuring out how BigQuery inserts, deletes and updates data internally. Here's what I found.

GroupBy #22: Data Engineering Landscape in 2024, how I scaled my $1m/year revenue startup's data model

I spent another 6 hours understanding the design principles of Snowflake. Here's what I found

GroupBy #21: How to design resilient and large scale data systems, What Data Modeling is NOT

I made 1+1=0 in DuckDB

GroupBy #20: How Google takes the pain out of code reviews, The Difficulties of Senior Engineer are not Engineering

How Rust and Python manage memory

GroupBy #19: How Apple built iCloud to store billions of databases, Palette-Uber feature store, Definition of Data Modeling

I spent 6 hours understanding the design principles of BigQuery. Here's what I found

GroupBy #18: Uber - GC Tuning for Improved Presto Reliability, How Meta is advancing GenAI

You don't know this for sure: How BigQuery stores semi-structured data?

GroupBy #17: Pinterest’s new wide column database using RocksDB, Fault tolerance Kafka on Kubernetes at Grab

Lesson learned after reading the BigQuery academic paper: Shuffle operation

Referral program and things you can expect from this newsletter

GroupBy #16: Uber's Anomaly Detection & Alerting System, many layers of data lineage