GroupBy #19: How Apple built iCloud to store billions of databases, Palette-Uber feature store, Definition of Data Modeling

Plus: Machine Learning Pipelines with Airflow and MLflow, How to craft the perfect data engineer resume

Jan 23, 2024

This is GroupBy, where I share the resources I learn from, written by people smarter than me in the data engineering field.

👋 Hi, my name is Vu Trinh, a data engineer.
I enjoy reading good stuff (related to data and engineering), and this newsletter is my effort to find the "good stuff" across the Internet.
I hope this issue finds you well.

🎯 Side Project

40+ hours of debugging and you still want some more?

📖┆Build Machine Learning Pipelines with Airflow and Mlflow: Reservation Cancellation Forecasting

✍ Jeremy Arancio

Learn how to create reproducible and ready-for-production Machine Learning pipelines through a Senior Machine Learning assignment

📈 Career

Don't let comfort hold you back.

📖┆How to craft the perfect data engineer resume and LinkedIn profile in 2024

✍ Zach Wilson

2024 is a time to stay nimble in your employee journey and remember that no job is so good that you’re immune to being laid off!

🐙 Learning resource

I love to learn, and I assume you do too.

📖┆Ten new generative AI trainings to upskill in 2024 with Duet AI

Check out our recommended top ten list of short trainings available on Duet AI, for developers, data analysts, cloud engineers, architects, security engineers, and Workspace users.

🚀 Engineering

I have to believe in a world outside my own mind. — Memento (2000)

📖┆Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

✍ Aditya Hegde

With our new data processing framework, we were able to observe a multitude of benefits, including 99.9% request success rates, 78% reduction in customer escalations, and automatic recovery from transient errors. In this post, we will cover unique challenges we faced, our solutions design and architecture, the tech stack used, and the performance results we achieved.

📖┆The Evolution of Enforcing our Professional Community Policies at Scale

✍ Amit Mathapati

In this blog post, we'll go deeper into how we manage account restrictions. We'll talk about the changes we've made over the years to keep up with LinkedIn's growth and scale our infrastructure quickly.

📖┆How Apple built iCloud to store billions of databases

✍ Engineer’s Codex

Apple uses Cassandra and FoundationDB for CloudKit, their cloud backend service. We take a look into how exactly each is used within their cloud and the problems they've solved.

📖┆Shallow Copy For Data: What Are Your Options?

✍ Idan Novogroder

Keep reading to learn more about the concept of data shallow copy and dive into the use cases from Databricks Delta Lake, Iceberg, Snowflake, and lakeFS.
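
If the term is new to you, here is a minimal sketch of my own (not taken from the article or from any of the tools it covers). It assumes a table is just a list of pointers to data files, so a shallow copy duplicates the pointer list and leaves the files themselves alone. The `Table` class, the `shallow_copy` helper, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A toy table: a name plus pointers to the data files that back it."""
    name: str
    files: list[str] = field(default_factory=list)


def shallow_copy(source: Table, new_name: str) -> Table:
    """Copy only the metadata (the list of file pointers), never the file contents."""
    return Table(name=new_name, files=list(source.files))


# Hypothetical paths, invented for illustration.
prod = Table("orders", ["s3://bucket/orders/part-0.parquet",
                        "s3://bucket/orders/part-1.parquet"])
dev = shallow_copy(prod, "orders_dev")

# A write to the copy adds a new file pointer; the source table is untouched.
dev.files.append("s3://bucket/orders_dev/part-2.parquet")

print(len(prod.files), len(dev.files))
```

Both tables still point at the same two original files, which is why this kind of copy is cheap: nothing is duplicated until one side writes new data.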

📖┆Advice to my younger self and you after 20 years in programming

✍ Alexey Inkin

In the first part, I will briefly describe my career for the context. In the second part, I will go through each separate piece of advice that I think would have the strongest impact.

📖┆Lazy is the new fast: How Lazy Imports and Cinder accelerate machine learning at Meta

✍ Germán Méndez Bravo

At Meta, we've been able to significantly improve our model training times, as well as our overall developer experience (DevX) by adopting Lazy Imports and the Python Cinder runtime.

📖┆Continuous Integration

✍ Martin Fowler

I rewrote this article again in 2023 to better address the development teams of that time, with twenty years of experience to confirm the value of Continuous Integration.

📖┆Scalable OLTP in the Cloud: What’s the BIG DEAL?

✍ Murat Demirbas

The motivating question behind this work is: 'What are the asymptotic limits to scale for cloud OLTP (OnLine Transaction Processing) systems?'

📖┆The Scary Thing About Automating Deploys

✍ Sean McIlroy

But what does continuous deployment mean when you’re looking at 150 changes on a normal day?

✏ Data

The one thing that this job has taught me is that truth is stranger than fiction. — Predestination (2014)

📖┆My Definition of Data Modeling (for today)

✍ Joe Reis

What is a data model? I like to ask this question during my conference talks, and answers are all over the place. I’ve never seen a group of people consistently give a single definition. Before I give my working definition, let’s look at a few ways data modeling is defined by some notable experts.

📖┆Measuring data quality: bringing theory into practice

✍ Mikkel Dengsøe

If you're like most people, you don't want to measure data quality for the fun of it. Instead, you have a clear business need, e.g.,

📖┆Introduction to Data Modeling - 2024 Guide With Problems

✍ Deepanshu Tyagi

Data modeling is the process of creating the conceptual representation of data and its relationship within an organisation.

📖┆Data-Driven Proptech: GoodData - Breakthrough in Room Utilization Analytics

✍ Jan Panský

This article delves into a specific Proptech use case, painting a vivid picture of gathering data through strategically placed sensors, processing it meticulously in a robust data pipeline, and ultimately leveraging GoodData’s advanced tools to craft insightful visualizations that redefine how we perceive and optimize workspace environments.

📖┆Every data transform is technical debt

✍ Andrew Jones

The only solution is to reduce the amount of data transformations we do.

📖┆Using Data to Find Growth Levers

✍ Ergest Xheblati

I recently read an article where the Duolingo team reignited user growth to the tune of 350% so I decided to turn it into a case study of how to use data to find growth levers in your business.

🤖 AI┆ML┆Data Science

You know, Burke, I don’t know which species is worse. — Ripley, Aliens (1986)

📖┆Palette Meta Store Journey

✍ Uber Engineering Blog

The Uber Michelangelo feature store, called Palette, is a database of Uber-specific curated and internally crowd-sourced features that are easy to use in machine learning projects.

📖┆New tool, dataset help detect hallucinations in large language models

✍ Xiangkun Hu, Dongyu Ru

Representing facts using knowledge triplets rather than natural language enables finer-grained judgments.
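
To see why triplets allow finer-grained judgments, here is a small sketch of my own. It is not the authors' tool, and the `Triplet` type and `check_claims` helper are hypothetical. The assumption is that each fact is a (subject, predicate, object) tuple, so every claim in a response can be checked on its own instead of judging the whole sentence at once. The reference facts come from the CloudKit blurb above; the wrong claim is made up.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A single fact as (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


def check_claims(claims: list[Triplet], reference: set[Triplet]) -> dict[Triplet, str]:
    """Label each claimed triplet by exact match against the reference facts."""
    return {
        claim: "supported" if claim in reference else "unverified"
        for claim in claims
    }


# Reference facts, plus a hypothetical model response containing one made-up claim.
reference = {
    Triplet("CloudKit", "uses", "Cassandra"),
    Triplet("CloudKit", "uses", "FoundationDB"),
}
response = [
    Triplet("CloudKit", "uses", "Cassandra"),
    Triplet("CloudKit", "uses", "MongoDB"),
]

labels = check_claims(response, reference)
for claim, label in labels.items():
    print(claim, "->", label)
```

Exact matching is a deliberate simplification: it can only say a claim is not found in the reference, which is why the second label is "unverified" and not "hallucinated".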

📖┆A developer’s second brain: Reducing complexity through partnership with AI

✍ Eirini Kalliamvakou

As we look to empower developers with AI tools, we inadvertently integrate AI deeper into the way developers work. How do developers feel about that? And what are the most impactful ways to introduce more AI into workflows? We recently conducted 25 in-depth interviews with developers to understand exactly that.

📖┆Solving the weekly menu puzzle: recommendations at Picnic

✍ Giorgia Tandoi

With that goal in mind, we have recently introduced a brand new recommender algorithm, and in this blog post, we’ll take you behind the scenes: revealing how we do it, what factors we consider, our plans for future enhancements and, most importantly, which lessons we learned.

📖┆An “AI Breakthrough” on Systematic Generalization in Language?

✍ Melanie Mitchell

...this is a very interesting proof-of-principle paper on systematic generalization in neural networks.

🔥 Catch up

…Next Saturday night, we're sending you back to the future! — Dr. Emmett Brown, Back to the Future (1985)

📖┆BigQuery | Cross-cloud joins to run queries that span both Google Cloud and BigQuery Omni regions.

You can use GoogleSQL JOIN operations to analyze data across many different storage solutions, such as AWS, Azure, public datasets, and other Google Cloud services. Cross-cloud joins eliminate the need to copy data across sources before running queries.