GroupBy #16: Uber's Anomaly Detection & Alerting System, many layers of data lineage
Plus: Data modeling side project, Data Engineer roadmap 2024.
This is GroupBy, the place where I share with you the resources I learn from, written by people smarter than me in the data engineering field.
👋 Hi, my name is Vu Trinh, a data engineer.
I enjoy reading good stuff (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet.
Hope this issue finds you well.
🥷 It will steal 37 seconds from you
NEWSLETTER UPDATE.
FOR READERS WHO HAVE ALREADY SUBSCRIBED:
THIS UPDATE WILL NOT AFFECT YOUR READING EXPERIENCE OR THE NUMBER OF EMAILS YOU RECEIVE WEEKLY.
You will still receive only ONE EMAIL EVERY WEEK:
The GROUPBY WEEKLY issue.
(like the one you’re reading)
From the beginning of 2024, I will launch a sub-newsletter that will co-exist with this newsletter. This means my newsletter will contain two sub-newsletters:
GroupBy.
A weekly compilation of data engineering resources (like the one you’re reading).
Every Tuesday
Dimensions.
My blog-style writing about what I've learned in the data engineering field.
Every Saturday
Subscribers who subscribed:
Before January 6, 2024 will receive emails only from GroupBy.
After January 6, 2024 will receive emails from both GroupBy and Dimensions.
Subscribers have control over which newsletters they receive:
Access this link: https://vutr.substack.com/account
Toggle each newsletter ON or OFF to choose which ones you would like to receive emails from.
FOR READERS WHO HAVE ALREADY SUBSCRIBED:
Your “Dimensions” option is turned OFF.
→ You will only receive emails from GroupBy.
🎯 Side Project
40+ hours of debugging and you still want some more?
📖┆Data Modeling Project: Design For Global Superstore Sales
This project's central goal is creating a structured database design that includes a central table of facts and the required dimension tables to establish connections between different elements. This will enable meaningful comparisons and analysis.
I am always looking for a data modeling project. Finally, I found one.
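To make the "central table of facts plus dimension tables" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are my own assumptions for illustration; they are not taken from the project itself.

```python
import sqlite3


def build_star_schema(conn: sqlite3.Connection) -> None:
    """Create two dimension tables and one fact table (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT,
            segment     TEXT
        );
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name       TEXT,
            category   TEXT
        );
        -- The fact table stores measures plus a key into each dimension.
        CREATE TABLE fact_sales (
            sale_id     INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id  INTEGER REFERENCES dim_product(product_id),
            quantity    INTEGER,
            amount      REAL
        );
        """
    )


def load_sample_rows(conn: sqlite3.Connection) -> None:
    """Insert a few made-up rows so the query below has something to read."""
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Ana", "Consumer"), (2, "Ben", "Corporate")],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Chair", "Furniture"), (2, "Desk", "Furniture"), (3, "Pen", "Office Supplies")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
        [(1, 1, 1, 2, 200.0), (2, 2, 2, 1, 350.0), (3, 1, 3, 10, 15.0)],
    )


def sales_by_category(conn: sqlite3.Connection) -> list:
    """Join the fact table to the product dimension and total sales per category."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_sample_rows(connection)
    print(sales_by_category(connection))
```

The point of the layout is that every comparison (by category, by segment, and so on) is one join from the fact table to a dimension, which is what makes the analysis easy to write.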
🐙 Learning resource
I love to learn, and I assume you do too.
🎓┆The Ultimate Roadmap for Data Engineers in 2024
In this blog, we'll reveal the layers of the ultimate roadmap for eager newcomers through the essential skills that define data engineering.
I agree with most steps in this roadmap; I would just add data modeling and dbt to it.
🚀 Engineering
I have to believe in a world outside my own mind. — Memento (2000)
📖┆Understanding Parquet, Iceberg and Data Lakehouses at Broad
I've heard a lot about Avro, Parquet, ORC, Arrow and Feather, but I also keep hearing about Iceberg and Delta Lake. As a "database person", I’ve been struggling to understand all of these different things, and how they relate to Data Lakes and Data Lakehouses (and what exactly are these?). So, I’ve decided to study them, and consolidate my knowledge in writing.
📖┆Deployment of Exabyte-Backed Big Data Components
In this post, we'll explain how we built our RU (rolling update) framework to power a frictionless deployment experience on a large-scale Hadoop cluster, achieving a >99% success rate free from interruptions or downtime and reducing significant toil for our SRE and Dev teams.
📖┆uVitals - An Anomaly Detection & Alerting System
But what about the long tail of issues that lurk in the shadows, sometimes remaining undetected until they cause chaos? For these, traditional strategies may not suffice.
This is where uVitals steps onto the stage, ready to seize the opportunity to detect sooner and detect more.
📖┆Apache Airflow at Adyen: Our journey and challenges to achieve reliability at scale
In this blogpost, we shared a few challenges that we encountered while aiming to achieve reliability at scale at Adyen with Airflow.
📖┆3 years managing Kubernetes clusters, my 10 lessons.
In this article, I wish to share with you the ten most valuable lessons I've learned as a Kubernetes cluster manager.
✏ Data
The one thing that this job has taught me is that truth is stranger than fiction. — Predestination (2014)
📖┆Super Tables: The road to building reliable and discoverable data products
Super Tables (ST) are pre-computed, denormalized, and consistently consolidated attributes and insights of entities or events that are optimized for common and efficient analytic use cases.
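As a rough illustration of "pre-computed and denormalized", the sketch below builds one wide table per customer from two normalized source tables, again with Python's sqlite3. The names `customers`, `orders`, and `st_customer` are hypothetical, and this is only my reading of the definition above, not the approach described in the article.

```python
import sqlite3


def create_source_tables(conn: sqlite3.Connection) -> None:
    """Create small normalized source tables with made-up rows."""
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount      REAL
        );
        INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0);
        """
    )


def build_customer_super_table(conn: sqlite3.Connection) -> None:
    """Precompute one denormalized row per customer so readers skip the join."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS st_customer;
        CREATE TABLE st_customer AS
        SELECT
            c.customer_id,
            c.name,
            COUNT(o.order_id)          AS order_count,
            COALESCE(SUM(o.amount), 0) AS total_spend
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name;
        """
    )


def read_super_table(conn: sqlite3.Connection) -> list:
    """Analysts query the wide table directly, with no joins or aggregation."""
    return conn.execute(
        "SELECT customer_id, name, order_count, total_spend "
        "FROM st_customer ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_source_tables(connection)
    build_customer_super_table(connection)
    print(read_super_table(connection))
```

In practice the rebuild step would run on a schedule, so consumers always read a consistent, already-aggregated table instead of repeating the same joins.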
📖┆How to plan your data roadmap for 2024 - elevating your data strategy
...I wanted to provide some tips to help those either in leadership positions or who want to break into these positions plan out their data roadmap for 2024.
📖┆The many layers of data lineage
In this post we’ll discuss how we can learn from the field of cartography and Google Maps to extract the untapped potential of data lineage, and build this ideal interface to improve data literacy and observability.
📖┆Discovery and Consumption of Analytics Data at Twitter
In this blog, we will discuss the higher-level design and usage of the Data Access Layer, how it fits in within the overall data platform ecosystem, and share some observations and lessons learned.
🤖 AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse. — Ripley, Aliens (1986)
📺┆[1hr Talk] Intro to Large Language Models
And so now, we return to the original question that took us down this long and winding path - should we even care about connecting enterprise data to natural language queries by LLMs?
📖┆How To Train Your Own GenAI Model
If I was to summarize the goal of this article, it's that we're going to learn to light a campfire with a lighter (GPT2) and not a flamethrower (GPT3.5).
📖┆Running demand forecasting machine learning models at scale
This blog post delves into the learnings and challenges on our journey towards implementing and scaling state-of-the-art deep learning approaches. We’ll shed light on how to use the newest machine-learning approaches in a controlled and reliable manner.
📖┆Airbnb at KDD 2023
Airbnb had a significant presence at KDD 2023 with two papers accepted into the main conference proceedings and 11 talks and presentations. In this blog post, we’ll summarize our team’s contributions and share highlights from an exciting week of research talks, workshops, panel discussions, and more.
📖┆Monte Carlo, Puppetry and Laughter: The Unexpected Joys of Prompt Engineering
This article will be an exploration of prompt techniques we’ve used for our internal productivity tooling at Instacart.
Let me hear your voice. For example:
'Your newsletter is so terrible, I can't handle it anymore.'