GroupBy #9: FDAP stack, Iceberg and Hudi ACID Guarantees, Data Driven Management
Plus: Uber data analytics side project, dbt learning resources
This is GroupBy, the place where I share the resources I learn from people smarter than me in the data engineering field.
👋 Hi, my name is Vu Trinh, a data engineer.
I enjoy reading good stuff (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet.
Hope this issue finds you well.
🎯 Side Project
40+ hours of debugging and you still want some more?
To get your hands dirty (even more), this week I bring you a project:
🚖 Uber Data Analytics | End-To-End Data Engineering Project
The goal of this project is to perform data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.
— From Author’s Github repo —
My suggestions to make life harder:
Teach yourself data modeling: concepts like SCD types, the Kimball data modeling approach, the differences between the Kimball and Inmon approaches,…
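If "SCD type" is new to you, here is a minimal Python sketch of the Type 2 idea: instead of overwriting a changed attribute, you close the old row and append a new one, so history is kept. Everything in it (the `DriverRow` class, the `apply_scd2` helper, the column names, and the sample cities) is hypothetical and made up for illustration; it is not taken from the Uber project.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DriverRow:
    # Hypothetical dimension row; column names are illustrative only.
    driver_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "this is the current version"


def apply_scd2(history: List[DriverRow], driver_id: int,
               new_city: str, change_date: date) -> List[DriverRow]:
    """Return a new history with the change recorded the SCD Type 2 way."""
    updated = []
    changed = False
    found_current = False
    for row in history:
        is_current = row.driver_id == driver_id and row.valid_to is None
        if is_current:
            found_current = True
            if row.city != new_city:
                # Close the old version instead of overwriting it.
                updated.append(replace(row, valid_to=change_date))
                changed = True
                continue
        updated.append(row)
    if changed or not found_current:
        # Open a new current version (or the first version for a new driver).
        updated.append(DriverRow(driver_id, new_city, change_date))
    return updated


# Sample data below is invented for the example.
history = [DriverRow(1, "Hanoi", date(2023, 1, 1))]
history = apply_scd2(history, 1, "Da Nang", date(2023, 6, 1))
```

After the call, `history` holds two rows for driver 1: the closed "Hanoi" version and the current "Da Nang" version. A Type 1 dimension would simply overwrite `city` and lose the first row.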
🐙 Learning resource
I love to learn, and I assume you do too.
dbt is a popular tool for transforming and modeling data.
Learning dbt is essential for streamlining data processes, ensuring data quality, and accelerating analytics development, making it a valuable skill for anyone involved in data analysis and management.
Here are some (FREE) learning resources:
🎓 | dbt Fundamentals
Learn the Fundamentals of dbt including modeling, sources, testing, documentation, and deployment. (approximately 5 hours)
🎓 | Jinja, Macros, Packages
Extend the functionality of dbt with Jinja/macros and leverage models and macros from packages. (approximately 2 hours)
🎓 | Advanced Materializations
Learn about the advanced materializations built into dbt Core - ephemeral models, incremental models, and snapshots. (approximately 2 hours)
🎓 | Refactoring SQL for Modularity
Learn with the analytics engineers of dbt Labs how to migrate legacy transformation code into modular dbt data models. Useful if you're porting stored procedures or SQL scripts into your dbt project. (approximately 3.5 hours)
🎓 | Advanced Testing
Learn more about the theory of data testing and the practice of creating custom generic tests, leveraging tests in packages, and applying test configurations. (approximately 4 hours)
That is approximately 16.5 hours for you to understand that “dbt is not just a SQL generator”.
🚀 Engineering
Engineering is the practice of using natural science, mathematics, and the engineering design process to solve technical problems, increase efficiency and productivity, and improve systems. — wikipedia
📖┆Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0
📖┆Iceberg and Hudi ACID Guarantees┆Tabular
In this post, I make the case that Iceberg is reliable and Apache Hudi is not.
📖┆Running Unified PubSub Client in Production at Pinterest
In a distributed PubSub environment, complexities related to client-server communication can often be hard blockers for application developers, and solving them often require a joint investigation between the application and platform teams.
📖┆How we Built the Ingestion Framework┆OpenMetadata
Without metadata, there are no discovery, collaboration, or quality tests. The ingestion process is a requirement that unlocks the rest of the features, and we are constantly pushing for improvements.
📖┆Scheduling Jupyter Notebooks at Meta
At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals.
✏ Data
The one thing that this job has taught me is that truth is stranger than fiction.
— Predestination (2014)
📖┆Data Driven Management: The Why, Who, What and How?
📖┆Going All-In On Data Quality
A principle that I think is useful to follow when it comes to data quality is the idea of staging tables
📖┆Data Quality ≠ Data Trust: Bridging the Data Trust Gap
✍ Prukalpa
A broken pipeline. A source system gone down. A change made to a column name. Three unique root causes, but the same end result: broken trust.
📖┆The Clash Between Data Quality and AI: Unisphere’s Latest Findings
Data quality issues have been a looming threat for any and all enterprises, often surfaced by the proliferation of new data analytics and AI projects that, incidentally, rely on good data to succeed.
📖┆5 Signs That Your Data is Modeled Poorly
To be able to model your team's data properly, you need to be able to conceptualize relevant business entities and organize them in a way that is conducive to common questions asked within your organization.
🤖 AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse.
— Ripley, Aliens (1986)
📖┆The architecture of today’s LLM applications┆GitHub
📖┆What I’m Reading on the Rise of Artificial Intelligence
…I wanted to share some of the books, articles, and podcasts that have helped shape my perspective over the past year. This list offers a range of viewpoints on the threats, opportunities, and challenges posed by AI and some thoughtful ideas on how to respond.
📖┆AI ‘breakthrough’: neural net has human-like ability to generalize language
Scientists have created a neural network with the human-like ability to make generalizations about language.
📖┆Harvard professor Lawrence Lessig on why AI and social media are causing a free speech crisis for the internet
After 30 years teaching law, the internet policy legend is as worried as you’d think about AI and TikTok — and he has surprising thoughts about balancing free speech with protecting democracy.
🔥 Catch up
…Next Saturday night, we're sending you back to the future!
— Dr. Emmett Brown, Back to the Future (1985)
[📖] Airflow┆Release of Airflow 2.7.3
[📖] BigQuery┆Work with text analyzers
[📖] Spark┆Arrow-optimized Python UDFs in Apache Spark™ 3.5
[📖] Google Cloud┆Cloud Functions now supports the Python 3.12 runtime.
[📖] Snowflake┆Search Optimization: Support for Substring Search in Semi-Structured Data
🚨 The next section contains my own writing. Don't blame me if you feel distressed after reading it; you chose to read it, although you can skip it without thinking twice.
🥷 It will steal 97 seconds from you
Random thoughts, ideas.
The hardest truth I’ve learned as a data engineer is this: No matter how fancy your pipeline or infrastructure is, if your data foundation doesn't have the ability to support the business, everything you do is just 💩.
You put in all your effort to deliver an internal tool to support analytics, but nobody uses it.
Your tool is 💩.
You tune your SQL script to run 2.5x faster, but the data output is “wrong” and leads to “really bad” decisions.
Your SQL script is 💩.
The lesson here: whatever you do, if you want it to bring value (so that you can lead a meaningful life), make sure it helps your “customer” solve problems.
Put yourself in your customer’s shoes.
Before developing an internal tool, sit down and talk to your DAs and DSs.
When developing a data pipeline, talk to the business to help define “constraints” and “rules” to control the quality and correctness of your data.
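To make “constraints” and “rules” a bit more concrete, here is a minimal Python sketch of how agreed rules might be written down and checked against incoming rows. The rule names, the trip fields, and the sample rows are all hypothetical and invented for illustration; real rules would come out of the conversation with the business.

```python
from typing import Callable, Dict, List

# Hypothetical rules agreed with the business; names and fields are made up.
Rule = Callable[[dict], bool]

RULES: Dict[str, Rule] = {
    "fare_is_non_negative": lambda trip: trip["fare_amount"] >= 0,
    "passenger_count_is_positive": lambda trip: trip["passenger_count"] > 0,
    "dropoff_after_pickup": lambda trip: trip["dropoff_ts"] >= trip["pickup_ts"],
}


def validate(trips: List[dict]) -> Dict[str, List[int]]:
    """Return, for each rule, the indexes of the rows that break it."""
    failures: Dict[str, List[int]] = {name: [] for name in RULES}
    for i, trip in enumerate(trips):
        for name, rule in RULES.items():
            if not rule(trip):
                failures[name].append(i)
    return failures


# Invented sample rows: the first is clean, the second breaks every rule.
trips = [
    {"fare_amount": 12.5, "passenger_count": 1, "pickup_ts": 100, "dropoff_ts": 160},
    {"fare_amount": -3.0, "passenger_count": 0, "pickup_ts": 200, "dropoff_ts": 150},
]
report = validate(trips)
```

Keeping each rule as a named entry means the report can be read by the people who defined the rule, not only by the engineer who wrote the pipeline.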
So, to apply this lesson and save this newsletter from being 💩…
…I need you…
…yes, you, the “customers” of this newsletter.
I need your feedback on what I should improve and what you expect from this newsletter; it will also help me grow as a DE.
(Leave it in the comment section, or contact me directly by email or LinkedIn.)
I will adjust my work.
Promise. (Unless your ideas are too “wild”.)
Switching context from “your DE work is 💩 if …” to “I need your feedback” is… weird.