VuTrinh.

A framework I use to build a data pipeline.

You cannot simply say, "I'll use Spark, Kafka, and so on"; you need to ask clarifying questions to gather information for proposing a robust data pipeline.

Vu Trinh
Dec 02, 2025

I publish a paid article every Tuesday. I write these with one goal in mind: to offer my readers, whether they feel overwhelmed at the start of the journey or are seeking a deeper understanding of the field, 15 minutes of practical lessons and insights on nearly everything related to data engineering.


Intro

In last week’s article, I showed you how to destroy your data pipeline in the most miserable way possible. That was fun to write. I also said that building a robust data pipeline is not easy; there are many things to get clear before starting the process.

However, in an interview or in your daily job, when you are asked to build a data pipeline, you can't just say, “It’s hard”; you have to propose a pipeline that actually works. So I wondered: is there a framework (a set of questions) we can use to gather more information and make the pipeline design and development process more manageable?

In this article, I will list my go-to questions when building a data pipeline. Each question will include the information you are expected to receive.


Before we move on

The ultimate goal of any data pipeline is to move data from location A to location B. Along the way, transformations are applied so that the data arriving at location B is useful to the business.

So, in this article, I'll categorize the questions into three sections: source, sink, and middle steps. Thinking this way helps me separate each component's concerns so I can plan better for the pipeline.

These questions are based solely on my current experience and knowledge, so they might not cover every aspect. If you feel I’m missing something, feel free to comment.


Sink

When building a pipeline, we should begin at the sink. More accurately, we should start with the end users.

Does this data pipeline serve any business purpose?

If it doesn’t serve any purpose, our pipeline is redundant. In the real world, there is a high chance you will build a pipeline that is forgotten a week later because it doesn’t support any business process. Asking this question saves you the time you would otherwise spend on a useless data pipeline.

Does your company have a data model?

This is a critical question as the data model defines many things, from how your output will be constructed to how data quality rules will be applied.

If yes, excellent, follow the data model. If the pipeline loads data to some dims/facts or calculates a metric derived from them, things are simple. However, if you have to deal with a new business flow, you need to work with a business user to model it.

If not, you have to think about data modeling first: not the entire company data model, but at least the entities related to the pipeline you are building, which you can expand incrementally later. That way you still deliver the data pipeline while leaving the door open for data modeling, one of the most critical factors in a company’s data foundation.

What is the shape of the output?

After clarifying the modeling, it will be easier to define the expected output fields. If needed, this question should be answered with the help of business users to clarify which data fields will be included in the output.
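
One way to make the agreed output shape concrete is to write it down as a small, checkable contract. The sketch below is only an illustration: the `OutputField` class, the `DAILY_REVENUE_FIELDS` list, its field names, and the `validate_row` helper are all hypothetical, not something from a specific library or from a real pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OutputField:
    """One field of the expected output (hypothetical contract entry)."""
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical output contract, as it might be agreed with business users.
DAILY_REVENUE_FIELDS = [
    OutputField("order_date", date),
    OutputField("country", str),
    OutputField("revenue", float),
    OutputField("promo_code", str, nullable=True),
]


def validate_row(row: dict, fields: list[OutputField]) -> list[str]:
    """Return a list of problems; an empty list means the row matches the contract."""
    problems = []
    expected = {f.name for f in fields}
    for f in fields:
        if f.name not in row:
            problems.append(f"missing field: {f.name}")
            continue
        value = row[f.name]
        if value is None:
            if not f.nullable:
                problems.append(f"null not allowed: {f.name}")
        elif not isinstance(value, f.dtype):
            problems.append(f"wrong type for {f.name}: {type(value).__name__}")
    # Fields nobody asked for are reported too, in a stable order.
    for extra in sorted(set(row) - expected):
        problems.append(f"unexpected field: {extra}")
    return problems
```

Even a list this small gives you something to review with business users before any transformation code is written, and the same list can later drive data quality checks on the sink.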

How will the output be served?

The output might be a table, a dashboard, a CSV file, an API, a web app, or an ML training dataset. Knowing this helps you better prepare the infrastructure to deliver the output efficiently: Do I need to develop a set of APIs? Where do I store the CSV files? How do I expose the tables?

How old can the data be before it is considered stale?
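
Whatever answer you get, it can be turned into a simple freshness check. This is a minimal sketch under my own assumptions: the `is_stale` function and its parameters are hypothetical, and it assumes you can read a timezone-aware timestamp of the last successful load from somewhere.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_loaded_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the data is older than the agreed maximum age.

    Both datetimes are assumed to be timezone-aware (hypothetical convention).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


# Example: the business accepts data that is at most 24 hours old.
last_load = datetime(2025, 12, 1, 6, 0, tzinfo=timezone.utc)
check_time = datetime(2025, 12, 2, 9, 0, tzinfo=timezone.utc)
print(is_stale(last_load, timedelta(hours=24), now=check_time))  # 27 hours old
```

The useful part is the `max_age` value itself: it has to come from the end users, not from the engineer.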
