3 Comments
Apr 29. Liked by Vu Trinh

You write "This two-tier data lake + warehouse architecture dominated the industry at the time of the paper’s writing".

Why do you think this is true? Perhaps it's true for large FAANG companies, but very few of the SMBs I've worked with have more than a warehouse. The only data I could find on this was a Dremio 'State of the Data Lakehouse' survey, but I suspect there's a lot of selection bias going on, since Dremio is a data lakehouse vendor. Even if it's the case that a large number of companies use a lakehouse, saying the architecture 'dominates' strikes me as hyperbole at best and inaccurate at worst.

Thanks for the article!

author

Thank you for your comment. If I understand correctly, you read the sentence "This two-tier data lake + warehouse architecture dominated the industry at the time of the paper’s writing" as saying that the Lakehouse paradigm was dominant. Is that right, or did I miss something? When I wrote that sentence, I meant the architecture in which data first lands in the lake and is then moved by ETL into the warehouse, not the Lakehouse architecture. Am I understanding your point correctly? One more thing: I really appreciate you reaching out and starting a discussion like this; thank you once again. If you want to discuss this further, feel free to leave another comment, or DM me if that is more convenient.

I love the story; my experience with this has been really great. I have seen some more articles on the topic, which are also great resources for Data Engineers.

Key Components of Data Lake Architecture

Data Lakes allow organizations to save a lot of work and time that is usually invested in creating a data structure. This enables fast ingestion and data storage.

Here are a few key components of a robust and effective Data Lake Architectural model:

Governance: This is vital in order to measure performance and improve the Data Lake through monitoring and supervising operations.

Security: This is a key component to keep in mind during the initial phase of architecture. This is different from the security measures deployed for Relational Databases.

Metadata: Metadata is data that describes other data, for example reload intervals and schemas.

Stewardship: Depending on the organization, this role can be assigned to either the owners or a specialized team.

Monitoring & ELT Processes: A tool is required to organize the flow of data moving from the Raw layer through the Cleansed layer to the Sandbox and Application layers, since one might need to apply transformations to the data.

https://hevodata.com/learn/data-lake-architecture-a-comprehensive-guide/
