Do we need the Lakehouse architecture?
When data lakes and data warehouses are not enough.
My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works” in data engineering.
Here is a place where I share everything I’ve learned.
Table of contents
Challenges and Context
The Motivation
The Lakehouse Architecture
Intro
I first heard the term “Lakehouse” in 2019 while scrolling through Dremio’s documentation. Being conservative, I assumed it was just another marketing term. Five years later, it seems everybody is talking about the Lakehouse (right after they finish discussing AI :D); all major cloud data warehouses now support reading the Hudi, Iceberg, and Delta Lake formats directly from object storage, and BigQuery even has a dedicated query engine for this task. The innovation doesn’t stop there: Apache XTable (formerly OneTable) provides abstractions and tools for translating Lakehouse table format metadata, and Confluent recently announced TableFlow, which feeds Apache Kafka data directly into the data lake, warehouse, or analytics engine as Apache Iceberg tables. All of this made me re-examine my old assumption: was Lakehouse just a marketing term?
This week, we will answer that question with my notes on the paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
Challenges and Context
The flow of data from data lakes into data warehouses gave me the impression that the data lake concept predates the data warehouse. That is not true: the term “data lake” was only coined in 2011, while “data warehouse” was introduced long before.
Data warehousing was first introduced to help business users get analytical insights by consolidating data from operational databases into a centralized warehouse. Analysts use this data to support business decisions. Data is written with schema-on-write to ensure the data model is optimized for BI consumption. This is the first generation of data analytics platforms.
In the past, organizations typically coupled compute and storage to build data warehouses on-premises. This forced enterprises to pay for more hardware as analytics demand and data size grew. Moreover, data no longer arrives only in tabular form; it can be video, audio, or text documents. This unstructured data caused massive trouble for warehouse systems, which were designed to handle structured data.
Second-generation data analytics platforms came to the rescue. People started putting all the raw data into data lakes: low-cost storage systems with a file interface that hold data in open formats such as Apache Parquet, CSV, or ORC. This approach started with the rise of Apache Hadoop, which used HDFS for storage. Unlike data warehousing, the data lake is a schema-on-read architecture that allows flexibility in storing data, but it creates challenges for data quality and governance. In this approach, a subset of the data in the lake is later ETLed into the warehouse so that analytics users can leverage the warehouse’s power to mine valuable insights (a minimal sketch of this flow follows the challenge list below).

From 2015 onwards, cloud object storage such as S3 or GCS started replacing HDFS, offering superior durability and availability at an extremely low cost. The rest of the architecture stayed mostly the same in the cloud era, with a data warehouse such as Redshift, Snowflake, or BigQuery on top. This two-tier data lake + warehouse architecture dominated the industry at the time the paper was written (and, I would guess, still does). Despite its dominance, the architecture faces the following challenges:
Reliability: Keeping the data lake and the warehouse consistent is difficult and costly, requiring continuous engineering effort to ETL data between the two systems.
Data staleness: The data in the warehouse is stale compared to the lake’s data. This is a step back from the first-generation systems, where new operational data was immediately available for analytics demands.
Limited support for advanced analytics: Machine learning systems such as TensorFlow or XGBoost need to process large datasets with complex programmatic code. Reading this data via ODBC/JDBC is inefficient, and there is no way to access the internal warehouse data formats directly. Warehouse vendors recommend exporting data to files, which further increases the complexity of the whole system. Alternatively, users can run these systems against data lake data in open formats, but then they give up the rich management features of data warehouses, such as ACID transactions and data versioning.
Total cost of ownership: In addition to paying for ETL pipelines, users pay twice for storage because the data is duplicated in the data lake and the data warehouse.
Note from me: The “limited support for advanced analytics” point no longer fully reflects reality, given how heavily major cloud data warehouses like BigQuery, Snowflake, and Redshift now invest in machine learning workloads. Feel free to discuss this with me in the comments if you disagree.
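To ground these challenges, here is a minimal PySpark sketch of the kind of lake-to-warehouse ETL job that the two-tier architecture requires. This is my own illustration, not code from the paper; the bucket path, table names, and JDBC connection details are hypothetical.

```python
# My own illustration of a two-tier ETL job: read raw Parquet from the lake,
# curate a subset, and load it into the warehouse over JDBC.
# Bucket, table, and connection settings are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-warehouse-etl").getOrCreate()

# 1) Read raw events straight from object storage (the data lake).
raw_events = spark.read.parquet("s3a://example-raw-bucket/events/")

# 2) Curate: keep only the rows and columns the BI model needs (schema-on-read).
daily_orders = (
    raw_events
    .filter(F.col("event_type") == "order_completed")
    .select("order_id", "customer_id", "amount", "event_date")
)

# 3) Load the curated subset into the warehouse; this is the second copy of
#    the data that the two-tier architecture pays to store and keep in sync.
(daily_orders.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
    .option("dbtable", "analytics.daily_orders")
    .option("user", "etl_user")
    .option("password", "change-me")
    .mode("append")
    .save())
```

Every row that survives this job exists twice, once as Parquet in the lake and once inside the warehouse, which is exactly the duplication and staleness the challenges above describe.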
Based on these observations, Databricks discusses the following technical question: “Is it possible to turn data lakes based on standard open data formats, such as Parquet and ORC, into high-performance systems that can provide both the performance and management features of data warehouses and fast, direct I/O from advanced analytics workloads?”
They argue that this paradigm, referred to as a Lakehouse, can solve several of the challenges above. Databricks believes the Lakehouse will get more attention thanks to recent innovations that address its fundamental problems:
Reliable data management on data lakes: Like data lakes, the Lakehouse must be able to store raw data and support ETL/ELT processes. Initially, data lakes just meant “a bunch of files” in various formats, which made it hard to offer key management features of data warehouses, such as transactions or rollbacks to older table versions. However, systems such as Delta Lake and Apache Iceberg provide a transactional layer for data lakes and enable these management features. There are fewer ETL steps overall, and analysts can also query the raw data tables quickly and performantly when needed, much as in first-generation analytics platforms (a small sketch follows this list).
Support for machine learning and data science: Because ML systems can read data lake formats directly, they get efficient access to the data in the Lakehouse.
SQL performance: Lakehouses must provide state-of-the-art SQL performance on top of the massive datasets in the lake. Databricks shows that a variety of techniques can maintain auxiliary data and optimize the data layout within these existing formats to achieve competitive performance.
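As a quick illustration of what such a transactional layer adds on top of plain Parquet files, here is a hedged PySpark sketch that writes a Delta Lake table and reads an older version back via time travel. The path and columns are made up, and a Spark session with the Delta Lake package and extensions is assumed.

```python
# Hedged sketch: what a transactional metadata layer (here Delta Lake) adds on
# top of plain Parquet files: atomic commits and versioned, time-travel reads.
# The path and columns are hypothetical; the session must have the Delta Lake
# package and extensions configured.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://example-lake/tables/customers"

# Version 0: the initial load is a single atomic commit to the transaction log.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# Version 1: an append is another commit, never an in-place file mutation.
spark.createDataFrame([(3, "carol")], ["id", "name"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table exactly as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```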
In the following sections, we will learn the motivation, technical designs, and research implications of Lakehouse platforms.
The Motivation
Here are the reasons Databricks believes the Lakehouse architecture could eliminate the shortcomings of today’s data architectures:
Data quality and reliability are the top challenges reported by enterprise data users. Implementing efficient data pipelines is hard, and dominant data architectures that separate the lake and warehouse add extra complexity to this problem.
More and more business applications require up-to-date data, yet two-tier architectures increase data staleness by staging incoming data in a separate area before loading it into the warehouse with periodic ETL/ELT jobs.
A large amount of data is now unstructured.
Data warehouses and lakes do not serve machine learning and data science applications well.
Some current industry trends give further evidence that customers are unsatisfied with the two-tier model:
All major cloud data warehouses have added support for external tables in Parquet and ORC formats.
There is broad investment in SQL engines that run directly against the data lake, such as Spark SQL or Presto (a small example follows below).
However, these improvements solve only part of the problem with the lake-plus-warehouse architecture: the lakes still lack essential management features such as ACID transactions, as well as efficient data access methods that can match warehouse analytics performance.
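For example, Spark SQL can already query Parquet files sitting in object storage without any load step; here is a hedged sketch with a made-up bucket path.

```python
# Hedged example: Spark SQL querying raw Parquet in object storage directly,
# with no load step into a warehouse. The bucket path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-the-lake").getOrCreate()

spark.sql("""
    SELECT event_date, COUNT(*) AS orders
    FROM parquet.`s3a://example-raw-bucket/events/`
    WHERE event_type = 'order_completed'
    GROUP BY event_date
    ORDER BY event_date
""").show()
```

What is still missing at this point are transactions and the data access optimizations discussed next.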
The Lakehouse Architecture
Databricks defines a Lakehouse as a data management system based on low-cost storage that also provides the traditional analytical DBMS management and performance features, such as ACID transactions, versioning, caching, and query optimization. Thus, Lakehouses combine the benefits of both worlds. In the following sections, we will look at the Lakehouse design proposed by Databricks.
Implementation
The first idea they introduce for the Lakehouse implementation is to store data in a low-cost object store (e.g., Google Cloud Storage) using a standard file format such as Apache Parquet, with an additional transactional metadata layer on top that defines which objects belong to which table. The metadata layer lets them implement management features such as ACID transactions while keeping the low-cost advantage of object storage. Candidate implementations of this metadata layer include Delta Lake, Apache Iceberg, and Apache Hudi. Moreover, Lakehouses can boost advanced analytics workloads and help them manage data better by providing declarative DataFrame APIs. Many ML frameworks, such as TensorFlow and Spark MLlib, can already read data lake file formats like Parquet, so the easiest way to integrate them with a Lakehouse is to query the metadata layer to find out which Parquet files are part of a table and pass this information to the ML library.
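As a hedged illustration of that last point, the sketch below uses the standalone deltalake Python package (delta-rs) to ask the metadata layer which Parquet files make up the current version of a table and to materialize them for an ML framework. The table path is hypothetical, and object-store credentials are assumed to be configured in the environment.

```python
# Hedged sketch: ask the metadata layer which Parquet objects form the current
# version of a table, then materialize the data for an ML framework.
# Uses the deltalake (delta-rs) Python package; the path is hypothetical and
# object-store credentials are assumed to be available in the environment.
from deltalake import DeltaTable

dt = DeltaTable("s3://example-lake/tables/training_features")

# The transaction log lists exactly which Parquet files belong to the table
# right now, ignoring stale or uncommitted objects in the same prefix.
parquet_files = dt.file_uris()
print(parquet_files)

# Materialize the table (via Arrow/pandas) and hand it to an ML library,
# e.g., xgboost.DMatrix(features[["amount"]], label=features["label"]).
features = dt.to_pandas()
```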
Metadata Layer
Data lake storage systems such as S3 or HDFS only provide a low-level object store or filesystem interface. Over the years, the need for data management layers on top has emerged, starting with Apache Hive, which keeps track of which data files are part of a Hive table at a given table version.
In 2016, Databricks started developing Delta Lake, which stores the information about which objects belong to which table in the object store itself, as a transaction log in Parquet format. Apache Iceberg, first developed at Netflix, uses a similar design. Apache Hudi, which started at Uber, is another system in this area, focused on streaming ingestion into data lakes. Databricks observes that these systems offer similar or better performance than raw Parquet/ORC data lakes while adding management features such as transactions, zero-copy cloning, and time travel.
One thing to note: these systems are easy to adopt for organizations that already have a data lake. For example, Delta Lake can turn an existing directory of Parquet files into a Delta Lake table without moving data around, simply by adding a transaction log over the existing files. Metadata layers also help implement data quality constraints: Delta Lake’s constraints API lets users declare constraints on new data (e.g., a list of valid values for a specific column), and Delta’s client libraries reject records that violate them. Finally, metadata layers help implement governance features such as access control; for example, the layer can check whether a client may access a table before granting it credentials to read the table’s raw data. A hedged sketch of the first two ideas follows.
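Here is that sketch, using Delta Lake’s SQL surface through PySpark: it converts an existing Parquet directory in place and then attaches a CHECK constraint that the client libraries enforce on new writes. The path, table name, and constraint are illustrative, and a Spark session with the Delta Lake extensions is assumed.

```python
# Hedged sketch: adopt Delta Lake over an existing Parquet directory in place,
# then attach a data quality constraint enforced on every new write.
# Path, table name, and constraint are hypothetical; a Spark session with the
# Delta Lake extensions is assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("adopt-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Add a transaction log over the existing files without rewriting or moving them.
spark.sql("CONVERT TO DELTA parquet.`s3a://example-lake/tables/orders`")
spark.sql(
    "CREATE TABLE orders USING DELTA LOCATION 's3a://example-lake/tables/orders'"
)

# Writes that violate the constraint are rejected by the Delta client libraries.
spark.sql("ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
```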
SQL performance
Although a metadata layer adds management capabilities, it is not enough to reach warehouse-level capability. SQL performance, with the engine running directly on the raw data, may be the most significant technical question for the Lakehouse approach. Databricks proposes several techniques to optimize SQL performance in the Lakehouse, all independent of the chosen data format:
Caching: When using the metadata layer, the Lakehouse system can cache files from the cloud object store on faster devices such as SSDs and RAM.
Auxiliary data: The Lakehouse can maintain auxiliary data alongside the files to optimize queries. In Delta Lake and Delta Engine, Databricks maintains column min-max statistics for each data file, stored in the same Parquet file used for the transaction log, which lets the engine skip unnecessary files during the scanning phase. They are also implementing Bloom-filter-based indexes for data skipping.
Data layout: The Lakehouse can optimize many layout decisions. The first is record ordering: clustering records that are often accessed together so the engine can read them with fewer I/Os. Delta Lake supports ordering records by individual dimensions or by space-filling curves such as the Z-order curve to provide locality across more than one dimension.
These optimizations work well together for the typical access patterns of analytical systems. Typical queries focus on a “hot” subset of the data, which benefits from caching. For “cold” data in a cloud object store, the critical performance factor is the amount of data scanned per query; combining data layout optimizations with auxiliary data structures lets the Lakehouse system minimize that I/O.
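A hedged sketch of the layout and skipping ideas with Delta Lake: it Z-orders a table on two commonly filtered columns and then runs a selective query that only needs to scan the files whose min/max statistics overlap the predicate. The table and column names are illustrative, and OPTIMIZE ... ZORDER BY assumes a recent Delta Lake (or Databricks) runtime.

```python
# Hedged sketch: optimize the data layout with Z-ordering, then run a selective
# query that benefits from per-file min/max statistics (data skipping).
# Table and column names are hypothetical; OPTIMIZE ... ZORDER BY requires a
# recent Delta Lake (or Databricks) runtime with the Delta extensions enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("layout-and-skipping")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Cluster files along the two dimensions most queries filter on, so records
# read together end up in the same files (Z-order space-filling curve).
spark.sql("OPTIMIZE orders ZORDER BY (event_date, customer_id)")

# Only files whose min/max statistics overlap the predicate are scanned;
# the rest are skipped without any I/O.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE event_date = '2020-01-15'
    GROUP BY customer_id
""").show()
```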
Efficient Access for Advanced Analytics
One approach is to offer a declarative version of the DataFrame APIs used in machine learning libraries, which maps data preparation computations into Spark SQL query plans that can benefit from the optimizations in Delta Lake. In implementing the Delta Lake data source, Databricks leverages the caching, data skipping, and data layout optimizations described above to accelerate these reads and, in turn, ML and data science workloads.
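A hedged sketch of that path: a declarative DataFrame read over a Delta table whose filter and column projection are pushed down into the Spark SQL plan (and thus benefit from statistics and layout) before the prepared data is handed to an ML library. The names are illustrative.

```python
# Hedged sketch: declarative DataFrame reads over a Delta table let Spark push
# filters and column projections into the scan, so caching, data skipping, and
# layout optimizations apply before data reaches the ML framework.
# Names are hypothetical; a Delta-enabled Spark session is assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("ml-over-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Declarative data preparation: only the selected columns and the files that
# can match the filter are read, thanks to pushdown plus file statistics.
features = (
    spark.read.format("delta")
    .load("s3a://example-lake/tables/training_features")
    .filter(F.col("event_date") >= "2020-01-01")
    .select("customer_id", "amount", "label")
)

# Hand the prepared data to an ML library; for modest result sizes, via pandas.
pdf = features.toPandas()
# e.g., xgboost.DMatrix(pdf[["amount"]], label=pdf["label"])  # hypothetical
```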
Outro
If a solution solves a real problem, it is not just a cliché term. The Lakehouse was introduced to relieve the pain points of the two-tier architecture: maintaining two separate systems for storage (the lake) and analytics (the warehouse). By bringing analytics power directly to the lake, the Lakehouse paradigm has to deal with the most challenging problem: query performance. Running analytics directly on raw data means the engine doesn’t know much about the data beforehand, which complicates optimization. Thanks to recent innovations in open table formats like Hudi, Iceberg, and Delta Lake, the Lakehouse seems to keep up with the traditional warehouse in the performance competition. It will be exciting to watch whether the Lakehouse rises further, co-exists with the lake-plus-warehouse paradigm, or replaces the two-tier architecture completely; who knows?
Thank you for reading my blog. See you next week ;)
References
[1] Databricks, Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (2020).
Before you leave
Leave a comment or contact me via LinkedIn or Email if you:
Are interested in this article and want to discuss it further.
Would like to correct any mistakes in this article or provide feedback, including writing mistakes like grammar and vocabulary. I happily admit that I'm not so proficient in English :D
It might take you 5 minutes to read, but it took me more than five days to prepare, so it would greatly motivate me if you considered subscribing to receive my writing.