The Data Lakehouse Is On the Horizon, But It’s Not Smooth Sailing Yet

by Eric Lee, Principal & Rich Ellinger, Chief Technology Officer, at Cota Capital

Published on Feb. 6, 2024 by Datanami

Data warehouses and data lakes serve clear and distinct purposes. Typically, data warehouses store structured data according to a predefined schema to generate fast query speeds for reporting purposes. Data lakes, on the other hand, store and process diverse data types, including unstructured data, and support advanced analytics, data discovery, and AI and ML workloads.

Recently, the concept of a “data lakehouse” has emerged to combine the best of both these worlds.

In theory, a data lakehouse obviates the necessity of using two separate systems for data storage and analytics. It would integrate the two, eliminating the need to move data between systems and enabling querying across all sets of data seamlessly. In addition, as companies seek to leverage the benefits of AI, a data lakehouse can offer AI models a single source of truth and a more comprehensive view of the data. A data lakehouse would also cut costs. Enterprise customers today complain that expenses are skyrocketing because they must pay hefty prices to use both a data warehouse and a data lake.

Naturally, vendors like Snowflake (a leader in data warehousing) and Databricks (a leader in data lakes) are eager to expand into each other’s fast-growing markets, and the competition is only intensifying as companies vie for AI/ML workloads. Together, these sectors are expected to grow at a 25% CAGR from 2022 to 2026, which is 1.7 times faster than the rate of the overall data analytics market. At the expected growth rates, the combined markets are poised to become the largest segment within data analytics, surpassing spending on both relational and non-relational databases. Already, both these companies are actively developing products and technology to expand capabilities and move into the other’s core domain in their quest to become a data lakehouse.

We’re not there yet

But while the idea of a lakehouse is appealing, it may be more of a vision than a reality at this point. Yes, combining the querying speeds of data warehouses with the data structure flexibility of data lakes would be a game-changer. The problem, however, is that their underlying architectures are structurally different.

Efforts have been made to enable the transition of data lakes to data lakehouses through the development of specific technologies. One such advancement involves new query engine designs that facilitate high-performance SQL execution on data lakes. These query engine accelerators create a software layer above open table formats like Delta Lake, Apache Hudi, and Apache Iceberg, and bring improved performance that approaches the querying speeds of data warehouses.

However, a limitation of these query engine accelerators is their tendency to falter under the strain of thousands of concurrent users attempting to access the same data. This scalability issue could hinder their widespread adoption and utility in large-scale enterprise scenarios. So, while these query engines can significantly enhance the value of data lakes, they are unlikely to completely replace the functionality of data warehouses.

Data warehouses, for their part, are adopting open table formats to enable data lake capabilities and facilitate the transition to data lakehouses. For instance, AWS and Google Cloud use the open table format Apache Iceberg for their “data lake engine.” They store unstructured data in S3 or Google Cloud Storage, while structured data resides in Redshift or BigQuery.

Snowflake, meanwhile, is attempting to eliminate the need for Databricks by processing Spark data directly on its platform through Snowpark. The reality, however, is that Snowflake has not yet achieved feature parity with Databricks. In particular, Databricks remains superior in its core areas because of its development of use-case-specific engine accelerators.

Another key drawback of the data lakehouse concept is vendor lock-in. The reality is that most companies do not want to become heavily dependent on a sole technology provider for their data storage, processing, and analytics needs. This dependency can limit an organization’s flexibility in the long run, because it’s challenging to switch to other vendors without significant effort, cost, and potential disruption to operations.

Who will get to the lakehouse first?

While there is a real desire to create a data lakehouse given the potential benefits of a single platform, there is no clear consensus about whether data lakes or warehouses are best positioned to achieve the lakehouse paradigm first.

Some believe that cloud data warehouses have solved the toughest problem of data concurrency, allowing thousands of users access to data simultaneously. Others posit that it is easier to layer in data optimization than to replicate data flexibility, providing data lakes with an advantage.

So, while the concept of a data lakehouse remains attractive, it is our belief that customers will continue to run data lake and data warehouse technologies in parallel for the foreseeable future.