What is an open data lakehouse, and why should you care?

A data lakehouse is an emerging data management architecture that converges data warehouse and data lake capabilities, driven by the need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Why is a data lakehouse architecture becoming increasingly important?

To start, many organizations demand a better return on their datasets to improve decision-making. However, most of their raw data remains unused and trapped in data silos, making it difficult to extract insights.

With the prohibitive costs of high-performance analytics solutions such as cloud data warehouses, and the performance challenges of legacy data lakes, neither data warehouses nor data lakes satisfy organizations’ needs for analytical flexibility, price-performance and manageability.

The rise of cloud object storage has driven the cost of storage down, and new technologies have evolved to access and query the data stored there more efficiently. A data lakehouse platform takes advantage of this low-cost storage and leverages modern query engines to provide warehouse-like performance.

These new technologies and approaches, along with the desire to reduce data duplication and complex ETL pipelines, have resulted in a new data platform architecture known as the data lakehouse – offering the flexibility of a data lake with the performance and structure of a data warehouse.

Essential components of a data lakehouse architecture and what makes an open data lakehouse

The core of a data lakehouse architecture includes the storage, metadata service and query engine, typically complemented by a data governance component made up of a policy engine and a data dictionary.

A data lakehouse strives to provide customers with flexibility and options. An open data lakehouse takes this even further by building on open-source technologies such as Presto, enabling open governance and giving data scientists and data teams the option to keep the lakehouse components already in place while extending or adopting new ones as business intelligence needs evolve.

Storage: This is the layer that physically stores the data. The most common data lake/lakehouse storage types are S3-compatible object storage and HDFS. In this layer, data is stored as files, often in open data file formats such as Parquet and Avro; metadata defining the table format may also be stored alongside the files. Open data formats are file specifications and protocols made available to the open-source community so that anyone can implement and extend them, leading to widespread adoption and large communities.
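To make the storage layer concrete, the sketch below builds Hive-style partitioned object keys of the kind found in a lakehouse bucket. The bucket name, table name and layout here are purely illustrative; in practice the table format and its writers manage this layout, not hand-built paths.

```python
from datetime import date

def object_key(table: str, partition_date: date, file_index: int) -> str:
    """Build an illustrative Hive-style partitioned object key.

    All names here are hypothetical; real layouts are produced by the
    table format's writer, not assembled by hand.
    """
    return (
        f"s3://analytics-bucket/{table}/"       # hypothetical bucket
        f"dt={partition_date.isoformat()}/"     # partition directory
        f"part-{file_index:05d}.parquet"        # file in an open format
    )

print(object_key("page_views", date(2023, 1, 15), 0))
# s3://analytics-bucket/page_views/dt=2023-01-15/part-00000.parquet
```

Partitioning the key space this way lets a query engine prune entire directories (here, by date) before reading a single file.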

Technical metadata storage/service: This component is required to understand what data is available in the storage layer. The query engine needs the metadata for the unstructured data and tables to understand where the data is located, what it looks like, and how to read it. The de facto open metadata storage solution is the Hive Metastore.
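The kind of record a metastore keeps can be sketched in a few lines. This is a toy model of the idea, not the Hive Metastore API; the table names, locations and column types are made up.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """A toy record of what a metastore tracks for one table."""
    location: str                                 # where the files live
    file_format: str                              # how to read them
    columns: dict = field(default_factory=dict)   # column name -> type

# Illustrative entries; names and paths are invented for this sketch.
metastore = {
    "sales.orders": TableMetadata(
        location="s3://lake/sales/orders/",
        file_format="parquet",
        columns={"order_id": "bigint", "amount": "double"},
    ),
}

def describe(table: str) -> TableMetadata:
    """What a query engine asks the metastore before planning a scan."""
    return metastore[table]

meta = describe("sales.orders")
print(meta.location, meta.file_format)
```

The key point is the separation of concerns: the files themselves carry the data, while this service answers "where is the table and how do I read it?" for any engine that connects.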

In an open data lakehouse an open data governance approach is also supported. Organizations can bring their existing or preferred governance solution, preventing vendor data and metadata lock-in and eliminating or minimizing additional migration efforts.

SQL Query Engine: This component is at the heart of the open data lakehouse. It executes queries against the data and is often referred to as the “compute” component. There are many open-source lakehouse query engines on the market, such as Presto and Spark. In a lakehouse architecture, the query engine is fully modular and ephemeral: it can be scaled dynamically to meet big data workload and concurrency demands, and it can attach to any number of catalogs and storage systems.
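The ability to attach one engine to many catalogs is what enables federated queries. The toy sketch below models two catalogs as in-memory tables and performs the kind of cross-catalog join a SQL engine such as Presto would plan and execute; all catalog, table and column names are invented for illustration.

```python
# Toy model of an engine attached to two catalogs. A real engine
# (e.g. Presto) would resolve these through its connectors and
# execute SQL; here plain Python stands in for the query plan.
catalogs = {
    "hive": {"web.clicks": [{"user": "a", "clicks": 3},
                            {"user": "b", "clicks": 7}]},
    "warehouse": {"crm.users": [{"user": "a", "region": "EU"}]},
}

def scan(catalog: str, table: str):
    """Resolve a table through whichever catalog it lives in."""
    return catalogs[catalog][table]

# A cross-catalog left join: enrich click data with CRM regions.
regions = {r["user"]: r["region"] for r in scan("warehouse", "crm.users")}
joined = [
    {**row, "region": regions.get(row["user"])}
    for row in scan("hive", "web.clicks")
]
print(joined)
```

Because the engine holds no data of its own, compute like this can be spun up, scaled out for a heavy workload, and torn down again without touching the storage layer.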

Beyond the basics with data lakehouse governance

Aside from the core lakehouse components, an organization will also want an enterprise data governance solution to support data quality and security. At a basic level, a data catalog and policy engine are used to define rules with business semantics, and a plugin enables the engine to enforce governance policies during query execution.

Data catalogs: These enable organizations to store business metadata, such as business terminologies and tags, to support search and data protection. A data catalog is essential to help users find the correct data for the job, and it supplies the semantic information that policies and rules rely on.

Policy engine: This component lets users define data protection policies and enables the query engine to enforce them. It is critical for an organization to achieve scalability in its governance framework. Because a policy engine is often deployed alongside the technical metadata service and the data catalog, newer and proprietary solutions often merge these components into a single service.
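How catalog tags and policy rules meet at query time can be sketched minimally. This is a toy illustration, not any vendor’s enforcement plugin: the tags, the masking rule and the data are all invented for the example.

```python
# Minimal sketch of policy enforcement during query execution:
# column tags come from the data catalog, rules from the policy
# engine, and the enforcement hook applies them to each result row.
TAGS = {"email": "PII"}                    # data-catalog column tags
POLICIES = {"PII": lambda value: "***"}    # policy-engine masking rules

def enforce(rows):
    """Mask any column whose catalog tag has a matching policy."""
    masked = []
    for row in rows:
        masked.append({
            col: POLICIES[TAGS[col]](val) if TAGS.get(col) in POLICIES else val
            for col, val in row.items()
        })
    return masked

rows = [{"user": "a", "email": "a@example.com"}]
print(enforce(rows))  # [{'user': 'a', 'email': '***'}]
```

Defining the rule once against a tag, rather than per table, is what makes this approach scale: every table whose columns carry the `PII` tag is protected without any new policy being written.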

A word on managed services

Perhaps it occurred to you that other data lake implementations already offer some of these same features. Unfortunately for many organizations, maintaining these deployments can be complex. Studies have shown that the biggest hurdle for data lake adoption is the lack of IT skills required to manage them, which is why a managed SaaS offering is key to more modern, open data lakehouse implementations.

Data lakehouse architecture is getting attention, and organizations will want to optimize the components most critical to their business. An open lakehouse architecture brings the flexibility, modularity and cost-effective extensibility that modern data science and data analytics use cases demand, while simplifying the adoption of future enhancements.


If you found this blog interesting and would like to discuss how you can experience an open data lakehouse, please contact me, Kevin Shen, at yuankai.shen@ibm.com.

Stay tuned for more blogs and updates on data lakehouse from IBM’s perspective!
