How to modernize data lakes with a data lakehouse architecture

Data lakes have been around for well over a decade now, supporting the analytic operations of some of the world’s largest corporations. Some argue, though, that the vast majority of these deployments have become data “swamps”. Whichever side of that debate you sit on, the reality is that a lot of data is still held in these systems, and data volumes of that size are not easy to move, migrate or modernize.

The challenges of a monolithic data lake architecture

Data lakes are, at a high level, single repositories of data at scale. Data may be stored in its raw original form or optimized into a different format suitable for consumption by specialized engines.

In the case of Hadoop, one of the more popular data lakes, the promise of implementing such a repository using open-source software and running it all on commodity hardware meant you could store a lot of data on these systems at very low cost. Data could be persisted in open data formats, democratizing its consumption, and replicated automatically, which helped sustain high availability. The default processing framework offered the ability to recover from failures mid-flight. This was, without question, a significant departure from traditional analytic environments, which often meant vendor lock-in and the inability to work with data at scale.

An unexpected challenge was the introduction of Spark as a processing framework for big data. It gained popularity rapidly, given its support for data transformations, streaming and SQL, but it never co-existed amicably with existing data lake environments. As a result, organizations often had to stand up additional dedicated compute clusters just to run Spark.
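
Spark’s appeal is easy to see in practice. What follows is a minimal, illustrative PySpark sketch (the file path and column names are hypothetical) of what drew users in: programmatic transformations and SQL over the same data, in one framework:

```python
# A minimal sketch of Spark's unified model: batch transformations
# and SQL in one framework. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-example").getOrCreate()

# Read raw events persisted in an open format (Parquet) in the lake.
events = spark.read.parquet("/data/lake/raw/events")

# The same DataFrame supports programmatic transformations...
daily = events.groupBy(F.to_date("event_ts").alias("day")).count()

# ...and plain SQL over the same data.
events.createOrReplaceTempView("events")
top = spark.sql(
    "SELECT user_id, COUNT(*) AS n FROM events "
    "GROUP BY user_id ORDER BY n DESC LIMIT 10"
)

daily.show()
top.show()
```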

Fast forward almost 15 years, and reality has clearly set in about the trade-offs and compromises this technology entailed. Its fast adoption meant that customers soon lost track of what ended up in the data lake. And, just as challenging, they could not tell where the data came from, how it had been ingested or how it had been transformed along the way. Data governance remains a largely unexplored frontier for this technology. Software may be open, but someone needs to learn how to use it, maintain it and support it. Relying on community support does not always yield the turnaround times demanded by business operations. High availability via replication meant more data copies on more disks, more storage cost and more frequent failures. And a highly available distributed processing framework meant giving up performance in favor of resiliency (we are talking orders-of-magnitude performance degradation for interactive analytics and BI).

Why modernize your data lake?

Data lakes have proven successful where companies have been able to narrow their focus to specific usage scenarios. But it has become clear that there is an urgent need to modernize these deployments and protect the investment in infrastructure, skills and data held in those systems.

In a search for answers, the industry looked at existing data platform technologies and their strengths. It became clear that an effective approach was to bring together the key features of traditional (legacy, if you will) warehouses or data marts with what worked best from data lakes. Several items quickly rose to the top as table stakes:

  • Resilient and scalable storage that could satisfy the demand of an ever-increasing data scale.
  • Open data formats that kept the data accessible by all but optimized for high performance and with a well-defined structure.
  • Open (sharable) metadata that enables multiple consumption engines or frameworks.
  • Ability to update data (ACID properties) and support transactional concurrency, as illustrated in the sketch after this list.
  • Comprehensive data security and data governance (e.g., lineage, plus full-featured data access policy definition and enforcement, including for geo-dispersed deployments).
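
The ACID requirement in particular is worth seeing in action. Below is a minimal sketch, assuming Apache Iceberg as the open table format and Spark as the engine; the catalog name, warehouse path and table are hypothetical, and the Iceberg runtime package must match your Spark and Scala versions:

```python
# A minimal sketch of ACID-style updates on an open table format,
# using Apache Iceberg through Spark SQL. Catalog name, warehouse
# path and table are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("acid-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders "
          "(id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'open'), (2, 'open')")

# Row-level UPDATE and DELETE are transactional: concurrent readers
# see either the old snapshot or the new one, never a partial write.
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE id = 1")
spark.sql("DELETE FROM demo.db.orders WHERE id = 2")

spark.sql("SELECT * FROM demo.db.orders").show()
```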

The above has led to the advent of the data lakehouse. A data lakehouse is a data platform which merges the best aspects of data warehouses and data lakes into a unified and cohesive data management solution.

Benefits of modernizing data lakes to watsonx.data

IBM’s answer to the current analytics crossroads is watsonx.data. This is a new open data store for managing data at scale that allows companies to surround, augment and modernize their existing data lakes and data warehouses without the need to migrate. Its hybrid nature means you can run it on customer-managed infrastructure (on-premises and/or IaaS) and in the cloud. It builds on a lakehouse architecture and embeds a single set of solutions (and a common software stack) for all form factors.

In contrast with competing offerings in the market, IBM’s approach builds on an open-source stack and architecture. These are not new components but well-established ones in the industry. IBM has taken care of their interoperability, co-existence and metadata exchange. Users can get started quickly, dramatically reducing the cost of entry and adoption, because the high-level architecture and foundational concepts are familiar and intuitive:

  • Open data and table formats over object storage
  • Data access through S3
  • Presto and Spark for compute consumption (SQL, data science, transformations and streaming)
  • Open metadata sharing (via Hive and compatible constructs)
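
To make this concrete, here is a minimal, hypothetical sketch of that access pattern: an engine (Spark, in this case) reading open-format data from an S3-compatible object store. The endpoint, credentials, bucket and package versions are placeholders rather than watsonx.data specifics:

```python
# A hypothetical sketch of the access pattern above: Spark reading
# open-format (Parquet) data over the S3 protocol. Endpoint,
# credentials and bucket are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3-access-sketch")
    # hadoop-aws provides the s3a:// filesystem; the version must
    # match your Hadoop distribution.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.endpoint", "https://object-store.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Any engine that speaks S3 and understands the open format can read
# the same files; nothing here is specific to a single vendor.
df = spark.read.parquet("s3a://demo-bucket/sales/")
df.printSchema()
```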

Watsonx.data offers companies a means of protecting their decades-long investment in data lakes and warehousing. It allows them to expand immediately and modernize gradually, focusing each component on the usage scenarios most important to them.

A key differentiator is the multi-engine strategy, which allows users to leverage the right technology for the right job at the right time, all via a unified data platform. Watsonx.data enables customers to implement fully dynamic tiered storage (and associated compute). Over time, this can lead to very significant data management and processing cost savings.

And if, ultimately, your objective is to modernize your existing data lake deployments with a modern data lakehouse, watsonx.data facilitates the task by minimizing data and application migration through its choice of compute engines.

What can you do next?

Over the past few years, data lakes have played an important role in most enterprises’ data management strategies. If your goal is to evolve and modernize your data management strategy towards a truly hybrid analytics cloud architecture, then IBM’s new data store built on a data lakehouse architecture, watsonx.data, deserves your consideration.

Read the watsonx.data solution brief

Explore the watsonx.data product page
