Data platform architecture has an interesting history. Toward the turn of the millennium, enterprises started to realize that reporting and business intelligence workloads required a solution separate from their transactional applications. A read-optimized platform that could integrate data from multiple applications emerged: the data warehouse.
A decade later, the internet and mobile devices started to generate data of unforeseen volume, variety and velocity, which required a different data platform solution. Hence the data lake emerged, handling both structured and unstructured data at huge volume.
Yet another decade passed, and it became clear that the data lake and the data warehouse were no longer enough to handle the business complexity and new workloads of enterprises. They are too expensive. The value of data projects is difficult to realize. Data platforms are difficult to change. The times demanded a new solution, again.
Guess what? This time, at least three different data platform solutions are emerging: data lakehouse, data fabric and data mesh. While this is encouraging, it is also creating confusion in the market. The concepts and their value propositions overlap, and at times the interpretation differs depending on who is being asked.
This article endeavors to clear up that confusion. It first explains each concept, then introduces a framework showing how the three may lead to one another or be used together.
Data lakehouse: A mostly new platform
The concept of the lakehouse was popularized by Databricks, which defined it as: “A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.”
While traditional data warehouses made use of an Extract-Transform-Load (ETL) process to ingest data, data lakes instead rely on an Extract-Load-Transform (ELT) process. Extracted data from multiple sources is loaded into cheap BLOB storage, then transformed and persisted into a data warehouse, which uses expensive block storage.
This storage architecture is inflexible and inefficient. Transformation must be performed continuously to keep the BLOB and data warehouse storage in sync, adding costs. And continuous transformation is still time-consuming. By the time the data is ready for analysis, the insights it can yield will be stale relative to the current state of transactional systems.
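The two-tier flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `blob_store` and `warehouse` objects, the function names and the sample order records are all hypothetical stand-ins, not any vendor's API.

```python
import json

# Hypothetical stand-ins: a dict plays the cheap BLOB store,
# a list of rows plays the separate warehouse storage.
blob_store: dict[str, str] = {}
warehouse: list[dict] = []


def extract_and_load(key: str, record: dict) -> None:
    """ELT steps E and L: land the raw record in BLOB storage as-is."""
    blob_store[key] = json.dumps(record)


def transform_into_warehouse() -> None:
    """ELT step T: rebuild the warehouse copy from everything in the BLOB store."""
    warehouse.clear()
    for raw in blob_store.values():
        record = json.loads(raw)
        warehouse.append({
            "customer": record["customer"].upper(),
            "amount_usd": round(record["amount_cents"] / 100, 2),
        })


extract_and_load("orders/1.json", {"customer": "acme", "amount_cents": 1250})
transform_into_warehouse()

# A new record lands, but the warehouse copy stays stale
# until the transformation runs again.
extract_and_load("orders/2.json", {"customer": "globex", "amount_cents": 990})
rows_before_sync = len(warehouse)
transform_into_warehouse()
rows_after_sync = len(warehouse)
```

Between the second load and the second transformation, the warehouse holds fewer rows than the BLOB store. That gap is the synchronization cost and staleness the paragraph above refers to.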
Furthermore, data warehouse storage cannot support workloads like Artificial Intelligence (AI) or Machine Learning (ML), which require huge amounts of data for model training. For these workloads, data lake vendors usually recommend extracting data into flat files to be used solely for model training and testing purposes. This adds an additional ETL step, making the data even more stale.
The data lakehouse was created to solve these problems. Lakehouse architectures remove the data warehouse storage layer. Instead, continuous data transformation is performed within the BLOB storage. Multiple APIs are added so that different types of workloads can use the same storage buckets. This architecture is well suited for the cloud, since Amazon S3 or Azure Data Lake Storage Gen2 can provide the requisite storage.
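As a rough sketch of that idea, the Python below keeps raw and curated data in one storage bucket and lets a BI-style function and an ML-style function read the same curated table. The `bucket` dict, the path-like keys and all function names are hypothetical. A real lakehouse would add a transactional table format and query engines on top of object storage, which this toy example does not model.

```python
# Hypothetical single storage bucket shared by every workload.
bucket: dict[str, list[dict]] = {
    "raw/orders": [
        {"customer": "acme", "amount_cents": 1250},
        {"customer": "acme", "amount_cents": 750},
        {"customer": "globex", "amount_cents": 990},
    ]
}


def refine(bucket: dict[str, list[dict]]) -> None:
    """Transform inside the same storage: write a curated table next to the raw one."""
    bucket["curated/orders"] = [
        {"customer": row["customer"], "amount_usd": row["amount_cents"] / 100}
        for row in bucket["raw/orders"]
    ]


def bi_revenue_by_customer(bucket: dict[str, list[dict]]) -> dict[str, float]:
    """BI-style API: aggregate the curated table per customer."""
    totals: dict[str, float] = {}
    for row in bucket["curated/orders"]:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount_usd"]
    return totals


def ml_feature_rows(bucket: dict[str, list[dict]]) -> list[list[float]]:
    """ML-style API: read the same curated table as numeric feature vectors."""
    return [[row["amount_usd"]] for row in bucket["curated/orders"]]


refine(bucket)
revenue = bi_revenue_by_customer(bucket)
features = ml_feature_rows(bucket)
```

Both consumers read `curated/orders` directly, so there is no second copy to keep in sync and no extra export step for model training.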
Data fabric: A mostly new architecture
The data fabric represents a new generation of data platform architecture. It can be defined as: A loosely coupled collection of distributed services, which enables the right data to be made available in the right shape, at the right time and place, from heterogeneous sources of transactional and analytical natures, across any cloud and on-premises platforms, usually via self-service, while meeting non-functional requirements including cost effectiveness, performance, governance, security and compliance.
The purpose of the data fabric is to make data available wherever and whenever it is needed, abstracting away the technological complexities involved in data movement, transformation and integration, so that anyone can use the data. Some key characteristics of data fabric are:
A network of data nodes
A data fabric is comprised of a network of data nodes (e.g., data platforms and databases), all interacting with one another to provide greater value. The data nodes are spread across the enterprise’s hybrid and multicloud computing ecosystem.
Each node can be different from the others
A data fabric can consist of multiple data warehouses, data lakes, IoT/Edge devices and transactional databases. It can include technologies that range from Oracle, Teradata and Apache Hadoop to Snowflake on Azure, Amazon Redshift on AWS or Microsoft SQL Server in the on-premises data center, to name just a few.
All phases of the data-information lifecycle
The data fabric embraces all phases of the data-information-insight lifecycle. One node of the fabric may provide raw data to another that, in turn, performs analytics. These analytics can be exposed as REST APIs within the fabric, so that they can be consumed by transactional systems of record for decision-making.
Analytical and transactional worlds come together
Data fabric is designed to bring together the analytical and transactional worlds. Here, everything is a node, and the nodes interact with one another through a variety of mechanisms. Some of these require data movement, while others enable data access without movement. The underlying idea is that data silos (and differentiation) will eventually disappear in this architecture.
Security and governance are enforced throughout
Security and governance policies are enforced whenever data travels or is accessed throughout the data fabric. Just as Istio applies security governance to containers in Kubernetes, the data fabric will apply policies to data according to similar principles, in real time.
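One way to picture policy enforcement at access time is a read function that applies every policy before returning data. The sketch below is an assumption-laden illustration: the `Policy` class, the role names and the masking rule are all hypothetical, and a real data fabric would enforce far richer policies across distributed nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: the roles allowed to read a column unmasked."""
    column: str
    allowed_roles: frozenset[str]


POLICIES = [Policy("email", frozenset({"compliance"}))]


def read_with_policies(rows: list[dict], role: str) -> list[dict]:
    """Apply every policy on each read, masking columns the role may not see."""
    masked_columns = {p.column for p in POLICIES if role not in p.allowed_roles}
    return [
        {key: ("***" if key in masked_columns else value) for key, value in row.items()}
        for row in rows
    ]


customers = [{"id": 1, "email": "ann@example.com"}]
analyst_view = read_with_policies(customers, role="analyst")
compliance_view = read_with_policies(customers, role="compliance")
```

Because the check runs on every access rather than once at load time, a change to `POLICIES` takes effect on the next read, whichever node serves the data.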
Data fabric promotes data discoverability. Here, data assets can be published into categories, creating an enterprise-wide data marketplace. This marketplace provides a search mechanism, utilizing metadata and a knowledge graph to enable asset discovery. This enables access to data at all stages of its value lifecycle.
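A minimal sketch of metadata-driven discovery might look like the following. The `DataAsset` fields, the sample catalog entries and the `search` function are hypothetical, and the knowledge graph mentioned above is not modeled. The example only shows that assets living on different nodes can be found through shared metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataAsset:
    """Hypothetical catalog entry: metadata only, the data itself stays on its node."""
    name: str
    node: str
    category: str
    tags: set[str] = field(default_factory=set)


CATALOG = [
    DataAsset("orders_raw", "lake-emea", "sales", {"orders", "raw"}),
    DataAsset("churn_scores", "warehouse-us", "customer", {"ml", "churn"}),
    DataAsset("orders_daily", "warehouse-us", "sales", {"orders", "aggregate"}),
]


def search(catalog: list[DataAsset],
           category: Optional[str] = None,
           tag: Optional[str] = None) -> list[str]:
    """Return the names of assets matching the given category and/or tag."""
    return [
        asset.name
        for asset in catalog
        if (category is None or asset.category == category)
        and (tag is None or tag in asset.tags)
    ]


sales_assets = search(CATALOG, category="sales")
churn_assets = search(CATALOG, tag="churn")
```

The search never touches the underlying data. It works purely on published metadata, which is what lets raw, aggregated and analytical assets appear side by side in one marketplace.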
The advent of the data fabric opens new opportunities to transform enterprise cultures and operating models. Because data fabrics are distributed but inclusive, their use promotes federated but unified governance. This will make the data more trustworthy and reliable. The marketplace will make it easier for stakeholders across the business to discover and use data to innovate. Diverse teams will find it easier to collaborate, and to manage shared data assets with a sense of common purpose.
Data fabric is an inclusive architecture in which some new technologies (e.g., data virtualization) play a key role. It also allows existing databases and data platforms to participate in the network, where a data catalogue or data marketplace, driven by metadata, helps in discovering data assets.
Data mesh: A mostly new culture
Data mesh as a concept was introduced by Thoughtworks, which defined it as: “…An analytical data architecture and operating model where data is treated as a product and owned by teams that most intimately know and consume the data.” The concept stands on four principles: domain ownership, data as a product, self-serve data platforms, and federated computational governance.
Data fabric and data mesh overlap as concepts. For example, both recommend a distributed architecture, unlike centralized platforms such as the data warehouse, data lake and data lakehouse. Both also promote the idea of a data product offered through a marketplace.
Differences exist as well. First, as is clear from the definition above, data mesh is about analytical data, which makes it narrower in focus than data fabric. Second, it emphasizes operating model and culture, meaning it goes beyond being just an architecture like data fabric. Finally, the nature of a data product can be generic in data fabric, whereas data mesh clearly prescribes domain-driven ownership of data products.
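To make the four principles a little more concrete, here is a small Python sketch of a domain-owned data product that must pass enterprise-wide checks before it is published. Everything in it is an assumption for illustration: the `DataProduct` fields, the three rules in `GLOBAL_RULES` and the `publish` function are hypothetical and are not part of the Thoughtworks definition.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical data product descriptor."""
    name: str
    owner_domain: str        # domain ownership
    schema: dict[str, str]   # column name -> type, the product's published contract
    description: str = ""


# Hypothetical enterprise-wide rules, written as code so they can run
# automatically for every domain (federated computational governance).
GLOBAL_RULES = {
    "has_owner": lambda product: bool(product.owner_domain),
    "has_description": lambda product: bool(product.description),
    "has_schema": lambda product: len(product.schema) > 0,
}

marketplace: dict[str, DataProduct] = {}


def publish(product: DataProduct) -> list[str]:
    """Self-serve publishing: list the product only if it violates no global rule."""
    violations = [name for name, rule in GLOBAL_RULES.items() if not rule(product)]
    if not violations:
        marketplace[product.name] = product
    return violations


ok = publish(DataProduct("orders_daily", "sales", {"day": "date", "revenue": "float"},
                         "Daily revenue per day, owned by the sales domain"))
rejected = publish(DataProduct("leads", "marketing", {"lead_id": "int"}))
```

The domain team decides what the product contains, while the shared rules stay the same for every team. That split is the federated part of the governance model.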
The relationship between data lakehouse, data fabric and data mesh
Clearly, each of these three concepts has its own focus and strengths. Yet the overlap is evident.
Lakehouse stands apart from the other two. It is a new technology, like its predecessors. It can be codified. Multiple products exist in the market, including Databricks, Azure Synapse and Amazon Athena.
Data mesh requires a new operating model and cultural change. Often such cultural changes require a shift in the collective mindset of the enterprise. As a result, data mesh can be revolutionary in nature. It can be built from the ground up in a small part of the organization before spreading to the rest of it.
Data fabric does not have the prerequisites that data mesh does. It does not expect such a cultural shift. It can be built up using existing assets in which the enterprise has invested over the years. Thus, its approach is evolutionary.
So how can an enterprise embrace all these concepts?
Address old data platforms by adopting a data lakehouse
An enterprise can adopt a lakehouse as part of its own data platform evolution journey. For example, a bank may retire its decade-old data warehouse and deliver all BI and AI use cases from a single data platform by implementing a lakehouse.
Address data complexity with a data fabric architecture
If the enterprise is complex and has multiple data platforms, if data discovery is a challenge, or if delivering data to different parts of the organization is difficult, data fabric may be a good architecture to adopt. Along with existing data platform nodes, one or more lakehouse nodes may participate. Even transactional databases may join the fabric network as nodes to offer or consume data assets.
Address business complexity with a data mesh journey
To address business complexity, an enterprise that embarks upon a cultural shift toward domain-driven data ownership, promotes self-service in data discovery and delivery, and adopts federated governance is on a data mesh journey. If a data fabric architecture is already in place, the enterprise may use it as a key enabler of that journey. For example, the data fabric marketplace may offer domain-centric data products, a key data mesh outcome. The metadata-driven discovery already established through the data fabric can be useful in discovering the new data products coming out of the mesh.
Every enterprise can look at its business goals and decide which entry point suits it best. But even though entry points or motivations may differ, an enterprise may easily use all three concepts together in its quest for data-centricity.
The post Data platform trinity: Competitive or complementary? appeared first on Journey to AI Blog.