It is often said that a picture is worth a thousand words. When it comes to describing data, however, metadata can be those thousand words. Metadata is the set of characteristics that describe the data itself: when it was last updated, what its attributes and classification are, and much more.
In this blog post, we explore why metadata should never be viewed as an afterthought, but rather as a central component of every data platform, both for understanding data and for operationalizing it. We also look at the role it plays in protecting data more widely.
The first step to a successful metadata strategy is ensuring that relevant and necessary metadata is properly captured; the second is enabling users to interact with it. This is where metadata management comes in: the practice of operationalizing this information by making it accessible to users, organizing it so that it is usable, enabling search, surfacing it to users to augment their experience, and establishing it within the organization to properly manage entire data platforms. In short, metadata should be curated, searchable, accessible, and analyzable by authorized users, next to the data itself and in a timely way, to reach its full potential.
Imagine finding an unlabeled box on a walk. You do not know how long it’s been there, where it came from, how it got there, who left it, what’s inside, whether it’s dangerous or of value to you, or even whether someone intentionally put it there for you to find. In that moment, it’s just a box. With some additional context and a few labels, you could discover what it is, whether you should take it, and whether it is dangerous to be around.
This is where metadata comes in. What makes metadata powerful is its ability to provide the necessary context about the data in question. Data on its own has only one-dimensional value; it is the insight into its properties that makes it relevant and meaningful, and that ensures it is used appropriately. Metadata provides this insight in multiple ways: it shows how data relates to other data, what kind of data it is, and how it has changed over time. Without this context, data can be misused or misinterpreted, which ultimately reduces trust in it.
When we apply metadata to data protection and data governance, it gives data the markers that determine which restrictions, controls, and handling requirements apply, at any granularity: a data source, a dataset, or even an individual cell. In short, matching data protection requirements to the data itself demands its metadata.
There are many types of metadata. In this blog post, we focus on descriptive, administrative, and contextual metadata: the characteristics of data that can be used for governance, structure, and understanding the data itself. It is also worth noting that when deciding what metadata systems should capture, general best practices for data privacy still apply, so systems should capture only the metadata that is necessary and proportionate to their authorized and justified purposes.
At Palantir, we’ve invested in building tools for all types of users — data administrators, owners, consumers, as well as data producers, modelers, business analysts, executive audiences, and others. For people who have had to work with data, these questions will be familiar:
Allowing users to leverage metadata not only ensures that their data and analytics are built with as much context as possible, but can also reduce the chaos that often comes with growing volumes of data.
Regularly capturing the necessary metadata and making it retrievable makes it operational and often far more useful. In light of this, we ensure metadata also satisfies the following:
On platforms with sensitive data, which can range from confidential company data to personally identifiable information (PII), metadata can guide and help enforce data protection and governance workflows, which is a powerful way to operationalize it. This metadata provides context and, in turn, tags the data it describes with the relevant business rules, access control restrictions, and other handling restrictions.
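To make the idea of tags driving access control concrete, here is a minimal sketch in Python. It is not Foundry's API or data model: the `DatasetMetadata` class, the tag names, the role names, and the `HANDLING_RULES` table are all hypothetical, chosen only to show how a tag on a dataset could be resolved into an access decision.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from classification tags to handling rules.
# Tag and role names are illustrative assumptions, not Foundry settings.
HANDLING_RULES = {
    "pii": {"requires_role": "privacy-approved", "mask_in_previews": True},
    "confidential": {"requires_role": "employee", "mask_in_previews": False},
}


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record: a dataset name plus its tags."""
    name: str
    tags: set = field(default_factory=set)


def required_roles(meta: DatasetMetadata) -> set:
    """Collect every role demanded by the tags on a dataset."""
    return {
        HANDLING_RULES[tag]["requires_role"]
        for tag in meta.tags
        if tag in HANDLING_RULES
    }


def can_access(user_roles: set, meta: DatasetMetadata) -> bool:
    """A user may read the dataset only if they hold every required role."""
    return required_roles(meta) <= user_roles


patients = DatasetMetadata("patients", {"pii", "confidential"})
print(can_access({"employee"}, patients))                      # False
print(can_access({"employee", "privacy-approved"}, patients))  # True
```

The point of the sketch is that the access rule never inspects the data itself. It reads only the tags, which is why accurate metadata is a precondition for enforcing this kind of restriction.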
While the uses of metadata are effectively boundless, below we describe some of the most critical uses we have seen. Palantir Foundry captures, curates, organizes, and shares metadata with users, enabling better outcomes by answering common questions for users and platform administrators, right at their fingertips. Foundry applications, models, code, and other resources and artifacts also carry important metadata, but here we focus on some of the metadata captured with Foundry datasets, because governance and data protection requirements often tie to the underlying data itself.
As mentioned above, capturing characteristics about data is the first step in understanding it. As an example, the view of a dataset in Foundry below shows a set of metadata that is available for every dataset. Some components are automatically generated, while others (such as tags and issues) are manually applied and surfaced directly to users. Here’s a description of common metadata on Foundry datasets:
On top of presenting metadata alongside the data, Foundry lets users zoom out and review the metadata across all data on the platform using its Data Lineage capability. The figure below shows how the same data pipeline can be visualized through the lens of different metadata: resource types, permissions and access, build status, out-of-date data, and more:
On top of capturing point-in-time information about datasets, Foundry also provides trend reports for certain metadata as it is captured. This not only means that metadata is regularly captured, but it can then be analyzed and viewed alongside the data itself. Here are some examples of dynamic metadata associated with Foundry datasets:
Data Health: Captures specific metrics about metadata over time, which can also be used to alert on data issues or changes. As an example, ‘issues’ (see point 7 above) can be created automatically when data issues arise. Someone looking at the data can then easily review them to see why the data might not conform to quality standards. With this context, users get a better sense of what to expect as well as what they are currently seeing.
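As a rough illustration of a health check that raises an issue automatically, the sketch below compares a dataset's latest row count against its recent history. This is an assumed example, not how Foundry's Data Health is implemented: the `Issue` class, the `check_row_count` function, and the 50% tolerance are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Issue:
    """Hypothetical issue record surfaced to whoever reviews the dataset."""
    dataset: str
    message: str


def check_row_count(dataset: str, history: list, latest: int,
                    tolerance: float = 0.5) -> Optional[Issue]:
    """Return an Issue if the latest row count deviates from the historical
    mean by more than `tolerance` (a fraction of that mean), else None."""
    if not history:
        return None  # nothing to compare against yet
    baseline = mean(history)
    if abs(latest - baseline) > tolerance * baseline:
        return Issue(
            dataset,
            f"row count {latest} deviates from recent average {baseline:.0f}",
        )
    return None


history = [1000, 1020, 980]  # row counts captured on earlier builds
print(check_row_count("orders", history, 990))  # None
print(check_row_count("orders", history, 400).message)
# row count 400 deviates from recent average 1000
```

The check only works because the row count was captured on every earlier build. That is the practical argument for recording metadata regularly rather than only at a single point in time.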
Metadata can also be used for operational purposes such as for data governance and protection. This section provides a series of examples of how metadata in Foundry helps inform users when there is sensitive data and how they can easily use the available metadata to search and see that in the context of their other data.
Searchable Metadata: Foundry also enables users to easily search for columns, tags, descriptions, and other metadata across the platform. Users can even search for metadata such as column names across all datasets within a specific realm to find out which datasets or resources contain sensitive data.
Operational Metadata for Review: The same kind of search can be run over Foundry issues. For example, when PII might be detected, an issue is triggered for a Data Administrator to review:
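The kind of cross-dataset column search described above can be sketched over a small in-memory catalog. The catalog structure, dataset names, and helper functions here are hypothetical and stand in for whatever metadata store a platform exposes; this is not Foundry's search API.

```python
# Hypothetical metadata catalog: one entry per dataset.
catalog = [
    {"dataset": "customers",
     "columns": ["customer_id", "email", "signup_date"], "tags": ["pii"]},
    {"dataset": "orders",
     "columns": ["order_id", "customer_id", "total"], "tags": []},
    {"dataset": "support_tickets",
     "columns": ["ticket_id", "email", "body"], "tags": []},
]


def datasets_with_column(catalog, column):
    """Return the names of datasets that have a column with this name
    (case-insensitive exact match)."""
    needle = column.lower()
    return [
        entry["dataset"]
        for entry in catalog
        if needle in (c.lower() for c in entry["columns"])
    ]


def untagged_datasets_with_column(catalog, column, tag):
    """Datasets that contain the column but are missing the expected tag."""
    hits = set(datasets_with_column(catalog, column))
    return [
        entry["dataset"]
        for entry in catalog
        if entry["dataset"] in hits and tag not in entry["tags"]
    ]


print(datasets_with_column(catalog, "email"))
# ['customers', 'support_tickets']
print(untagged_datasets_with_column(catalog, "email", "pii"))
# ['support_tickets']
```

The second query is the governance-oriented one: it surfaces datasets that appear to hold sensitive columns but have not yet been tagged, which is the sort of result an administrator might review.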
As we have learned over the last 20 years, data is of course critical, but metadata adds richness that allows an understanding of the different facets of the data itself. These facets can also inform data protection and governance workflows, which are often manual and tedious to maintain but necessary to enforce at scale and in day-to-day use. To take full advantage of metadata, Foundry captures it programmatically, gives users analysis tools to see trends, and further operationalizes data protection and governance to ensure that data is used responsibly and accountably.
Palantir has worked with data in over 40 industries in its mission to solve the world’s hardest problems for vital institutions. Over time and across our customer base, we have learned that metadata can often be as valuable as the data itself in providing transparency into the underlying information. The ability to capture, access, analyze, and operationalize metadata gives our users another dimension for understanding the data and democratizes it by giving people the tools to understand the context around the data they need to use.
We built Foundry to empower our users, the true experts of their data, and our products seek to ensure that those on the frontlines have the necessary details about their data to make the best decisions when they need them most.
Alice Yu, Privacy & Civil Liberties Commercial and Public Health Lead, Palantir Technologies
Metadata Management for Data Protection was originally published in Palantir Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.