There is such a thing as too much data: a point at which it becomes too overwhelming to figure out how to use it properly. Keeping track of and making sense of what data your organization has, where it came from, who owns it, and how it can be used, alongside a myriad of other questions, can be a daunting task for any organization. These problems are no longer 'nice-to-haves': advancing data protection regulation means that organizations around the world are expected not only to know what data points they collect and hold, but, even more critically, where all of the sensitive data about their customers is.
Among the core lessons we have learned while helping customers grow their data ecosystems over time is that as operating systems scale, they require the right tools to sort through the growing volume of data and to bring order to what can otherwise become a chaotic environment. Palantir Foundry, our operating system for the modern enterprise, is designed to address the very pain points associated with this scaling. As user numbers, data volume, and the associated complexity all grow, data management and organization are key to enabling success.
We believe that organizations should invest in data catalogs to create an inventory of all data on a platform. A data catalog serves as a single landing place to store, collect, search, surface, and analyze information about the data available across the platform. It gives users the ability to search across all of these characteristics and helps them find and appropriately use the data they need, while simultaneously empowering data owners and administrators to tag critical information for users on how that data should be used. At Palantir, we invest in tools that provide transparency about data, which is critical to enforcing data governance on the platform.
No two data catalogs are the same, but all of them are designed to bring order to ways of working. They centralize information and processes in a single 'go-to' resource, which can then be used to standardize processes for better user management, rather than data and operations being bottlenecked by a central administrator manually quarterbacking information. Some of the reasons data catalogs are useful across an organization include:
The first step to building a data catalog is knowing what information to collect about your data and whether that information can be detected automatically or needs to be submitted manually. In reality, this will likely evolve over time based on factors like regulatory needs and internal processes, so a flexible data model is key.
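As a purely illustrative sketch, assuming a simple Python representation rather than any Foundry-specific schema, a flexible catalog entry might look like a small record with a free-form attributes map that can absorb new questions as requirements evolve:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogEntry:
    """Illustrative catalog record; field names are assumptions, not a Foundry schema."""
    name: str                              # human-readable dataset name
    source_system: str                     # where the data originates
    owner: str                             # accountable data owner or team
    contains_pii: bool = False             # does the data include personal information?
    approval_status: str = "proposed"      # e.g. proposed / approved / rejected
    handling_notes: Optional[str] = None   # anonymization, retention, or quality notes
    attributes: dict = field(default_factory=dict)  # free-form answers to newer questions

# A proposed data source, captured before any data is ingested:
proposal = CatalogEntry(
    name="customer_transactions",
    source_system="core-banking",
    owner="retail-data-team",
    contains_pii=True,
    attributes={"retention_period_days": 365},
)
```

The free-form attributes field is what keeps a model like this flexible as new regulatory or internal questions are added over time.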
We'll take the general example of using a data catalog first to capture information about proposed data sources for approval by a data governance lead or administrator prior to the data being ingested, and then to serve as the inventory as the data lands and is prepared in the platform. Below are common questions we typically see captured about data, both to understand the data itself and to provide context on how it should be handled and used:
Using Foundry, organizations can leverage existing applications and templates to set up a data catalog for all data, not only within Foundry but also across their organization. For the purposes of this blog post, we'll focus on data cataloging within Foundry, but the approach is easily extensible to tracking data in other systems. We'll walk through what this often looks like for many of our clients in the platform:
The first step to any data catalog is capturing the relevant data and information about it. This can be done in various ways. Most commonly, we see this done with a form that captures data as part of the data approval process. This allows platform administrators and governance leads to centralize their review of any data requests and proposals prior to the data being transferred onto the platform. This is where basic information, rationale, and purposes can be captured from requestors for review. As noted above, this questionnaire is designed to be flexible in order to capture new requirements and questions as the platform evolves.
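For illustration only, and assuming hypothetical field names and review states rather than Foundry functionality, the answers captured by such a form and the governance decision recorded against them might be modeled like this:

```python
# Hypothetical intake-form answers for a proposed data source; not a Foundry API.
data_request = {
    "dataset_name": "customer_transactions",
    "source_system": "core-banking",
    "requested_by": "retail-analytics",
    "purpose": "Next-best-offer modelling by geography",
    "contains_pii": True,
    "approval_status": "pending_review",
}

def record_decision(request: dict, decision: str, reviewer: str, notes: str = "") -> dict:
    """Attach a governance decision to a request before any data is ingested."""
    if decision not in {"approved", "rejected", "needs_more_info"}:
        raise ValueError(f"unknown decision: {decision}")
    return {**request, "approval_status": decision,
            "reviewed_by": reviewer, "review_notes": notes}

# Example: a governance lead approves the proposal with a handling condition.
approved = record_decision(
    data_request, "approved", reviewer="privacy-office",
    notes="Account numbers must be hashed before broad sharing",
)
```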
The information captured here can be pivotal to how the data can be used, especially with respect to the different data protection regulations around the world, whether tied to geographies, sectors, or types of data. Data handling requirements can range from anonymization and retention requirements to data quality concerns. Establishing a method for oversight and review is important both to ensure data platforms align and comply with these requirements and to ensure users know how to leverage data appropriately.
Another way this data can be populated is through automation whenever data lands in the platform. Foundry can support this by automatically detecting new data coming onto the platform and regularly pulling metadata about those tables, such as columns and data formats, from the source. The method selected largely depends on the ideal data governance workflow, specifically on whether data needs to be approved before landing in Foundry.
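Conceptually, and purely as a sketch rather than a description of how Foundry implements this, automated registration amounts to inspecting a newly landed table and writing its basic metadata into the catalog:

```python
def infer_metadata(table_name: str, rows: list[dict]) -> dict:
    """Derive simple catalog metadata (columns, types, row count) from a newly landed table."""
    columns: dict[str, str] = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, type(value).__name__)
    return {
        "dataset_name": table_name,
        "columns": columns,
        "row_count": len(rows),
        "approval_status": "auto_registered",  # flagged for later governance review
    }

# A newly landed table, represented here as in-memory rows purely for illustration.
landed_rows = [
    {"customer_id": 101, "postcode": "EC1A 1BB", "balance": 2500.0},
    {"customer_id": 102, "postcode": "SW1A 2AA", "balance": 310.5},
]
print(infer_metadata("customer_accounts", landed_rows))
# columns -> {'customer_id': 'int', 'postcode': 'str', 'balance': 'float'}
```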
Once the data has been identified and the relevant metadata and information captured, all of it can be deposited into a user-friendly data catalog. The data catalog provides a single pane of glass for everything involving data on the platform. This includes the list of all data available, the place to input new information, the functionality to search across data, and much more.
Below is an example setup of a "Data Source Tracker", a data catalog for a financial institution. Users can scroll through all data, filter or search for specific characteristics and statuses, and link directly to the source. Here, the Data Source Tracker signals to users whether a dataset contains PII, whether it has been approved, and whether it is ready for users to access.
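As a stand-in for that example, here is a hypothetical slice of what such a tracker might hold; the dataset names, flags, and handling notes are illustrative assumptions rather than a real configuration:

```python
# Hypothetical rows from a "Data Source Tracker"; names and flags are illustrative only.
tracker = [
    {"dataset": "customer_transactions", "contains_pii": True,
     "approval_status": "approved", "ready_for_use": True,
     "handling": "Hash account numbers before sharing"},
    {"dataset": "branch_footfall", "contains_pii": False,
     "approval_status": "approved", "ready_for_use": True,
     "handling": None},
    {"dataset": "loan_applications", "contains_pii": True,
     "approval_status": "pending_review", "ready_for_use": False,
     "handling": "Do not use until approved"},
]

# A user browsing the tracker can see at a glance which sources are safe to touch.
for row in tracker:
    status = "ready" if row["ready_for_use"] else "not yet available"
    print(f'{row["dataset"]}: PII={row["contains_pii"]}, {row["approval_status"]}, {status}')
```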
In an organization without a data catalog, data is typically scattered across departments, and users need to know the right person to ask in order to retrieve it and be granted the correct access. With the Data Source Tracker, that repository of knowledge is at the fingertips of the community providing the information, with approvals and oversight managed in the common store.
When it comes to data protection, the Data Source Tracker also gives users who are encountering data for the first time on the platform a curated view of what they need to know about it. This is particularly relevant when data needs to be handled in specific ways or used only for specific purposes.
For instance, some data may carry restrictions on what PII can be shared for certain purposes, while other data may need to be aggregated or de-identified prior to further sharing. The power of the Data Source Tracker is that the data governance instructions sit alongside the data, streamlining the process of getting context and using the data.
Much of the power of the data catalog then comes from the ability to quickly search for information across the platform. Whether looking for particular columns or characteristics, dynamic search helps users find what they are looking for faster and more comprehensively. For instance, at a financial institution, analysts may be looking for data to build next-best-offer models based on geography. They might start by seeing which datasets contain PII about customers, filtering to those datasets, and then finding which tables contain individuals' addresses.
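Using the same kind of illustrative catalog entries as above (a plain list of records, not a Foundry API), that search might look like filtering on PII flags and column names:

```python
# Illustrative search over catalog entries; dataset and column names are assumptions.
catalog = [
    {"dataset": "customer_profiles", "contains_pii": True,
     "columns": ["customer_id", "name", "address", "postcode"]},
    {"dataset": "customer_transactions", "contains_pii": True,
     "columns": ["customer_id", "amount", "merchant"]},
    {"dataset": "branch_footfall", "contains_pii": False,
     "columns": ["branch_id", "date", "visitors"]},
]

def find_datasets(entries: list[dict], require_pii: bool, column_keyword: str) -> list[str]:
    """Return datasets matching the PII flag that have a column containing the keyword."""
    return [
        e["dataset"] for e in entries
        if e["contains_pii"] == require_pii
        and any(column_keyword in col for col in e["columns"])
    ]

# An analyst looking for customer address data for a geography-based model:
print(find_datasets(catalog, require_pii=True, column_keyword="address"))
# -> ['customer_profiles']
```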
From our work with everyone from car manufacturers to pharmaceutical companies and public health agencies around the world, we have seen that as organizations with large-scale operations become more data-driven, the transition demands systems and processes, such as data catalogs, to responsibly govern data across a data platform.
Data catalogs give platform administrators the ability to centrally funnel and manage data coming onto the platform and to signal data governance controls, while giving users the ability to quickly search, filter, and navigate to the data they need. The catalog becomes a central component of data governance, while leveraging the underlying flexibility of Foundry to adapt over time. This investment in making data cataloging easy and adaptable unlocks the power of data platforms by structuring data, making users, and ultimately organizations, more effective in their day-to-day operations.
Alice Yu, Privacy & Civil Liberties Commercial and Public Health Lead, Palantir Technologies