Safeguarding Privacy in Healthcare through Systematic Data Deletion

The challenges of keeping health data private — and why lineage-aware data deletion matters

The global COVID-19 crisis, the first pandemic of its kind in recent history, brought about an unprecedented set of challenges. Faced with a public health crisis, governments around the world established far-reaching emergency measures in an attempt to “flatten the curve” and protect citizens. In particular, governments codified extensive data collection practices at an institutional level to support these critical public health initiatives. These included the capture of data such as patient health information, COVID test result data, and even some location information (e.g., for contact tracing purposes). The majority of these measures were framed as temporary, but some ended up being extended for months, even years.

As the pressures of the pandemic have eased and a semblance of normalcy has returned, the need to sustain emergency public health directives has subsided. Absent the initial exigency that motivated the amassing of this data, the question becomes: are there legitimate purposes for continuing to hold this data (and if so, in what form), or is it time to purge the records? In some cases, data may be of value for future research, but even that would require serious consideration of the appropriate level of de-identification. The most sensitive data often should, in accordance with necessity and proportionality considerations and in the interests of public trust, be prepared for deletion.

As we’ve discussed in a prior post, Palantir has a deep understanding of the importance of deletion and the realities of carrying out deletion workflows in practice. This post aims to describe the subtleties of deletion in greater detail.

The Complexities of Deletion in a Data Platform

To day-to-day users of technology, deletion is a familiar concept, and may seem quite simple. You drag a file into the trash bin on your computer desktop and expect the file to be deleted when the trash bin is emptied. But deleting data from large data systems is much more involved, for two reasons.

The first reason is the degree of deletion. Moving a file into a virtual trash bin is an inherently reversible action. One can pull the file out of the trash bin, and pretend it was never deleted in the first place. This kind of deletion, known as “soft deletion”, is generally considered insufficient for sensitive data. While it may remove access rights, “soft deletion” does not reliably and irreversibly remove the information from the data platform. Our end goal, instead, is what we call “hard deletion”, a process that eventually makes data irreversibly inaccessible. This is the standard that governments, and the institutions that serve as stewards of sensitive records worldwide, will likely aspire to.
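To make the distinction concrete, here is a minimal sketch in Python. It is an illustration only: ToyStore, Record, and every method name are hypothetical, and nothing here reflects how Foundry or any real data platform implements deletion. The point is simply that a soft delete hides a record behind a flag and can be undone, while a hard delete removes the stored data itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Record:
    payload: str
    deleted: bool = False  # soft-delete marker; the payload is still stored


@dataclass
class ToyStore:
    """Hypothetical in-memory store, for illustration only."""
    records: Dict[str, Record] = field(default_factory=dict)

    def put(self, key: str, payload: str) -> None:
        self.records[key] = Record(payload)

    def get(self, key: str) -> Optional[str]:
        rec = self.records.get(key)
        if rec is None or rec.deleted:
            return None  # missing and soft-deleted records look the same to readers
        return rec.payload

    def soft_delete(self, key: str) -> None:
        # Removes access, but the data is still there.
        self.records[key].deleted = True

    def restore(self, key: str) -> None:
        # Soft deletion is reversible: clearing the flag brings the data back.
        self.records[key].deleted = False

    def hard_delete(self, key: str) -> None:
        # Removes the stored record itself; there is nothing left to restore.
        del self.records[key]


store = ToyStore()
store.put("test-result-1", "positive")

store.soft_delete("test-result-1")
hidden = store.get("test-result-1")              # access is removed...
still_stored = "test-result-1" in store.records  # ...but the data remains

store.restore("test-result-1")
restored = store.get("test-result-1")            # the soft delete was undone

store.hard_delete("test-result-1")
gone = "test-result-1" not in store.records      # no record left to bring back
```

In a real platform, hard deletion also has to reach backups, caches, and replicas, which is one reason the process makes data inaccessible "eventually" rather than instantly. The sketch leaves all of that out.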

The second reason pertains to specificity. A data platform, with its massive scale, may hold copies and derivations of raw data in many different places, some or all of which may require deletion. An automated, precise, and reliable process is needed to identify and delete every derivation of sensitive data, whole or partial.

For example, if a user deletes an email from their account, it doesn’t necessarily mean that the email is also deleted from the recipients’ inboxes. This policy makes sense for an email client, but for a data platform that holds sensitive data with an expiration date, it is critical to ensure the erasure of that data regardless of where it has proliferated within the platform.

Keeping Track of Data Lineage

In complex systems, data can be replicated many times, combined with other data, and stored in different ways to support varied use cases. Conceptually, we describe this as “data lineage”: the full path of data, from its origin to its final form. When a piece of data needs to be deleted, generally all of its “descendants” in the lineage must also be deleted in order to ensure completeness.
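One way to picture this is as a graph walk. The sketch below is a simplified illustration, not a description of Foundry's internals: the dataset names and the lineage mapping are invented, and the lineage is assumed to be a plain dictionary from each dataset to the datasets derived directly from it. A breadth-first traversal then collects every descendant of the dataset being deleted.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets derived directly from it.
lineage = {
    "raw_test_results": ["cleaned_results", "contact_tracing_join"],
    "cleaned_results": ["weekly_report"],
    "contact_tracing_join": ["weekly_report"],
    "weekly_report": [],
    "hospital_capacity": ["capacity_dashboard"],
    "capacity_dashboard": [],
}


def descendants(graph, root):
    """Return every dataset reachable from root by following derivation edges."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # a dataset may be reachable by several paths
                seen.add(child)
                queue.append(child)
    return seen


def deletion_set(graph, root):
    """The dataset itself plus everything derived from it."""
    return {root} | descendants(graph, root)


to_delete = deletion_set(lineage, "raw_test_results")
```

Here to_delete contains raw_test_results, cleaned_results, contact_tracing_join, and weekly_report, while the unrelated hospital_capacity branch is left alone. Note that weekly_report is reached through two paths but collected only once, which is why the traversal tracks the datasets it has already seen.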

Palantir has invested considerable development effort in ensuring that our Foundry product rigorously tracks the “data lineage” of all data processed through the platform. This enables Foundry users to reliably, systematically, and conclusively perform “hard deletions” of data along its full lineage.

We believe that this capability will be particularly valuable for public and private sector organizations that are responsible for ensuring the comprehensive deletion of emergency healthcare data — no matter how broadly it may have been shared in the course of legitimate use.

In short, the annulment of emergency authorities and a return to pre-pandemic health information norms require reliable methods of safeguarding privacy through sound data protection practices, including robust data deletion. For more information, check out our video on Data Lineage and Deletion above.


Safeguarding Privacy in Healthcare through Systematic Data Deletion was originally published in Palantir Blog on Medium.