Privacy-Enhancing Technologies (PETs) ensure the responsible use of personal data, but many organizations misunderstand how they are best deployed.
Editor’s note: This is the sixth post in the Palantir RFx Blog Series, which explores how organizations can better craft RFIs and RFPs to evaluate digital transformation software. Each post focuses on one key capability area within a data ecosystem, with the goal of helping companies ask the right questions to better assess technology. Previous installments in this series have included posts on Ontology, Data Connection, Version Control, Interoperability, and Operational Security.
At Palantir, we’ve invested enormous resources and effort establishing capabilities, procedures, and a general culture of responsibility around enabling our customers — whose data is among the most sensitive in the world — to carry out their critical data-driven missions. We’re not a data broker and we don’t assume control or ownership of customer data (i.e., we do not determine the means and purposes of how our customers process their data). Drawing on our expertise and deep commitment to developing, building, and deploying technologies that enable our customers to effectively and responsibly use their data, this post focuses on privacy-enhancing technologies (PETs), exploring what they are, why they’re important, and suggestions for how best to evaluate PETs.
Privacy is complex and nuanced, especially when it comes to data. In general, data or informational privacy refers to the interests of individuals and groups in determining how their data will be collected, used, managed, and shared over time. Many principles of data privacy protection and fair use overlap with information security principles. For individuals and groups, privacy questions center on how their data is used and how it will impact their livelihoods and well-being. The family of technologies that underwrite these privacy interests is referred to as “Privacy-Enhancing Technologies” or PETs.
For technologists and program managers looking to leverage information systems to support their work, it is often helpful to think about data privacy in terms of the privacy harms or risks we wish to avoid, and to use that risk profile to shape how PETs are implemented.* This approach can help identify the most important ways that privacy interests can be undermined:
- Malicious External Vector: The malicious external vector is an actor who is attempting to hack or steal data so they can take advantage of its content.
- Malicious Internal Vector: The malicious internal vector is an actor who tries to leverage their privileged access to take advantage of sensitive information and its content in ways that defy or exploit the expectations for data use from those parties with a privacy interest.
- Well-meaning Internal Actor: The “friendly” internal vector is an actor who believes they are using data in appropriate ways, but is actually not doing so in accordance with the expectations of the individuals and organizations whose data is implicated. In many ways this is the most worrisome type of actor.
PETs are important because they provide the critical information tools for managing these risks and therefore help form a foundation of trust in information systems. If individuals or groups are going to consent to placing their personal data into systems, they need to trust that those systems will effectively protect their privacy.
What are Privacy-Enhancing Technologies (PETs)?
Privacy-Enhancing Technologies (PETs) are technologies that are designed to implement principles of data protection at various stages of the data-use lifecycle in order to minimize risks of misuse and to help ensure the responsible, lawful, and secure use of personal data. PETs take many shapes and forms, including: standard data security techniques, cryptographic algorithms, masking sensitive fields, decentralized data processing, and hardware-oriented solutions.
PETs are typically part of a broader data infrastructure and are best understood (and implemented) as instruments within that system rather than as standalone tools. Integrated together with other PETs, as well as with other technical products or organizational governance procedures, PETs can produce a holistic and configurable system for data governance when optimally deployed.
Before adopting or implementing relevant PETs, it’s important for organizations to ask specific questions related to their use and application. These questions help organizations establish what types of PETs will be relevant to their workflows and the privacy interests they are charged with protecting. Organizations must also consider their internal accountability, oversight, and governance structures. Namely, organizations should consider impacts across four categories:
- Users: How many users will have access to this data? How will this change over time? Risk grows with each user that gains access to the system. This is particularly important if users have an interest in what is represented by the data, perhaps to try to re-identify the data, or to learn about public figures or people they know.
- Permissions: How much data can users access? What other data can they access (i.e., outside of the platform where they access the deidentified data), and could this be combined with the deidentified data? Do these users have permissions that would allow them to import, export, or transfer the data in unanticipated ways?
- Policies: Are there clear data governance policies in place? How well does the average user understand them? Does the platform enforce these policies? Can data governance teams monitor and measure compliance?
- Metadata: Are datasets within the platform clearly labelled and described so that data governance and operational users can quickly understand their sensitivity, intended use, and the applicable policy protections?
Anonymization and PETs
One of the most well-known and established approaches to privacy-protective use of data is anonymization. We have previously explained the ins and outs of data anonymization at length and encourage interested readers to consult our white paper for a more detailed technical treatment of the topic.
Anonymization refers to a class of processes through which identifiable information can be removed from a dataset so that people described in the data cannot be reasonably re-identified, but some persisting form of the data can still be used to perform useful analysis. Historically, anonymization has been the workhorse of informational privacy. For example, anonymization has been used extensively in public health where learning about trends in diseases can have life-saving impacts but where the data about an individual patient deserves special protection due to the special sensitivity of personal health information.
The problem with anonymization is that it suggests a binary notion of privacy protection: anonymized data presents no privacy risks; non-anonymized data is risky. This simplified construct has been demonstrated to be misleading and incorrect. Data can almost never be fully anonymized (there will always be some residual re-attribution risk in the face of a sufficiently motivated and well-resourced adversary) and the degree to which one pursues complete anonymization often comes at the price of diminished utility of the data.
We therefore recommend that practitioners avoid using the term “anonymization” (and even slightly more nuanced concepts like “pseudonymization”) in RFx documents and more broadly in discussions of PETs. Instead, we suggest a focus on “de-identification” as a concept that captures the spectral rather than binary nature of privacy risks. Focusing on de-identification enables a more clear-eyed view of the range of available techniques, the degree to which each reduces (but does not necessarily eliminate) re-identification risk, and the corresponding benefits and drawbacks of any given technique.
Basic vs. Exotic PETs
When it comes to implementing information systems that protect data privacy, it is generally best to start with tried and true technologies that reinforce critical privacy protection principles (“basic PETs”) before turning to novel, but less proven approaches (“exotic PETs”). Basic PETs include those that carry out various forms of data de-identification and are often deployed as components of a more comprehensive privacy framework that builds in redundancy and resiliency against a variety of potential intrusions, attacks, and inadvertent failures.
Any given PET is therefore only effective if the underlying data foundation it is constructed upon is sound. To this end, organizations need strong controls over their data processing operations, including the ability to check their data for quality, accuracy, and representativeness. In the absence of these controls, organizations will struggle to configure and apply PETs effectively and sustainably. This is true no matter what specific form or platform architecture the data ecosystem takes (e.g., whether data is held in a single data lake, a federated system, or various siloed systems).
Some programs, however, may be tempted to bypass the basics and rely on novel or “exotic” PETs to serve as a single, silver-bullet solution to address privacy risks. While the allure of exotic PETs is understandable, it is often more important that program managers establish outcome-oriented project requirements. Program managers should develop a clear understanding of the security and privacy outcomes they hope to achieve through the use of PETs (or other tools, such as governance principles), and craft their requirements specifically to meet those goals. A few key questions to consider asking in this regard are:
- How sensitive is the data? There are many ways data could be sensitive: it could contain information on protected characteristics such as health, gender, or ethnicity; or it may be in some other sense intimate, personal, or confidential. A related question to ask is “What would the potential harm to these individuals be if this information were misused?”
- How easy is it to re-identify the data? To answer this question, consider how unique the individual data point is, i.e., how many individuals could it apply to? The fewer people, the higher the risk of re-identification.
- What happens if the data is joined with other data? Consider the other data in your system, both now and in the future. Could that data, joined with data deidentified through a given technique, result in a foreseeably significant re-identification risk? How likely is such a join to occur in the system (or if the data is published elsewhere)? What protections are in place to guard against it?
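The uniqueness and join questions above can be explored with a simple count of how many records share each combination of indirectly identifying fields. The following is a minimal sketch under stated assumptions, not a complete risk assessment: the records, the field names, and the helper functions are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical records; every field name and value is invented for illustration.
records = [
    {"zip3": "941", "age_range": "30-39", "diagnosis": "asthma"},
    {"zip3": "941", "age_range": "30-39", "diagnosis": "diabetes"},
    {"zip3": "941", "age_range": "40-49", "diagnosis": "asthma"},
    {"zip3": "100", "age_range": "30-39", "diagnosis": "flu"},
    {"zip3": "100", "age_range": "30-39", "diagnosis": "asthma"},
]

def group_sizes(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)

def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` applies to no other row."""
    sizes = group_sizes(rows, fields)
    unique_rows = sum(1 for row in rows if sizes[tuple(row[f] for f in fields)] == 1)
    return unique_rows / len(rows)

# Each additional field that can be joined in shrinks the groups,
# so more rows end up describing exactly one person.
for fields in (["zip3"], ["zip3", "age_range"], ["zip3", "age_range", "diagnosis"]):
    print(fields, unique_fraction(records, fields))
```

In this toy data no record is unique on the first field alone, but every record becomes unique once all three fields are combined, which is the effect the join question is meant to surface.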
As consumer demands and privacy regulations evolve to more heavily emphasize data privacy protections in an increasingly digital world, it is tempting for industry and government programs to pursue increasingly exotic privacy technology solutions. Promises and hype abound; technologies like fully homomorphic encryption (FHE), differential privacy, synthetic data generation, secure multi-party computation (SMPC), etc., are certainly interesting but are no substitute for sound data practices. Many of these technologies are well-intentioned and technically impressive — at least on paper or in controlled, research settings. But the reality is that these technologies are, more often than not, highly specific to narrow privacy and security challenges. They also tend to be unproven, having only been tested within highly controlled settings and not in real-world operational settings where practical constraints of scale, interoperability, and extensibility really matter.
Take the example of privacy-protective synthetic data generation. These are PETs that take sensitive information and, using a variety of techniques, synthesize or scramble the individual data points to make them unidentifiable as information about real people. Proponents of this technology have gone so far as to suggest that these synthesized datasets reduce the risk of privacy intrusions to zero. While the methods are interesting and can be effective in certain settings, such far-reaching claims have been roundly refuted by researchers (e.g., Machanavajjhala, Kifer, Abowd, Gehrke, and Vilhuber; Stadler, Oprisanu, and Troncoso; and Bellovin, Dutta, and Reitinger) who point to limited applicability and unavoidable privacy-utility trade-offs. A fully privacy-protective synthetic version of data will have to be so thoroughly distorted that the resulting data will have limited or no utility. Conversely, a fully useful version of synthetic data will need to preserve characteristics of the real data that will inherently carry re-identification risks.
Similar limitations have been noted with respect to other cutting-edge PETs, each serving as a reminder that in the world of PETs, there is no such thing as a free lunch.
Effective PETs should establish a system architecture that focuses not just on narrow interests or a specific privacy risk but on the full ecosystem and lifecycle of data management in complex real-world systems. In this more holistic setting, privacy risks may be better addressed through a combination of several interrelated and reinforcing technological safeguards. The following requirements include basic PETs that provide the foundation of a privacy-protective information system.
The solution must offer flexible and granular access permissions. Strong user access controls are foundational to any privacy-protective information system. These controls ensure that users only have access to precise subsets of data necessary for their responsibilities. Please see the previous RFx Blog post on Operational Security for more details on how to establish strong access permissions.
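As one illustration of what granular access permissions can mean in practice, the sketch below filters both rows and columns according to a user's role and denies access to unknown roles by default. It is a simplified sketch only: the roles, column names, and the `POLICIES` structure are hypothetical and do not describe any particular product's permission model.

```python
# Hypothetical access policy: for each role, the columns it may see and a
# predicate deciding which rows it may see. All names here are invented.
POLICIES = {
    "case_worker": {
        "columns": ["case_id", "region", "status", "notes"],
        "row_allowed": lambda row, user: row["region"] == user["region"],
    },
    "analyst": {
        "columns": ["region", "status"],
        "row_allowed": lambda row, user: True,
    },
}

def visible_rows(rows, user):
    """Return only the rows and columns the user's role permits (deny by default)."""
    policy = POLICIES.get(user["role"])
    if policy is None:
        return []
    return [
        {column: row[column] for column in policy["columns"]}
        for row in rows
        if policy["row_allowed"](row, user)
    ]

cases = [
    {"case_id": 1, "region": "north", "status": "open", "notes": "home visit due"},
    {"case_id": 2, "region": "south", "status": "closed", "notes": "resolved"},
]

print(visible_rows(cases, {"role": "analyst", "region": "north"}))
print(visible_rows(cases, {"role": "case_worker", "region": "north"}))
print(visible_rows(cases, {"role": "contractor", "region": "north"}))
```

The analyst sees every case but never the identifying columns, the case worker sees full detail only for their own region, and the unrecognized role sees nothing.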
The solution must offer flexible and granular action permissions. The system should provide administrators with controls to restrict permissions to conduct potentially sensitive actions, such as importing, exporting, transferring, or combining data to those users who absolutely need to do so. The ability to implement action permissions generally requires a related set of data marking capabilities (persistently tagging sensitive datasets to clearly indicate their sensitivity, and to restrict actions such as joining them with datasets bearing other markings that may be risky in combination).
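The relationship between action permissions and data markings can be sketched as follows. This is an illustrative assumption about how such checks might be structured, not a description of a specific system: the marking names, the `RISKY_COMBINATIONS` rule, and the per-user grants are all hypothetical.

```python
# Hypothetical markings, grants, and rules, invented for illustration.
RISKY_COMBINATIONS = [frozenset({"HEALTH", "CONTACT"})]
ACTION_GRANTS = {
    "alice": {"join", "export"},
    "bob": {"join"},
}

def derived_markings(*input_markings):
    """A derived dataset inherits the union of its inputs' markings."""
    return frozenset().union(*input_markings)

def may_join(user, left, right):
    """Allow a join only if the user holds the action and the result is not a risky mix."""
    if "join" not in ACTION_GRANTS.get(user, set()):
        return False
    combined = derived_markings(left, right)
    return not any(risky <= combined for risky in RISKY_COMBINATIONS)

def may_export(user, markings, exportable=frozenset({"PUBLIC"})):
    """Allow export only for users holding the action, and only for exportable markings."""
    return "export" in ACTION_GRANTS.get(user, set()) and frozenset(markings) <= exportable

print(may_join("bob", {"HEALTH"}, {"REGION"}))    # permitted combination
print(may_join("bob", {"HEALTH"}, {"CONTACT"}))   # risky combination is refused
print(may_export("alice", derived_markings({"HEALTH"}, {"PUBLIC"})))
```

Because markings propagate to derived datasets, a dataset built from a sensitive input stays restricted even after it has been combined with public data.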
The solution must support a broad range of de-identification techniques, including generalization, aggregation, obfuscation on demand, obfuscation by default, dynamic minimization, and statistical anonymization. De-identification techniques cover a number of approaches that can be employed to minimize sensitive data exposure. It is critical that the system provide administrators with the flexibility to apply multiple de-identification techniques to the data; systems with more broadly defined anonymization options are inadequate. (As noted above, concepts like “anonymization” falsely assume that privacy risks can be removed by, for example, simply stripping out sensitive fields.) More specifically, de-identification techniques should include:
- Generalization: Reducing the granularity of information (e.g., converting Date of Birth to Age or Age Range).
- Aggregation: Grouping data about individuals together and continuing analysis at the aggregate level.
- Obfuscation on Demand: Hiding or disguising identifying data to unauthorized parties, perhaps by masking or encryption.
- Obfuscation by Default: Making data encrypted and unreadable by default. Users must enter an acceptable justification in order to decrypt necessary subsets of the data.
- Dynamic Minimization: Showing only parts of the data depending on the needs or role of the user.
- Statistical Anonymization: A set of statistical techniques such as k-anonymity, l-diversity, t-closeness, etc., used to transform a given dataset in such a way as to provide some mathematical representation or assurance of reduced privacy risk.
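To make a few of these techniques concrete, the sketch below applies generalization and masking to a small table, aggregates the result, and measures k-anonymity (the size of the smallest group of records sharing the same indirectly identifying values) before and after. It is a minimal illustration under assumptions: the data, field names, and helper functions are hypothetical, and a higher k reduces rather than eliminates re-identification risk.

```python
from collections import Counter
from datetime import date

def age_range(date_of_birth, today, width=10):
    """Generalization: replace a date of birth with a coarse age range."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    age = today.year - date_of_birth.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask(value, visible=2):
    """Obfuscation: keep the first `visible` characters and mask the rest."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

def k_anonymity(rows, quasi_identifiers):
    """Return k, the size of the smallest group sharing the same quasi-identifier values."""
    sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(sizes.values())

# Hypothetical raw records, invented for illustration.
raw = [
    {"postcode": "94110", "date_of_birth": date(1985, 6, 15)},
    {"postcode": "94117", "date_of_birth": date(1989, 12, 31)},
    {"postcode": "94103", "date_of_birth": date(1990, 3, 2)},
    {"postcode": "10001", "date_of_birth": date(1975, 1, 1)},
    {"postcode": "10027", "date_of_birth": date(1979, 11, 20)},
]
TODAY = date(2024, 1, 1)  # fixed reference date so the example is reproducible

generalized = [
    {
        "postcode": mask(row["postcode"], visible=3),
        "age_range": age_range(row["date_of_birth"], TODAY),
    }
    for row in raw
]

# Aggregation: continue analysis on group counts rather than on individual rows.
counts = Counter((row["postcode"], row["age_range"]) for row in generalized)

print(k_anonymity(raw, ["postcode", "date_of_birth"]))
print(k_anonymity(generalized, ["postcode", "age_range"]))
print(counts)
```

The trade-off discussed throughout this post is visible even here: the generalized table is harder to link back to individuals, but it can no longer answer questions that depend on exact postcodes or birth dates.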
The solution must include robust auditing capabilities. Audit logging capabilities empower oversight bodies to check and verify compliance with data governance policies around deidentified data. They also help organizations check that no spurious, malicious, or risky actions are undertaken.
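One way to picture how audit logging supports justification-based access is sketched below: every attempt to reveal an obfuscated field is recorded, whether or not it is allowed, and each entry is chained to the previous one by a hash so that later edits can be detected. The hash chaining is an assumption added for illustration rather than something this post prescribes, all class and function names are hypothetical, and a production log would also need timestamps, protected storage, and a real review process for justifications.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64

def _digest(body):
    """Hash an entry body deterministically."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

class AuditLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def record(self, user, action, resource, justification, allowed):
        previous_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {
            "user": user,
            "action": action,
            "resource": resource,
            "justification": justification,
            "allowed": allowed,
            "previous_hash": previous_hash,
        }
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Return True if no entry has been altered, removed, or reordered."""
        previous_hash = GENESIS_HASH
        for entry in self.entries:
            body = {key: value for key, value in entry.items() if key != "hash"}
            if entry["previous_hash"] != previous_hash or entry["hash"] != _digest(body):
                return False
            previous_hash = entry["hash"]
        return True

def reveal(log, user, resource, justification):
    """Gate access to an obfuscated field on a non-empty justification and log the attempt."""
    allowed = bool(justification.strip())
    log.record(user, "reveal", resource, justification, allowed)
    return allowed

audit_log = AuditLog()
print(reveal(audit_log, "alice", "patients.date_of_birth", "review for case 12"))
print(reveal(audit_log, "bob", "patients.date_of_birth", "   "))
print(audit_log.verify())
```

Recording refused attempts alongside granted ones gives oversight teams a fuller picture of how the data is being approached, not just how it was ultimately used.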
The solution must provide capabilities to “infer” sensitive data. This includes running background scans to infer the presence of sensitive data across the system, and automatically flagging and locking down sensitive data that has been uploaded accidentally or de-identified insufficiently.
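A very simple form of this kind of inference is pattern-based scanning of column values, sketched below. The patterns, column names, and the `scan_for_sensitive_values` helper are hypothetical, and the two regular expressions are deliberately crude; real detection would need far broader coverage and some handling of false positives.

```python
import re

# Hypothetical, deliberately simple patterns; real detectors need many more.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_sensitive_values(rows):
    """Return {column: {pattern_name: share_of_non_empty_values_that_match}}."""
    findings = {}
    columns = sorted({column for row in rows for column in row})
    for column in columns:
        values = [str(row[column]) for row in rows if row.get(column) not in (None, "")]
        if not values:
            continue
        for name, pattern in PATTERNS.items():
            matches = sum(1 for value in values if pattern.search(value))
            if matches:
                findings.setdefault(column, {})[name] = matches / len(values)
    return findings

# Hypothetical upload in which identifiers have slipped into unexpected columns.
uploaded = [
    {"id": "1", "notes": "call back later", "reference": "123-45-6789"},
    {"id": "2", "notes": "contact jane@example.org", "reference": "987-65-4321"},
    {"id": "3", "notes": "no answer", "reference": "n/a"},
]

print(scan_for_sensitive_values(uploaded))
```

A flagged column could then be locked down pending review, for example by applying a restrictive marking until a data governance team has inspected it.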
The solution must provide the ability to test and validate data before it is shared more widely. Even the most privacy-protective systems (and the operators/administrators that work with them) make errors when de-identifying data. For this reason, systems should provide the ability to perform data validations to test deidentified data before it is shared more widely within the system or exported for external use.
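A validation step of this kind can be as simple as a set of checks that must all pass before a dataset is released. The sketch below checks an aggregated table for leftover identifier columns and for groups that are too small to share safely; the column names, the threshold, and the `validate_for_sharing` function are hypothetical choices made for illustration.

```python
def validate_for_sharing(rows, forbidden_columns, count_column, min_count):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    present = {column for row in rows for column in row}
    for column in sorted(present & set(forbidden_columns)):
        problems.append(f"forbidden column present: {column}")
    for index, row in enumerate(rows):
        if row[count_column] < min_count:
            problems.append(
                f"row {index}: count {row[count_column]} is below minimum {min_count}"
            )
    return problems

# Hypothetical aggregated output awaiting release.
regional_counts = [
    {"region": "north", "age_range": "30-39", "count": 42},
    {"region": "south", "age_range": "80-89", "count": 2},
]

problems = validate_for_sharing(
    regional_counts,
    forbidden_columns=["name", "date_of_birth"],
    count_column="count",
    min_count=5,
)
print(problems or "ok to share")
```

Returning every problem rather than stopping at the first one lets an administrator fix a dataset in a single pass before re-running the checks.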
The solution must provide comprehensive data lineage capabilities, including transparency into all data pipelines within the platform. Data lineage refers to the ability to see and understand how data flows through a data ecosystem. Tracking lineage lets users and administrators understand how data is flowing within the system. This carries many different benefits, but in the context of privacy protections it lets organizations know which users have access to which resources, at which levels of identifiable data, and for what purposes at different stages. Additional information about lineage can be found in the previous RFx Series post on Data Pipeline Version Control.
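As a rough illustration of how lineage supports privacy questions, the sketch below models a pipeline as a graph of datasets and walks it to find everything derived from a sensitive source, along with who can read each derivative. The dataset names, the user names, and the two dictionaries are hypothetical, and real lineage systems track far more detail (transformations, versions, purposes) than this.

```python
from collections import deque

# Hypothetical pipeline: each dataset maps to the datasets derived directly from it.
DERIVED = {
    "raw_patients": ["patients_deidentified", "billing_joined"],
    "patients_deidentified": ["regional_counts"],
    "billing_joined": [],
    "regional_counts": ["public_dashboard"],
}

# Hypothetical read access per dataset.
READERS = {
    "raw_patients": {"dana"},
    "patients_deidentified": {"dana", "eli"},
    "billing_joined": {"finn"},
    "regional_counts": {"eli", "gus"},
    "public_dashboard": {"gus", "hana"},
}

def downstream(graph, source):
    """Breadth-first search for every dataset derived, directly or not, from `source`."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def readers_of_derivatives(graph, readers, source):
    """Map each downstream dataset to the sorted list of users who can read it."""
    return {
        dataset: sorted(readers.get(dataset, set()))
        for dataset in sorted(downstream(graph, source))
    }

print(readers_of_derivatives(DERIVED, READERS, "raw_patients"))
```

A report like this makes it possible to spot, for instance, that a user with no access to the raw data can still read a joined derivative that may carry much of the same identifying detail.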
The following criteria can be used to evaluate PETs for varying data ecosystems:
- Complexity of Implementation and Maintenance: The technical sophistication of PETs yields a number of challenges for organizations attempting to implement and enforce them across their enterprise. Exotic PETs are often difficult to implement and maintain without considerable in-house expertise and/or governance and orchestration infrastructure. Effective PETs governance may require semi-automation or other technical controls over the processes and policies supported by the PETs, which may add further layers of technical complexity. The setup costs required to establish PETs may also be prohibitive for certain organizations that would otherwise like to deploy PETs. The expertise to deploy PETs may also come at a high cost as the required skill sets are highly specialized and in limited supply.
- Configurability: While PETs may be marketed as one-size-fits-all, to claim that any technology will simply work out-of-the-box is often misleading. Because privacy risks, data assets, vulnerabilities, and infrastructure vary from domain to domain, effective privacy-enhancing tools must often be contextually adapted to the particular processing and threat circumstances (e.g., the specifics of particular ontologies, data, use cases, attack vectors). PETs that do not support some level of configurability may not actually be well-suited to address the particular privacy challenges of a given data ecosystem.
- Interoperability: Because it is rarely the case that a single PET will serve as a silver bullet for all privacy risks, programs should consider whether and how well their adopted PETs will interoperate with other PETs, as well as with other governance and oversight processes on which the system depends. Moreover, PETs are most interoperable when they also lend themselves to clear communication across the range of relevant stakeholders in the privacy-promoting space (e.g., operations, IT, legal). Effective PETs should be able to be integrated within a large enterprise and tailored for the variety of relevant stakeholders in ways that promote, rather than hinder, an understanding of how they work and how they protect privacy. See the previous RFx Series post on Interoperability for more information.
- Scalability and Computational Requirements: Programs should consider whether the implementation of a given PET may incur additional hardware or compute costs. Some PETs may rely on more sophisticated (and computationally intensive) implementations that may not scale readily or in cost-effective ways to real-world operations. These challenges may not always be apparent upfront as programs evaluate specific PETs in controlled test environments.
- Adaptability: Programs should assess whether proposed PET solutions are capable of flexing to anticipated future changes and demands, especially in dynamic, changing environments. PETs that are highly context dependent and brittle to significant changes may not be viable for long-term use.
Animal trainers often remark “there are no bad pets, only bad owners.” That’s roughly the case with Privacy-Enhancing Technologies (PETs) as well. The greatest challenge to the successful adoption of PETs is when programs opt for exotic, novel technologies based on misunderstandings of either their privacy requirements or what these tools can deliver in a real-world environment. Before looking to exotic PETs, organizations should consider how such innovative capabilities will be used within the existing organizational governance structure. They should also consider whether the desired tools will, in fact, scale well to complex, changing, expansive real-world application environments. And they should always build upon proven, basic PETs to address their identified privacy goals.
When considering PETs, organizations should also think carefully about who in oversight or governance roles will take responsibility to “own” the implementation and enforce use of PETs. PETs introduce a paradox: many of the most effective PETs work best when they augment human review and oversight, but without such oversight teams in the first place it is hard to implement PET-driven protections. This demonstrates the socio-technical nature of data governance and privacy, where both technical and human controls are required to uphold data governance effectively and with due regard for the context of the data and the projects built upon it.
* We understand that evaluation of harms and risks may not be the only lens through which to evaluate PETs. Other data privacy considerations include self-determination, individual well-being, social benefit, humanitarian principles, equitability, to name a few. But we focus on this minimal required view for the purposes of brevity.
Privacy-Enhancing Technologies (PETs): An adoption guide (Palantir RFx Blog Series, #6) was originally published in Palantir Blog on Medium.