In today’s digital landscape, the protection of personally identifiable information (PII) is not just a regulatory requirement, but a cornerstone of consumer trust and business integrity. Organizations use advanced natural language detection services like Amazon Lex for building conversational interfaces and Amazon CloudWatch for monitoring and analyzing operational data.
One risk many organizations face is the inadvertent exposure of sensitive data through logs, voice chat transcripts, and metrics. This risk is exacerbated by the increasing sophistication of cyber threats and the stringent penalties associated with data protection violations. Dealing with massive datasets is not just about identifying and categorizing PII. The challenge also lies in implementing robust mechanisms to obfuscate and redact this sensitive data. At the same time, it’s crucial to make sure these security measures don’t undermine the functionality and analytics critical to business operations.
This post addresses this pressing pain point, offering prescriptive guidance on safeguarding PII through detection and masking techniques specifically tailored for environments using Amazon Lex and CloudWatch Logs.
To address this critical challenge, our solution uses the slot obfuscation feature in Amazon Lex and the data protection capabilities of CloudWatch Logs, tailored specifically for detecting and protecting PII in logs.
In Amazon Lex, slots are used to capture and store user input during a conversation. Slots are placeholders within an intent that represent an action the user wants to perform. For example, in a flight booking bot, slots might include departure city, destination city, and travel dates. Slot obfuscation makes sure any information collected through Amazon Lex conversational interfaces, such as names, addresses, or any other PII entered by users, is obfuscated at the point of capture. This method reduces the risk of sensitive data exposure in chat logs and playbacks.
In CloudWatch Logs, data protection and custom identifiers add an additional layer of security by enabling the masking of PII within session attributes, input transcripts, and other sensitive log data that is specific to your organization.
This approach minimizes the footprint of sensitive information across these services and helps with compliance with data protection regulations.
In the following sections, we demonstrate how to identify and classify your data, locate your sensitive data, and finally monitor and protect it, both in transit and at rest, especially in areas where it may inadvertently appear. The following are the four ways to do this:
The first step is to identify and classify the data flowing through your systems. This involves understanding the types of information processed and determining their sensitivity level.
To determine all the slots in an intent in Amazon Lex, complete the following steps:
After you identify the slots within the intent, it’s important to classify them according to their sensitivity level and the potential impact of unauthorized access or disclosure. For example, you may have the following data types:
Email address and physical mailing address are often considered a medium classification level. Sensitive data, such as name, account number, and phone number, should be tagged with a high classification level, indicating the need for stringent security measures. These guidelines can help with systematically evaluating data.
After you classify the data, the next step is to locate where this data resides or is processed in your systems and applications. For services involving Amazon Lex and CloudWatch, it’s crucial to identify all data stores and their roles in handling PII.
CloudWatch captures logs generated by Amazon Lex, including interaction logs that might contain PII. Regular audits and monitoring of these logs are essential to detect any unauthorized access or anomalies in data handling.
Amazon S3 is often used in conjunction with Amazon Lex for storing call recordings or transcripts, which may contain sensitive information. Making sure these storage buckets are properly configured with encryption, access controls, and lifecycle policies are vital to protect the stored data.
Organizations can create a robust framework for protection by identifying and classifying data, along with pinpointing the data stores (like CloudWatch and Amazon S3). This framework should include regular audits, access controls, and data encryption to prevent unauthorized access and comply with data protection laws.
In this section, we demonstrate how to protect your data with Amazon Lex using slot obfuscation and selective conversation log capture.
Sensitive information can appear in the input transcripts of conversation logs. It’s essential to implement mechanisms that detect and mask or redact PII in these transcripts before they are stored or logged.
In the development of conversational interfaces using Amazon Lex, safeguarding PII is crucial to maintain user privacy and comply with data protection regulations. Slot obfuscation provides a mechanism to automatically obscure PII within conversation logs, making sure sensitive information is not exposed. When configuring an intent within an Amazon Lex bot, developers can mark specific slots—placeholders for user-provided information—as obfuscated. This setting tells Amazon Lex to replace the actual user input for these slots with a placeholder in the logs. For instance, enabling obfuscation for slots designed to capture sensitive information like account numbers or phone numbers makes sure any matching input is masked in the conversation log. Slot obfuscation allows developers to significantly reduce the risk of inadvertently logging sensitive information, thereby enhancing the privacy and security of the conversational application. It’s a best practice to identify and mark all slots that could potentially capture PII during the bot design phase to provide comprehensive protection across the conversation flow.
To enable obfuscation for a slot from the Amazon Lex console, complete the following steps:
Amazon Lex offers capabilities to select how conversation logs are captured with text and audio data from live conversations by enabling the filtering of certain types of information from the conversation logs. Through selective capture of necessary data, businesses can minimize the risk of exposing private or confidential information. Additionally, this feature can help organizations comply with data privacy regulations, because it gives more control over the data collected and stored. There is a choice between text, audio, or text and audio logs.
When selective conversation log capture is enabled for text and audio logs, it disables logging for all intents and slots in the conversation. To generate text and audio logs for particular intents and slots, set the text and audio selective conversation log capture session attributes for those intents and slots to “true”. When selective conversation log capture is enabled, any slot values in SessionState, Interpretations, and Transcriptions for which logging is not enabled using session attributes will be obfuscated in the generated text log.
To enable selective conversation log capture, complete the following steps:
Now selective conversation log capture for a slot is activated.
x-amz-lex:enable-audio-logging:<intent>:<slot> = "true"
x-amz-lex:enable-text-logging:<intent>:<slot> = "true"
Replace <intent> and <slot> with respective intent and slot names.
In this section, we demonstrate how to protect your data with CloudWatch using playbacks and log group policies.
When Amazon Lex engages in interactions, delivering prompts or messages from the bot to the customer, there’s a potential risk for PII to be inadvertently included in these communications. This risk extends to CloudWatch Logs, where these interactions are recorded for monitoring, debugging, and analysis purposes. The playback of prompts or messages designed to confirm or clarify user input can inadvertently expose sensitive information if not properly handled. To mitigate this risk and protect PII within these interactions, a strategic approach is necessary when designing and deploying Amazon Lex bots.
The solution lies in carefully structuring how slot values, which may contain PII, are referenced and used in the bot’s response messages. Adopting a prescribed format for passing slot values, specifically by encapsulating them within curly braces (for example, {slotName}
), allows developers to control how this information is presented back to the user and logged in CloudWatch. This method makes sure that when the bot constructs a message, it refers to the slot by its name rather than its value, thereby preventing any sensitive information from being directly included in the message content. For example, instead of the bot saying, “Is your phone number 123-456-7890? ” it would use a generic placeholder, “Is your phone number {PhoneNumber}? ” with {PhoneNumber}
being a reference to the slot that captured the user’s phone number. This approach allows the bot to confirm or clarify information without exposing the actual data.
When these interactions are logged in CloudWatch, the logs will only contain the slot name references, not the actual PII. This technique significantly reduces the risk of sensitive information being exposed in logs, enhancing privacy and compliance with data protection regulations. Organizations should make sure all personnel involved in bot design and deployment are trained on these practices to consistently safeguard user information across all interactions.
The following is a sample AWS Lambda function code in Python for referencing the slot value of a phone number provided by the user. SML tags are used to format the slot value to provide slow and clear speech output, and returning a response to confirm the correctness of the captured phone number:
Replace INTENT_NAME and SLOT_NAME with your preferred intent and slot names, respectively.
Sensitive data that’s ingested by CloudWatch Logs can be safeguarded by using log group data protection policies. These policies allow to audit and mask sensitive data that appears in log events ingested by the log groups in your account.
CloudWatch Logs supports both managed and custom data identifiers.
Managed data identifiers offer preconfigured data types to protect financial data, personal health information (PHI), and PII. For some types of managed data identifiers, the detection depends on also finding certain keywords in proximity with the sensitive data.
Each managed data identifier is designed to detect a specific type of sensitive data, such as name, email address, account numbers, AWS secret access keys, or passport numbers for a particular country or region. When creating a data protection policy, you can configure it to use these identifiers to analyze logs ingested by the log group, and take actions when they are detected.
CloudWatch Logs data protection can detect the categories of sensitive data by using managed data identifiers.
To configure managed data identifiers on the CloudWatch console, complete the following steps:
Custom data identifiers let you define your own custom regular expressions that can be used in your data protection policy. With custom data identifiers, you can target business-specific PII use cases that managed data identifiers don’t provide. For example, you can use custom data identifiers to look for a company-specific account number format.
To create a custom data identifier on the CloudWatch console, complete the following steps:
For details about the types of data that can be protected, refer to Types of data that you can protect.
In this section, we demonstrate how to protect your data in S3 buckets.
PII can often be captured in audio recordings, especially in sectors like customer service, healthcare, and financial services, where sensitive information is frequently exchanged over voice interactions. To comply with domain-specific regulatory requirements, organizations must adopt stringent measures for managing PII in audio files.
One approach is to disable the recording feature entirely if it poses too high a risk of non-compliance or if the value of the recordings doesn’t justify the potential privacy implications. However, if audio recordings are essential, streaming the audio data in real time using Amazon Kinesis provides a scalable and secure method to capture, process, and analyze audio data. This data can then be exported to a secure and compliant storage solution, such as Amazon S3, which can be configured to meet specific compliance needs including encryption at rest. You can use AWS KMS or AWS CloudHSM to manage encryption keys, offering robust mechanisms to encrypt audio files at rest, thereby securing the sensitive information they might contain. Implementing these encryption measures makes sure that even if data breaches occur, the encrypted PII remains inaccessible to unauthorized parties.
Configuring these AWS services allows organizations to balance the need for audio data capture with the imperative to protect sensitive information and comply with regulatory standards.
You can use an AWS CloudFormation template to configure various security settings for an S3 bucket that stores Amazon Lex data like audio recordings and logs. For more information, see Creating a stack on the AWS CloudFormation console. See the following example code:
The template defines the following properties:
This is just an example; you may need to adjust the configurations based on your specific requirements and security best practices.
In this section, we demonstrate how to protect your data with using a Service Control Policy (SCP). To create an SCP, see Creating an SCP.
To prevent changes to an Amazon Lex chatbot using an SCP, create one that denies the specific actions related to modifying or deleting the chatbot. For example, you could use the following SCP:
The code defines the following:
When this SCP is attached to an AWS Organizations organizational unit (OU) or an individual AWS account, it will allow only the specified provisioning role while preventing all other IAM entities (users, roles, or groups) within that OU or account from modifying or deleting the specified Amazon Lex bot, intents, and slot types.
This SCP only prevents changes to the Amazon Lex bot and its components. It doesn’t restrict other actions, such as invoking the bot or retrieving its configuration. If more actions need to be restricted, you can add them to the Action list in the SCP.
To prevent changes to a CloudWatch Logs log group using an SCP, create one that denies the specific actions related to modifying or deleting the log group. The following is an example SCP that you can use:
The code defines the following:
logs:DeleteLogGroup
and logs:PutRetentionPolicy
actions, which prevent deleting the log group and modifying its retention policy, respectively.Similar to the preceding chatbot SCP, when this SCP is attached to an Organizations OU or an individual AWS account, it will allow only the specified provisioning role to delete the specified CloudWatch Logs log group or modify its retention policy, while preventing all other IAM entities (users, roles, or groups) within that OU or account from performing these actions.
This SCP only prevents changes to the log group itself and its retention policy. It doesn’t restrict other actions, such as creating or deleting log streams within the log group or modifying other log group configurations. To restrict additional actions, add it to the Action list in the SCP.
Also, this SCP will apply to all log groups that match the specified resource ARN pattern. To target a specific log group, modify the Resource value accordingly.
When you create a data protection policy, by default, any sensitive data that matches the data identifiers you’ve selected is masked at all egress points, including CloudWatch Logs Insights, metric filters, and subscription filters. Only users who have the logs:Unmask
IAM permission can view unmasked data. The following is an SCP you can use:
It defines the following:
logs:Unmask
, which prevents viewing of masked data.Similar to the previous SCPs, when this SCP is attached to an Organizations OU or an individual AWS account, it will allow only the specified provisioning role while preventing all other IAM entities (users, roles, or groups) within that OU or account from unmasking sensitive data from the CloudWatch Logs log group.
Similar to the previous log group service control policy, this SCP only prevents changes to the log group itself and its retention policy. It doesn’t restrict other actions such as creating or deleting log streams within the log group or modifying other log group configurations. To restrict additional actions, add them to the Action list in the SCP.
Also, this SCP will apply to all log groups that match the specified resource ARN pattern. To target a specific log group, modify the Resource value accordingly.
To avoid incurring additional charges, clean up your resources:
Securing PII within AWS services like Amazon Lex and CloudWatch requires a comprehensive and proactive approach. By following the steps in this post—identifying and classifying data, locating data stores, monitoring and protecting data in transit and at rest, and implementing SCPs for Amazon Lex and Amazon CloudWatch—organizations can create a robust security framework. This framework not only protects sensitive data, but also complies with regulatory standards and mitigates potential risks associated with data breaches and unauthorized access.
Emphasizing the need for regular audits, continuous monitoring, and updating security measures in response to emerging threats and technological advancements is crucial. Adopting these practices allows organizations to safeguard their digital assets, maintain customer trust, and build a reputation for strong data privacy and security in the digital landscape.
Dipkumar Mehta is a Principal Consultant with the Amazon ProServe Natural Language AI team. He focuses on helping customers design, deploy, and scale end-to-end Conversational AI solutions in production on AWS. He is also passionate about improving customer experience and driving business outcomes by leveraging data. Additionally, Dipkumar has a deep interest in Generative AI, exploring its potential to revolutionize various industries and enhance AI-driven applications.
These beard tools deliver a quality trim for all types of facial hair.
Artificial intelligence (AI) research, particularly in the machine learning (ML) domain, continues to increase the…
Training large language models (LLMs) models has become a significant expense for businesses. For many…
o3 solved one of the most difficult AI challenges, scoring 75.7% on the ARC-AGI benchmark.…
The Trump transition team is looking for “big changes” at NASA—including some cuts.
A new artificial intelligence (AI) model has just achieved human-level results on a test designed…