One of the most useful application patterns for generative AI workloads is Retrieval Augmented Generation (RAG). In the RAG pattern, we find pieces of reference content related to an input prompt by performing similarity searches on embeddings. Embeddings capture the information content in bodies of text, allowing natural language processing (NLP) models to work with language in a numeric form. Embeddings are just vectors of floating point numbers, so we can analyze them to help answer three important questions: Is our reference data changing over time? Are the questions users are asking changing over time? And finally, how well is our reference data covering the questions being asked?
In this post, you’ll learn about some of the considerations for embedding vector analysis and detecting signals of embedding drift. Because embeddings are an important source of data for NLP models in general and generative AI solutions in particular, we need a way to measure whether our embeddings are changing over time (drifting). In this post, you’ll see an example of performing drift detection on embedding vectors using a clustering technique with large language models (LLMS) deployed from Amazon SageMaker JumpStart. You’ll also be able to explore these concepts through two provided examples, including an end-to-end sample application or, optionally, a subset of the application.
The RAG pattern lets you retrieve knowledge from external sources, such as PDF documents, wiki articles, or call transcripts, and then use that knowledge to augment the instruction prompt sent to the LLM. This allows the LLM to reference more relevant information when generating a response. For example, if you ask an LLM how to make chocolate chip cookies, it can include information from your own recipe library. In this pattern, the recipe text is converted into embedding vectors using an embedding model, and stored in a vector database. Incoming questions are converted to embeddings, and then the vector database runs a similarity search to find related content. The question and the reference data then go into the prompt for the LLM.
Let’s take a closer look at the embedding vectors that get created and how to perform drift analysis on those vectors.
Embedding vectors are numeric representations of our data so analysis of these vectors can provide insight into our reference data that can later be used to detect potential signals of drift. Embedding vectors represent an item in n-dimensional space, where n is often large. For example, the GPT-J 6B model, used in this post, creates vectors of size 4096. To measure drift, assume that our application captures embedding vectors for both reference data and incoming prompts.
We start by performing dimension reduction using Principal Component Analysis (PCA). PCA tries to reduce the number of dimensions while preserving most of the variance in the data. In this case, we try to find the number of dimensions that preserves 95% of the variance, which should capture anything within two standard deviations.
Then we use K-Means to identify a set of cluster centers. K-Means tries to group points together into clusters such that each cluster is relatively compact and the clusters are as distant from each other as possible.
We calculate the following information based on the clustering output shown in the following figure:
Additionally, we look at the proportion (higher or lower) of samples in each cluster, as shown in the following figure.
Finally, we use this analysis to calculate the following:
We can periodically capture this information for snapshots of the embeddings for both the source reference data and the prompts. Capturing this data allows us to analyze potential signals of embedding drift.
Periodically, we can compare the clustering information through snapshots of the data, which includes the reference data embeddings and the prompt embeddings. First, we can compare the number of dimensions needed to explain 95% of the variation in the embedding data, the inertia, and the silhouette score from the clustering job. As you can see in the following table, compared to a baseline, the latest snapshot of embeddings requires 39 more dimensions to explain the variance, indicating that our data is more dispersed. The inertia has gone up, indicating that the samples are in aggregate farther away from their cluster centers. Additionally, the silhouette score has gone down, indicating that the clusters are not as well defined. For prompt data, that might indicate that the types of questions coming into the system are covering more topics.
Next, in the following figure, we can see how the proportion of samples in each cluster has changed over time. This can show us whether our newer reference data is broadly similar to the previous set, or covers new areas.
Finally, we can see if the cluster centers are moving, which would show drift in the information in the clusters, as shown in the following table.
We can also evaluate how well our reference data aligns to the incoming questions. To do this, we assign each prompt embedding to a reference data cluster. We compute the distance from each prompt to its corresponding center, and look at the mean, median, and standard deviation of those distances. We can store that information and see how it changes over time.
The following figure shows an example of analyzing the distance between the prompt embedding and reference data centers over time.
As you can see, the mean, median, and standard deviation distance statistics between prompt embeddings and reference data centers is decreasing between the initial baseline and the latest snapshot. Although the absolute value of the distance is difficult to interpret, we can use the trends to determine if the semantic overlap between reference data and incoming questions is getting better or worse over time.
In order to gather the experimental results discussed in the previous section, we built a sample application that implements the RAG pattern using embedding and generation models deployed through SageMaker JumpStart and hosted on Amazon SageMaker real-time endpoints.
The application has three core components:
The following diagram illustrates the end-to-end architecture.
The full sample code is available on GitHub. The provided code is available in two different patterns:
To create the provided patterns, there are several prerequisites detailed in the following sections, starting with deploying the generative and text embedding models then moving on to the additional prerequisites.
Both patterns assume the deployment of an embedding model and generative model. For this, you’ll deploy two models from SageMaker JumpStart. The first model, GPT-J 6B, is used as the embedding model and the second model, Falcon-40b, is used for text generation.
You can deploy each of these models through SageMaker JumpStart from the AWS Management Console, Amazon SageMaker Studio, or programmatically. For more information, refer to How to use JumpStart foundation models. To simplify the deployment, you can use the provided notebook derived from notebooks automatically created by SageMaker JumpStart. This notebook pulls the models from the SageMaker JumpStart ML hub and deploys them to two separate SageMaker real-time endpoints.
The sample notebook also has a cleanup section. Don’t run that section yet, because it will delete the endpoints just deployed. You will complete the cleanup at the end of the walkthrough.
After confirming successful deployment of the endpoints, you’re ready to deploy the full sample application. However, if you’re more interested in exploring only the backend and analysis notebooks, you can optionally deploy only that, which is covered in the next section.
This pattern allows you to deploy the backend solution only and interact with the solution using a Jupyter notebook. Use this pattern if you don’t want to build out the full frontend interface.
You should have the following prerequisites:
Use the deployment parameters noted in the previous section to deploy the AWS CDK stack. For more information about AWS CDK installation, refer to Getting started with the AWS CDK.
Make sure that Docker is installed and running on the workstation that will be used for AWS CDK deployment. Refer to Get Docker for additional guidance.
Alternatively, you can enter the context values in a file called cdk.context.json
in the pattern1-rag/cdk
directory and run cdk deploy BackendStack --exclusively
.
The deployment will print out outputs, some of which will be needed to run the notebook. Before you can start question and answering, embed the reference documents, as shown in the next section.
For this RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.
An Amazon Elastic Compute Cloud (Amazon EC2) instance has been created for the PDF document ingestion and an Amazon Elastic File System (Amazon EFS) file system is mounted on the EC2 instance to save the PDF documents. An AWS DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.
To ingest the reference documents, complete the following steps:
JumpHostId
) and connect using Session Manager, a capability of AWS Systems Manager. For instructions, refer to Connect to your Linux instance with AWS Systems Manager Session Manager./mnt/efs/fs1
, which is where the EFS file system is mounted, and create a folder called ingest
: ingest
directory.The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.
The DataSync task runs on an hourly schedule; you can optionally start the task manually to start the embedding process immediately for the PDF documents you added.
DataSyncTaskID
and start the task with defaults.After the embeddings are created, you can start the RAG question and answering through a Jupyter notebook, as shown in the next section.
Complete the following steps:
NotebookInstanceName
and connect to JupyterLab from the SageMaker console.fmops/full-stack/pattern1-rag/notebooks/
.query-llm.ipynb
in the notebook instance to perform question and answering using RAG.Make sure to use the conda_python3
kernel for the notebook.
This pattern is useful to explore the backend solution without needing to provision additional prerequisites that are required for the full-stack application. The next section covers the implementation of a full-stack application, including both the frontend and backend components, to provide a user interface for interacting with your generative AI application.
This pattern allows you to deploy the solution with a user frontend interface for question and answering.
To deploy the sample application, you must have the following prerequisites:
example.com
.example.com
and *.example.com
for all subdomains. For instructions, refer to Requesting a public certificate. This certificate is used to configure HTTPS on Amazon CloudFront and the origin load balancer.app.example.com
. app-lb.example.com
.ZXXXXXXXXYYYYYYYYY
.example.com
.Use the deployment parameters you noted in the prerequisites to deploy the AWS CDK stack. For more information, refer to Getting started with the AWS CDK.
Make sure Docker is installed and running on the workstation that will be used for the AWS CDK deployment.
In the preceding code, -c represents a context value, in the form of the required prerequisites, provided on input. Alternatively, you can enter the context values in a file called cdk.context.json
in the pattern1-rag/cdk
directory and run cdk deploy --all
.
Note that we specify the Region in the file bin/cdk.ts
. Configuring ALB access logs requires a specified Region. You can change this Region before deployment.
The deployment will print out the URL to access the Streamlit application. Before you can start question and answering, you need to embed the reference documents, as shown in the next section.
For a RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.
As we discussed in the first deployment option, an example EC2 instance has been created for the PDF document ingestion and an EFS file system is mounted on the EC2 instance to save the PDF documents. A DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.
To ingest the reference documents, complete the following steps:
JumpHostId
) and connect using Session Manager./mnt/efs/fs1
, which is where the EFS file system is mounted, and create a folder called ingest
: ingest
directory.The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.
The DataSync task runs on an hourly schedule. You can optionally start the task manually to start the embedding process immediately for the PDF documents you added.
DataSyncTaskID
and start the task with defaults.After the reference documents have been embedded, you can start the RAG question and answering by visiting the URL to access the Streamlit application. An Amazon Cognito authentication layer is used, so it requires creating a user account in the Amazon Cognito user pool deployed via the AWS CDK (see the AWS CDK output for the user pool name) for first-time access to the application. For instructions on creating an Amazon Cognito user, refer to Creating a new user in the AWS Management Console.
In this section, we show you how to perform drift analysis by first creating a baseline of the reference data embeddings and prompt embeddings, and then creating a snapshot of the embeddings over time. This allows you to compare the baseline embeddings to the snapshot embeddings.
To create an embedding baseline of the reference data, open the AWS Glue console and select the ETL job embedding-drift-analysis
. Set the parameters for the ETL job as follows and run the job:
--job_type
to BASELINE
.--out_table
to the Amazon DynamoDB table for reference embedding data. (See the AWS CDK output DriftTableReference
for the table name.)--centroid_table
to the DynamoDB table for reference centroid data. (See the AWS CDK output CentroidTableReference
for the table name.)--data_path
to the S3 bucket with the prefix; for example, s3://
<REPLACE_WITH_BUCKET_NAME>/embeddingarchive/
. (See the AWS CDK output BucketName
for the bucket name.)Similarly, using the ETL job embedding-drift-analysis
, create an embedding baseline of the prompts. Set the parameters for the ETL job as follows and run the job:
--job_type
to BASELINE
--out_table
to the DynamoDB table for prompt embedding data. (See the AWS CDK output DriftTablePromptsName
for the table name.)--centroid_table
to the DynamoDB table for prompt centroid data. (See the AWS CDK output CentroidTablePrompts
for the table name.) --data_path
to the S3 bucket with the prefix; for example, s3://
<REPLACE_WITH_BUCKET_NAME>/promptarchive/
. (See the AWS CDK output BucketName
for the bucket name.)After you ingest additional information into OpenSearch Service, run the ETL job embedding-drift-analysis
again to snapshot the reference data embeddings. The parameters will be the same as the ETL job that you ran to create the embedding baseline of the reference data as shown in the previous section, with the exception of setting the --job_type
parameter to SNAPSHOT
.
Similarly, to snapshot the prompt embeddings, run the ETL job embedding-drift-analysis
again. The parameters will be the same as the ETL job that you ran to create the embedding baseline for the prompts as shown in the previous section, with the exception of setting the --job_type
parameter to SNAPSHOT
.
To compare the embedding baseline and snapshot for reference data and prompts, use the provided notebook pattern1-rag/notebooks/drift-analysis.ipynb
.
To look at embedding comparison for reference data or prompts, change the DynamoDB table name variables (tbl
and c_tbl
) in the notebook to the appropriate DynamoDB table for each run of the notebook.
The notebook variable tbl
should be changed to the appropriate drift table name. The following is an example of where to configure the variable in the notebook.
The table names can be retrieved as follows:
DriftTableReference
DriftTablePromptsName
In addition, the notebook variable c_tbl
should be changed to the appropriate centroid table name. The following is an example of where to configure the variable in the notebook.
The table names can be retrieved as follows:
CentroidTableReference
CentroidTablePrompts
First, run the AWS Glue job embedding-distance-analysis
. This job will find out which cluster, from the K-Means evaluation of the reference data embeddings, that each prompt belongs to. It then calculates the mean, median, and standard deviation of the distance from each prompt to the center of the corresponding cluster.
You can run the notebook pattern1-rag/notebooks/distance-analysis.ipynb
to see the trends in the distance metrics over time. This will give you a sense of the overall trend in the distribution of the prompt embedding distances.
The notebook pattern1-rag/notebooks/prompt-distance-outliers.ipynb
is an AWS Glue notebook that looks for outliers, which can help you identify whether you’re getting more prompts that are not related to the reference data.
All similarity scores from OpenSearch Service are logged in Amazon CloudWatch under the rag
namespace. The dashboard RAG_Scores
shows the average score and the total number of scores ingested.
To avoid incurring future charges, delete all the resources that you created.
Reference the cleanup up section of the provided example notebook to delete the deployed SageMaker JumpStart models, or you can delete the models on the SageMaker console.
If you entered your parameters in a cdk.context.json
file, clean up as follows:
If you entered your parameters on the command line and only deployed the backend application (the backend AWS CDK stack), clean up as follows:
If you entered your parameters on the command line and deployed the full solution (the frontend and backend AWS CDK stacks), clean up as follows:
In this post, we provided a working example of an application that captures embedding vectors for both reference data and prompts in the RAG pattern for generative AI. We showed how to perform clustering analysis to determine whether reference or prompt data is drifting over time, and how well the reference data covers the types of questions users are asking. If you detect drift, it can provide a signal that the environment has changed and your model is getting new inputs that it may not be optimized to handle. This allows for proactive evaluation of the current model against changing inputs.
Jasper Research Lab’s new shadow generation research and model enable brands to create more photorealistic…
We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini…
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response…
This post is co-written with Martin Holste from Trellix. Security teams are dealing with an…
As AI continues to unlock new opportunities for business growth and societal benefits, we’re working…
An internal email obtained by WIRED shows that NOAA workers received orders to pause “ALL…