ML 10196 image001
The rise of text and semantic search engines has made ecommerce and retail businesses search easier for its consumers. Search engines powered by unified text and image can provide extra flexibility in search solutions. You can use both text and images as queries. For example, you have a folder of hundreds of family pictures in your laptop. You want to quickly find a picture that was taken when you and your best friend were in front of your old house’s swimming pool. You can use conversational language like “two people stand in front of a swimming pool” as a query to search in a unified text and image search engine. You don’t need to have the right keywords in image titles to perform the query.
Amazon OpenSearch Service now supports the cosine similarity metric for k-NN indexes. Cosine similarity measures the cosine of the angle between two vectors, where a smaller cosine angle denotes a higher similarity between the vectors. With cosine similarity, you can measure the orientation between two vectors, which makes it a good choice for some specific semantic search applications.
Contrastive Language-Image Pre-Training (CLIP) is a neural network trained on a variety of image and text pairs. The CLIP neural network is able to project both images and text into the same latent space, which means that they can be compared using a similarity measure, such as cosine similarity. You can use CLIP to encode your products’ images or description into embeddings, and then store them into an OpenSearch Service k-NN index. Then your customers can query the index to retrieve products that they’re interested in.
You can use CLIP with Amazon SageMaker to perform encoding. Amazon SageMaker Serverless Inference is a purpose-built inference service that makes it easy to deploy and scale machine learning (ML) models. With SageMaker, you can deploy serverless for dev and test, and then move to real-time inference when you go to production. SageMaker serverless helps you save cost by scaling down infrastructure to 0 during idle times. This is perfect for building a POC, where you will have long idle times between development cycles. You can also use Amazon SageMaker batch transform to get inferences from large datasets.
In this post, we demonstrate how to build a search application using CLIP with SageMaker and OpenSearch Service. The code is open source, and it is hosted on GitHub.
OpenSearch Service provides text-matching and embedding k-NN search. We use embedding k-NN search in this solution. You can use both image and text as a query to search items from the inventory. Implementing this unified image and text search application consists of two phases:
The following figure illustrates the solution architecture.
The workflow steps are as follows:
We use Amazon SageMaker Studio notebooks (not shown in the diagram) as the integrated development environment (IDE) to develop the solution.
To set up the solution, complete the following steps:
You can also use an AWS CloudFormation template by following the GitHub instructions to create a domain.
You can connect Studio to Amazon S3 from Amazon Virtual Private Cloud (Amazon VPC) using an interface endpoint in your VPC, instead of connecting over the internet. By using an interface VPC endpoint (interface endpoint), the communication between your VPC and Studio is conducted entirely and securely within the AWS network. Your Studio notebook can connect to OpenSearch Service over a private VPC to ensure secure communication.
OpenSearch Service domains offer encryption of data at rest, which is a security feature that helps prevent unauthorized access to your data. Node-to-node encryption provides an additional layer of security on top of the default features of OpenSearch Service. Amazon S3 automatically applies server-side encryption (SSE-S3) for each new object unless you specify a different encryption option.
In the OpenSearch Service domain, you can attach identity-based policies define who can access a service, which actions they can perform, and if applicable, the resources on which they can perform those actions.
This section discusses how to encode images and text into embeddings. This includes preparing data, creating a SageMaker model, and performing batch transform using the model.
You can use a SageMaker Studio notebook with a Python 3 (Data Science) kernel to run the sample code.
For this post, we use the Amazon Berkeley Objects Dataset. The dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalogue images. We only use the item images and item names in US English. For demo purposes, we use approximately 1,600 products. For more details about this dataset, refer to the README. The dataset is hosted in a public S3 bucket. There are 16 files that include product description and metadata of Amazon products in the format of listings/metadata/listings_<i>.json.gz. We use the first metadata file in this demo.
You use pandas to load the metadata, then select products that have US English titles from the data frame. Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language. You use an attribute called main_image_id to identify an image. See the following code:
There are 1,639 products in the data frame. Next, link the item names with the corresponding item images. images/metadata/images.csv.gz contains image metadata. This file is a gzip-compressed CSV file with the following columns: image_id, height, width, and path. You can read the metadata file and then merge it with item metadata. See the following code:
You can use the SageMaker Studio notebook Python 3 kernel built-in PIL library to view a sample image from the dataset:
Next, create a SageMaker model from a pretrained CLIP model. The first step is to download the pre-trained model weighting file, put it into a model.tar.gz file, and upload it to an S3 bucket. The path of the pretrained model can be found in the CLIP repo. We use a pretrained ResNet-50 (RN50) model in this demo. See the following code:
You then need to provide an inference entry point script for the CLIP model. CLIP is implemented using PyTorch, so you use the SageMaker PyTorch framework. PyTorch is an open-source ML framework that accelerates the path from research prototyping to production deployment. For information about deploying a PyTorch model with SageMaker, refer to Deploy PyTorch Models. The inference code accepts two environment variables: MODEL_NAME and ENCODE_TYPE. This helps us switch between different CLIP model easily. We use ENCODE_TYPE to specify if we want to encode an image or a piece of text. Here, you implement the model_fn, input_fn, predict_fn, and output_fn functions to override the default PyTorch inference handler. See the following code:
The solution requires additional Python packages during model inference, so you can provide a requirements.txt file to allow SageMaker to install additional packages when hosting models:
You use the PyTorchModel class to create an object to contain the information of the model artifacts’ Amazon S3 location and the inference entry point details. You can use the object to create batch transform jobs or deploy the model to an endpoint for online inference. See the following code:
Next, we use the CLIP model to encode item images into embeddings, and use SageMaker batch transform to run batch inference.
Before creating the job, use the following code snippet to copy item images from the Amazon Berkeley Objects Dataset public S3 bucket to your own bucket. The operation takes less than 10 minutes.
Next, you perform inference on the item images in a batch manner. The SageMaker batch transform job uses the CLIP model to encode all the images stored in the input Amazon S3 location and uploads output embeddings to an output S3 folder. The job takes around 10 minutes.
Load embeddings from Amazon S3 to a variable, so you can ingest the data into OpenSearch Service later:
This section discusses how to create a search engine that that uses k-NN search with embeddings. This includes configuring an OpenSearch Service cluster, ingesting item embedding, and performing free text and image search queries.
Earlier, you created an OpenSearch cluster. Now you’re going to create an index to store the catalog data and embeddings. You can configure the index settings to enable the k-NN functionality using the following configuration:
This example uses the Python Elasticsearch client to communicate with the OpenSearch cluster and create an index to host your data. You can run %pip install elasticsearch in the notebook to install the library. See the following code:
You now loop through your dataset and ingest items data into the cluster. The data ingestion for this practice should finish within 60 seconds. It also runs a simple query to verify if the data has been ingested into the index successfully. See the following code:
Now that you have a working OpenSearch Service index that contains embeddings of item images as our inventory, let’s look at how you can generate embedding for queries. You need to create two SageMaker endpoints to handle text and image embeddings, respectively.
You also create two functions to use the endpoints to encode images and texts. For the encode_text function, you add this is before an item name to translate an item name to a sentence for item description. memory_size_in_mb is set as 6 GB to serve the underline Transformer and ResNet models. See the following code:
You can firstly plot the picture that will be used.
Let’s look at the results of a simple query. After retrieving results from OpenSearch Service, you get the list of item names and images from dataset:
The first item has a score of 1.0, because the two images are the same. Other items are different types of glasses in the OpenSearch Service index.
You can use text to query the index as well:
You’re now able to get three pictures of water glasses from the index. You can find the images and text within the same latent space with the CLIP encoder. Another example of this is to search for the word “pizza” in the index:
With a pay-per-use model, Serverless Inference is a cost-effective option for an infrequent or unpredictable traffic pattern. If you have a strict service-level agreement (SLA), or can’t tolerate cold starts, real-time endpoints are a better choice. Using multi-model or multi-container endpoints provide scalable and cost-effective solutions for deploying large numbers of models. For more information, refer to Amazon SageMaker Pricing.
We suggest deleting the serverless endpoints when they are no longer needed. After finishing this exercise, you can remove the resources with the following steps (you can delete these resources from the AWS Management Console, or using the AWS SDK or SageMaker SDK):
In this post, we demonstrated how to create a k-NN search application using SageMaker and OpenSearch Service k-NN index features. We used a pre-trained CLIP model from its OpenAI implementation.
The OpenSearch Service ingestion implementation of the post is only used for prototyping. If you want to ingest data from Amazon S3 into OpenSearch Service at scale, you can launch an Amazon SageMaker Processing job with the appropriate instance type and instance count. For another scalable embedding ingestion solution, refer to Novartis AG uses Amazon OpenSearch Service K-Nearest Neighbor (KNN) and Amazon SageMaker to power search and recommendation (Part 3/4).
CLIP provides zero-shot capabilities, which makes it possible to adopt a pre-trained model directly without using transfer learning to fine-tune a model. This simplifies the application of the CLIP model. If you have pairs of product images and descriptive text, you can fine-tune the model with your own data using transfer learning to further improve the model performance. For more information, see Learning Transferable Visual Models From Natural Language Supervision and the CLIP GitHub repository.
Embeddings — vector-based numerical representations of typically unstructured data like text — have been primarily…
Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they…
This post is co-written with Sunaina Kavi, AI/ML Product Manager at Omada Health. Omada Health,…
Anthropic released Cowork on Monday, a new AI agent capability that extends the power of…
New York governor Kathy Hochul says she will propose a new law allowing limited autonomous…
Artificial intelligence (AI) is increasingly used to analyze medical images, materials data and scientific measurements,…