Generative AI has revolutionized customer interactions across industries by offering personalized, intuitive experiences powered by unprecedented access to information. This transformation is further enhanced by Retrieval Augmented Generation (RAG), a technique that allows large language models (LLMs) to reference external knowledge sources beyond their training data. RAG has gained popularity for its ability to improve generative AI applications by incorporating additional information, often preferred by customers over techniques like fine-tuning due to its cost-effectiveness and faster iteration cycles.
The RAG approach excels in grounding language generation with external knowledge, producing more factual, coherent, and relevant responses. This capability proves invaluable in applications such as question answering, dialogue systems, and content generation, where accuracy and informative outputs are crucial. For businesses, RAG offers a powerful way to use internal knowledge by connecting company documentation to a generative AI model. When an employee asks a question, the RAG system retrieves relevant information from the company’s internal documents and uses this context to generate an accurate, company-specific response. This approach enhances the understanding and usage of internal company documents and reports. By extracting relevant context from corporate knowledge bases, RAG models facilitate tasks like summarization, information extraction, and complex question answering on domain-specific materials, enabling employees to quickly access vital insights from vast internal resources. This integration of AI with proprietary information can significantly improve efficiency, decision-making, and knowledge sharing across the organization.
A typical RAG workflow consists of four key components: input prompt, document retrieval, contextual generation, and output. The process begins with a user query, which is used to search a comprehensive knowledge corpus. Relevant documents are then retrieved and combined with the original query to provide additional context for the LLM. This enriched input allows the model to generate more accurate and contextually appropriate responses. RAG’s popularity stems from its ability to use frequently updated external data, providing dynamic outputs without the need for costly and compute-intensive model retraining.
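As a purely illustrative sketch of these four components, the workflow can be expressed in a few lines of Python; the retriever and llm objects here are generic placeholders, not the actual components used later in this post.

```python
def answer_with_rag(query: str, retriever, llm, top_k: int = 3) -> str:
    # 1. Document retrieval: search the knowledge corpus for passages related to the query
    documents = retriever.search(query, k=top_k)

    # 2. Augmentation: combine the retrieved context with the original input prompt
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Contextual generation: the LLM produces a grounded output
    return llm.generate(prompt)
```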
To implement RAG effectively, many organizations turn to platforms like Amazon SageMaker JumpStart. This service offers numerous advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with ready-to-use artifacts, a user-friendly interface, and seamless scalability within the AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart enables rapid deployment of both LLMs and embedding models, minimizing the time spent on complex scalability configurations.
In the previous post, we showed how to build a RAG application on SageMaker JumpStart using Facebook AI Similarity Search (Faiss). In this post, we show how to use Amazon OpenSearch Service as a vector store to build an efficient RAG application.
To implement our RAG workflow on SageMaker, we use a popular open source Python library known as LangChain. With LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. The solution consists of the following key components:
The following diagram illustrates the solution architecture.
In the following sections, we walk through setting up OpenSearch, followed by exploring the notebook that implements a RAG solution with LangChain, Amazon SageMaker AI, and OpenSearch Service.
In this post, we showcase how you can use a vector store such as OpenSearch Service as a knowledge base and embedding store. OpenSearch Service offers several advantages when used for RAG in conjunction with SageMaker AI:
You can use SageMaker AI with OpenSearch Service to create powerful and efficient RAG systems. SageMaker AI provides the machine learning (ML) infrastructure for training and deploying your language models, and OpenSearch Service serves as an efficient and scalable knowledge base for retrieval.
Based on our learnings from the hundreds of RAG applications deployed using OpenSearch Service as a vector store, we’ve developed several best practices:
Make sure you have access to one ml.g5.4xlarge instance and one ml.g5.2xlarge instance in your account. A secret should be created in the same AWS Region where the stack is deployed. Then complete the following prerequisite steps to create a secret using AWS Secrets Manager:
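As a minimal sketch only, creating such a secret with boto3 might look like the following, assuming the secret holds the OpenSearch admin username and password; the secret name, Region, and JSON key names here are assumptions, so use the values expected by the CloudFormation stack.

```python
import json
import boto3

# Hypothetical example -- the secret name, Region, and key names are assumptions,
# not values prescribed by this post or the CloudFormation template.
secrets = boto3.client("secretsmanager", region_name="us-east-1")
secrets.create_secret(
    Name="opensearch-admin-credentials",
    SecretString=json.dumps(
        {"username": "admin", "password": "<choose-a-strong-password>"}
    ),
)
```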
We use AWS CloudFormation to deploy our OpenSearch Service cluster, SageMaker notebook, and other resources. Complete the following steps:
Note the OpenSearchDomainEndpoint value on the stack Outputs tab. Then locate SageMakerNotebookURL in the outputs and choose the link to open the SageMaker notebook.
After you have launched the notebook in JupyterLab, complete the following steps: open genai-recipes/RAG-recipes/llama3-RAG-Opensearch-langchain-SMJS.ipynb (you can also clone the notebook from the GitHub repo), then update OPENSEARCH_URL in the notebook with the OpenSearchDomainEndpoint value copied in the previous step (look for os.environ['OPENSEARCH_URL'] = ""). The port needs to be 443.
The notebook provides a detailed explanation of all the steps. We explain some of the key cells in the notebook in this section.
For the RAG workflow, we deploy the huggingface-sentencesimilarity-bge-large-en-v1-5 embedding model and the meta-textgeneration-llama-3-8b-instruct LLM from Hugging Face. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all prepackaged for optimal inference. These are then exposed using the SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:
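A minimal sketch of deploying both models with the SageMaker Python SDK's JumpStart API might look like the following; the instance types mirror the prerequisites above but are assumptions for this sketch, and the notebook remains the reference implementation.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Deploy the Llama 3 8B Instruct LLM to a real-time endpoint
llm_model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")
llm_predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    accept_eula=True,  # Llama 3 requires accepting the Meta EULA
)

# Deploy the BGE embedding model to a second endpoint
embedding_model = JumpStartModel(
    model_id="huggingface-sentencesimilarity-bge-large-en-v1-5"
)
embedding_predictor = embedding_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```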
Content handlers are crucial for formatting data for SageMaker endpoints. They transform inputs into the format expected by the model and handle model-specific parameters like temperature and token limits. These parameters can be tuned to control the creativity and consistency of the model’s responses.
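For example, a content handler for the Llama 3 endpoint in LangChain could look like the following sketch; the exact request and response payload fields depend on the deployed container version, so treat the field names here as assumptions.

```python
import json

from langchain_community.llms.sagemaker_endpoint import LLMContentHandler


class Llama3ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Wrap the prompt and generation parameters (temperature, max_new_tokens, ...)
        # in the JSON structure the endpoint expects.
        payload = {"inputs": prompt, "parameters": model_kwargs}
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Pull the generated text out of the endpoint's JSON response.
        response = json.loads(output.read().decode("utf-8"))
        return response["generated_text"]
```

An instance of this handler is then passed to LangChain's SagemakerEndpoint LLM wrapper, together with the endpoint name and the model_kwargs to use for generation.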
We use PyPDFLoader from LangChain to load PDF files, attach metadata to each document fragment, and then use RecursiveCharacterTextSplitter to break the documents into smaller, manageable chunks. The text splitter is configured with a chunk size of 1,000 characters and an overlap of 100 characters, which helps maintain context between chunks. This preprocessing step is crucial for effective document retrieval and embedding generation, because it makes sure the text segments are appropriately sized for both the embedding model and the language model used in the RAG system.
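A condensed sketch of this preprocessing follows; the PDF path and metadata value are placeholders for illustration.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF (the file path is a placeholder for this sketch)
loader = PyPDFLoader("docs/company-report.pdf")
documents = loader.load()

# Attach simple metadata to each document fragment
for doc in documents:
    doc.metadata["source"] = "company-report.pdf"

# Split into 1,000-character chunks with a 100-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
```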
The following block initializes a vector store using OpenSearch Service for the RAG system. It converts preprocessed document chunks into vector embeddings using a SageMaker model and stores them in OpenSearch Service. The process is configured with security measures like SSL and authentication to provide secure data handling. The bulk insertion is optimized for performance with a sizeable batch size. Finally, the vector store is wrapped with VectorStoreIndexWrapper, providing a simplified interface for operations like querying and retrieval. This setup creates a searchable database of document embeddings, enabling quick and relevant context retrieval for user queries in the RAG pipeline.
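A sketch of this initialization is shown below, assuming the chunks from the splitting step, an embeddings object such as LangChain's SagemakerEndpointEmbeddings pointed at the BGE endpoint, and admin credentials from Secrets Manager; the index name, batch size, and credential variables are assumptions.

```python
import os

from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore = OpenSearchVectorSearch.from_documents(
    documents=chunks,
    embedding=embeddings,                         # SageMaker embeddings wrapper
    opensearch_url=os.environ["OPENSEARCH_URL"],  # https://<domain-endpoint>:443
    http_auth=(opensearch_user, opensearch_password),
    index_name="rag-documents",                   # index name is an assumption
    use_ssl=True,
    verify_certs=True,
    bulk_size=2000,                               # sizeable batch for faster bulk insertion
)

# Wrap the store for simplified query and retrieval operations
wrapper = VectorStoreIndexWrapper(vectorstore=vectorstore)
```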
Next, we use the wrapper from the previous step along with the prompt template. We define the prompt template for interacting with the Meta Llama 3 8B Instruct model in the RAG system. The template uses specific tokens to structure the input in the way the model expects. It sets up a conversation format with system instructions, the user query, and a placeholder for the assistant's response. The PromptTemplate class from LangChain is used to create a reusable prompt with a variable for the user's query. This structured approach to prompt engineering helps maintain consistency in the model's responses and guides it to act as a helpful assistant.
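A sketch of such a template using the standard Llama 3 Instruct chat tokens follows; the system instruction wording and the variable names (context, question) are assumptions for illustration.

```python
from langchain.prompts import PromptTemplate

llama3_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant. Answer the question using only the provided context.<|eot_id|><|start_header_id|>user<|end_header_id|>
Context: {context}

Question: {question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    template=llama3_template,
    input_variables=["context", "question"],
)
```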
Similarly, the notebook also shows how to use RetrievalQA, where you can customize how the retrieved documents are added to the prompt using the chain_type parameter.
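For example, a RetrievalQA chain that stuffs the retrieved chunks directly into the prompt might be constructed like this sketch; the retriever settings, chain_type value, and example query are assumptions.

```python
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                          # SagemakerEndpoint LLM wrapper from earlier
    chain_type="stuff",               # insert all retrieved chunks into one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

response = qa_chain.invoke({"query": "What were the key findings in the latest report?"})
print(response["result"])
```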
Delete your SageMaker endpoints from the notebook to avoid incurring costs:
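With the SageMaker Python SDK predictors created earlier, cleanup in the notebook typically amounts to something like the following sketch.

```python
# Delete the models and endpoints created for the LLM and the embedding model
llm_predictor.delete_model()
llm_predictor.delete_endpoint()

embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()
```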
Next, delete your OpenSearch cluster to stop incurring additional charges:
aws cloudformation delete-stack --stack-name rag-opensearch
RAG has revolutionized how businesses use AI by enabling general-purpose language models to work seamlessly with company-specific data. The key benefit is the ability to create AI systems that combine broad knowledge with up-to-date, proprietary information without expensive model retraining. This approach transforms customer engagement and internal operations by delivering personalized, accurate, and timely responses based on the latest company data.
The RAG workflow, comprising input prompt, document retrieval, contextual generation, and output, allows businesses to tap into their vast repositories of internal documents, policies, and data, making this information readily accessible and actionable. For businesses, this means enhanced decision-making, improved customer service, and increased operational efficiency. Employees can quickly access relevant information, while customers receive more accurate and personalized responses.
Moreover, RAG's cost-efficiency and ability to rapidly iterate make it an attractive solution for businesses looking to stay competitive in the AI era without constant, expensive updates to their AI systems. By making general-purpose LLMs work effectively on proprietary data, RAG empowers businesses to create dynamic, knowledge-rich AI applications that evolve with their data, potentially transforming how companies operate, innovate, and engage with both employees and customers.
SageMaker JumpStart has streamlined the process of developing and deploying generative AI applications. It offers pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem, making it straightforward for businesses to harness the power of RAG.
Furthermore, using OpenSearch Service as a vector store facilitates swift retrieval from vast information repositories. This approach not only enhances the speed and relevance of responses, but also helps manage costs and operational complexity effectively.
By combining these technologies, you can create robust, scalable, and efficient RAG systems that provide up-to-date, context-aware responses to customer queries, ultimately enhancing user experience and satisfaction.
To get started with implementing this Retrieval Augmented Generation (RAG) solution using Amazon SageMaker JumpStart and Amazon OpenSearch Service, check out the example notebook on GitHub. You can also learn more about Amazon OpenSearch Service in the developer guide.