Large language models (LLMs) are very large deep learning models that are pre-trained on vast amounts of data. LLMs are incredibly flexible: one model can perform completely different tasks such as answering questions, summarizing documents, translating languages, and completing sentences. LLMs have the potential to revolutionize content creation and the way people use search engines and virtual assistants.

Retrieval Augmented Generation (RAG) is the process of optimizing the output of an LLM so that it references an authoritative knowledge base outside of its training data sources before generating a response. While LLMs are trained on vast volumes of data and use billions of parameters to generate original output, RAG extends their already powerful capabilities to specific domains or an organization's internal knowledge base without retraining the model. It is a fast and cost-effective way to keep LLM output relevant, accurate, and useful in a specific context.

RAG introduces an information retrieval component that uses the user input to first pull information from a new data source. This data from outside the LLM's original training data set is called external data, and it can exist in various formats such as files, database records, or long-form text. An embedding language model converts the external data into numerical representations and stores them in a vector database, creating a knowledge library that generative AI models can understand.
RAG introduces additional data engineering requirements:
In this post, we explore how to build a reusable RAG data pipeline on LangChain, an open source framework for building applications based on LLMs, and integrate it with AWS Glue and Amazon OpenSearch Serverless. The end solution is a reference architecture for scalable RAG indexing and deployment. We provide sample notebooks covering ingestion, transformation, vectorization, and index management, enabling teams to ingest disparate data sources into high-performing RAG applications.
Data pre-processing is crucial for responsible retrieval from your external data with RAG. Clean, high-quality data leads to more accurate results with RAG, while privacy and ethics considerations necessitate careful data filtering. This lays the foundation for LLMs with RAG to reach their full potential in downstream applications.
To facilitate effective retrieval from external data, a common practice is to first clean up and sanitize the documents. You can use Amazon Comprehend or the AWS Glue sensitive data detection capability to identify sensitive data and then use Spark to clean up and sanitize it. The next step is to split the documents into manageable chunks. The chunks are then converted to embeddings and written to a vector index, while maintaining a mapping to the original document. This process is shown in the figure that follows. These embeddings are used to determine semantic similarity between queries and text from the data sources.
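To make the chunk-and-embed step concrete, here is a minimal sketch using LangChain. The embedding model, document text, and record layout are placeholder assumptions for illustration, not the exact code used in the notebooks.

```python
# Minimal sketch: split cleaned text into chunks, embed each chunk, and keep
# a mapping back to the source document. HuggingFaceEmbeddings is only a
# stand-in here; the solution in this post hosts its models on AWS.
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

document_text = "# Heading\nSome long Markdown document ..."  # placeholder input

splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(document_text)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embeddings.embed_documents(chunks)

# Each record keeps a pointer to its source document so the original text can
# be returned alongside a similarity hit.
records = [
    {"source": "doc-001", "chunk": chunk, "vector": vector}
    for chunk, vector in zip(chunks, vectors)
]
```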
In this solution, we use LangChain integrated with AWS Glue for Apache Spark and Amazon OpenSearch Serverless. To make this solution scalable and customizable, we use Apache Spark’s distributed capabilities and PySpark’s flexible scripting capabilities. We use OpenSearch Serverless as a sample vector store and use the Llama 3.1 model.
The benefits of this solution are:
This solution covers the following areas:
To continue this tutorial, you must create the following AWS resources in advance:
Complete the following steps to launch an AWS Glue Studio notebook:
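Once the notebook is running, its first cell typically configures the Glue interactive session with magics. The following is a rough sketch; the worker settings and the list of additional Python modules are assumptions to adjust for your environment.

```
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%additional_python_modules langchain,langchain-community,opensearch-py,markdownify,beautifulsoup4
```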
Now you have configured the required settings for your AWS Glue notebook.
First, create a vector store. A vector store enables efficient vector similarity search through specialized indexes. RAG complements LLMs with an external knowledge base, typically built using a vector database hydrated with vector-encoded knowledge articles.
In this example, you will use Amazon OpenSearch Serverless for its simplicity and scalability, which supports low-latency vector search at the scale of billions of vectors. Learn more in Amazon OpenSearch Service's vector database capabilities explained.
Complete the following steps to set up OpenSearch Serverless:
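For reference, the same provisioning can be scripted. The following is a minimal boto3 sketch that assumes a placeholder collection name and an AWS-owned encryption key; the network and data access policies that are also required are omitted for brevity.

```python
# Rough illustration: provisioning an OpenSearch Serverless vector search
# collection with boto3. Names and policy details are placeholder assumptions;
# network and data access policies are also needed before the collection is usable.
import json
import boto3

aoss = boto3.client("opensearchserverless")
collection_name = "rag-demo-collection"  # placeholder name

# Encryption policy (AWS-owned key) scoped to the collection.
aoss.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{
            "ResourceType": "collection",
            "Resource": [f"collection/{collection_name}"],
        }],
        "AWSOwnedKey": True,
    }),
)

# Create the collection itself with the VECTORSEARCH type.
aoss.create_collection(name=collection_name, type="VECTORSEARCH")
```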
You have provisioned OpenSearch Serverless successfully. Now you’re ready to inject documents into the vector store.
In this example, you will use a sample HTML file as the HTML input. It’s an article with specialized content that LLMs cannot answer without using RAG.
Define two functions, parse_html and format_md. We use Beautiful Soup to parse HTML and convert it to Markdown using markdownify in order to use MarkdownTextSplitter for chunking. These functions will be used inside a Spark Python user-defined function (UDF) in later cells.

Use MarkdownTextSplitter to split the text along Markdown-formatted headings into manageable chunks. Adjusting chunk size and overlap is crucial to help prevent the interruption of contextual meaning, which can affect the accuracy of subsequent vector store searches. The example uses a chunk size of 1,000 and a chunk overlap of 100 to preserve information continuity, but these settings can be fine-tuned to suit different use cases.

process_batch injects the documents into the vector store through the OpenSearch implementation inside LangChain, which takes the embeddings model and the documents to create the entire vector store.

You have successfully ingested an embedding into the OpenSearch Serverless collection.
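A condensed sketch of these pieces is shown below. It assumes the OpenSearch endpoint, index name, and embedding model are already defined, and it simplifies away the Spark UDF and batching details from the notebook.

```python
# Condensed sketch of the ingestion path: HTML -> Markdown -> chunks -> OpenSearch.
# The endpoint URL, index name, and embeddings object are assumed to exist; for
# OpenSearch Serverless you would also pass SigV4 auth and a connection class,
# omitted here.
from bs4 import BeautifulSoup
from markdownify import markdownify as to_markdown
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.vectorstores import OpenSearchVectorSearch


def parse_html(raw_html: str) -> str:
    """Strip script and style tags and return the cleaned HTML."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return str(soup)


def format_md(cleaned_html: str) -> str:
    """Convert cleaned HTML to Markdown so headings survive chunking."""
    return to_markdown(cleaned_html)


def process_batch(markdown_text: str, embeddings, opensearch_url: str, index_name: str):
    """Chunk the Markdown and write the chunk embeddings to the OpenSearch index."""
    splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.create_documents([markdown_text])
    return OpenSearchVectorSearch.from_documents(
        docs,
        embeddings,
        opensearch_url=opensearch_url,
        index_name=index_name,
    )
```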
In this section, we demonstrate the question answering capability using the embeddings ingested in the previous section.
Define the OpenSearchVectorSearch client, the LLM using Llama 3.1, and RetrievalQA, where you can customize how the fetched documents should be added to the prompt using the chain_type parameter. Optionally, you can choose other foundation models (FMs). For such cases, refer to the model card to adjust the chunking length.

Now that you have the relevant documents, it's time to use the LLM to generate an answer based on the embeddings.
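A rough sketch of this retrieval-and-generation step is shown below, assuming the vector store client and the Llama 3.1 LLM object from the earlier cells are already in scope; the question text mirrors the example discussed next.

```python
# Rough sketch of retrieval-augmented question answering with LangChain.
# `vectorstore` and `llm` are assumed to be the OpenSearchVectorSearch client
# and the Llama 3.1 model object defined earlier.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff the retrieved chunks directly into the prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,  # keep the chunks that supported the answer
)

result = qa_chain.invoke({"query": "What is task decomposition?"})
print(result["result"])
```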
As expected, the LLM answered with a detailed explanation about task decomposition. For production workloads, balancing latency and cost efficiency is crucial in semantic searches through vector stores. It's important to select the most suitable k-NN algorithm and parameters for your specific needs, as detailed in this post. Additionally, consider using product quantization (PQ) to reduce the dimensionality of embeddings stored in the vector database. This approach can be advantageous for latency-sensitive tasks, though it might involve some trade-offs in accuracy. For additional details, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.
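As a rough illustration of where those choices surface, the k-NN algorithm and its parameters are declared in the index mapping. The engine, dimension, and HNSW parameters below are placeholder assumptions to tune per the guidance linked above.

```python
# Illustrative k-NN field mapping for an OpenSearch vector index.
# The engine, space type, HNSW parameters, and dimension are placeholder
# choices, not recommendations; the dimension must match the embedding model.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",      # graph-based approximate nearest neighbor
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            }
        }
    },
}
```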
Now to the final step, cleaning up the resources:
This post explored a reusable RAG data pipeline using LangChain, AWS Glue, Apache Spark, Amazon SageMaker JumpStart, and Amazon OpenSearch Serverless. The solution provides a reference architecture for ingesting, transforming, vectorizing, and managing indexes for RAG at scale by using Apache Spark's distributed capabilities and PySpark's flexible scripting capabilities. This enables you to preprocess your external data through phases that include cleaning, sanitization, document chunking, vector embedding generation for each chunk, and loading into a vector store.