The report The economic potential of generative AI: The next productivity frontier, published by McKinsey & Company, estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion in value to the global economy. The largest value will be added across four areas: customer operations, marketing and sales, software engineering, and R&D.
The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications on AWS. However, many product managers and enterprise architecture leaders want a better understanding of the costs, cost-optimization levers, and sensitivity analysis.
This post addresses these cost considerations so you can optimize your generative AI costs on AWS.
The post assumes basic familiarity with foundation models (FMs) and large language models (LLMs), tokens, vector embeddings, and vector databases on AWS. With Retrieval Augmented Generation (RAG) being one of the most common frameworks used in generative AI solutions, the post explains costs in the context of a RAG solution and the respective optimization pillars on Amazon Bedrock.
In Part 2 of this series, we will cover how to estimate business value and the influencing factors.
Designing performant and cost-effective generative AI applications is essential for realizing the full potential of this transformative technology and driving widespread adoption within your organization.
Forecasting and managing costs and performance in generative AI applications is driven by the following optimization pillars:
Let’s dive deeper to examine these factors and associated cost-optimization tips.
RAG helps an LLM answer questions specific to your corporate data, even though the LLM was never trained on your data.
As illustrated in the following diagram, the generative AI application reads your trusted corporate data sources, chunks them, generates vector embeddings, and stores the embeddings in a vector database. The vectors and data stored in a vector database are often called a knowledge base.
The generative AI application uses the vector embeddings to search and retrieve chunks of data that are most relevant to the user’s question and augment the question to generate the LLM response. The following diagram illustrates this workflow.
The workflow consists of the following steps:
Amazon Bedrock is a fully managed service providing access to high-performing FMs from leading AI providers through a unified API. It offers a wide range of LLMs to choose from.
In the preceding workflow, the generative AI application invokes Amazon Bedrock APIs to send text to an embeddings model such as Amazon Titan Text Embeddings V2 to generate text embeddings, and to send prompts to an LLM such as Anthropic's Claude 3 Haiku or Meta Llama to generate a response.
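As a concrete illustration, the following Python sketch (using boto3; the model IDs, prompt text, and placeholder chunks are assumptions that can vary by Region and use case) shows the two Amazon Bedrock calls in this workflow: generating an embedding for the user question and sending the augmented prompt to the LLM through the Converse API.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # Region and credentials come from your AWS config

# 1) Generate an embedding for the user question with Amazon Titan Text Embeddings V2.
question = "What are the performance and cost optimization strategies for Amazon DynamoDB?"
emb_response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": question}),
)
question_embedding = json.loads(emb_response["body"].read())["embedding"]

# 2) Retrieve the most relevant chunks from the vector database using the embedding
#    (see the OpenSearch Service sketch later in this post); placeholders are used here.
retrieved_chunks = ["<chunk 1>", "<chunk 2>", "<chunk 3>"]

# 3) Send the augmented prompt to Anthropic's Claude 3 Haiku through the Converse API.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    system=[{"text": "Answer only from the provided context."}],
    messages=[{
        "role": "user",
        "content": [{"text": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + question}],
    }],
    inferenceConfig={"maxTokens": 512},
)
print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # inputTokens and outputTokens drive On-Demand costs
```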
The generated text embeddings are stored in a vector database such as Amazon OpenSearch Service, Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon MemoryDB.
A generative AI application such as a virtual assistant or support chatbot might need to carry a conversation with users. A multi-turn conversation requires the application to store a per-user question-answer history and send it to the LLM for additional context. This question-answer history can be stored in a database such as Amazon DynamoDB.
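A minimal sketch of this pattern, assuming a hypothetical DynamoDB table named chat-history with a user_id partition key and a ts sort key, might look like the following.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("chat-history")  # hypothetical table name

def save_turn(user_id: str, question: str, answer: str) -> None:
    # Store each question-answer pair with a millisecond timestamp as the sort key.
    table.put_item(Item={
        "user_id": user_id,
        "ts": int(time.time() * 1000),
        "question": question,
        "answer": answer,
    })

def last_turns(user_id: str, n: int = 3) -> list:
    # Fetch the n most recent turns to send to the LLM as conversational context.
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # newest first
        Limit=n,
    )
    return list(reversed(resp["Items"]))
```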
The generative AI application could also use Amazon Bedrock Guardrails to detect off-topic questions, ground responses in the knowledge base, detect and redact personally identifiable information (PII), and detect and block hate- or violence-related questions and answers.
Now that we have a good understanding of the various components in a RAG-based generative AI application, let's explore how these factors influence the costs of running your application on AWS.
Consider an organization that wants to help its customers with a virtual assistant that can answer their questions at any time with a high degree of accuracy, performance, consistency, and safety. The performance and cost of the generative AI application depend directly on a few major factors in the environment, such as the velocity of questions per minute, the volume of questions per day (considering peak and off-peak), the amount of knowledge base data, and the LLM that is used.
Although this post explains the factors that influence costs, it can be useful to know the directional costs, based on some assumptions, to get a relative understanding of various cost components for a few scenarios such as small, medium, large, and extra large environments.
The following table is a snapshot of directional costs for four different scenarios with varying volume of user questions per month and knowledge base data.
. | SMALL | MEDIUM | LARGE | EXTRA LARGE
Inputs | . | . | . | .
Total questions per month | 500,000 | 2,000,000 | 5,000,000 | 7,020,000
Knowledge base data size in GB (actual text size on documents) | 5 | 25 | 50 | 100
Annual costs (directional)* | . | . | . | .
Amazon Bedrock On-Demand costs using Anthropic's Claude 3 Haiku | $5,785 | $23,149 | $57,725 | $81,027
Amazon OpenSearch Service provisioned cluster costs | $6,396 | $13,520 | $20,701 | $39,640
Amazon Bedrock Titan Text Embeddings V2 costs | $396 | $5,826 | $7,320 | $13,585
Total annual costs (directional) | $12,577 | $42,495 | $85,746 | $134,252
Unit cost per 1,000 questions (directional) | $2.10 | $1.80 | $1.40 | $1.60
These costs are based on assumptions. Costs will vary if assumptions change. Cost estimates will vary for each customer. The data in this post should not be used as a quote and does not guarantee the cost for actual use of AWS services. The costs, limits, and models can change over time.
For the sake of brevity, we use the following assumptions:
We refer to this table in the remainder of this post. In the next few sections, we will cover Amazon Bedrock costs and the key factors that influence them, vector embedding costs, vector database costs, and Amazon Bedrock Guardrails costs. In the final section, we will cover how chunking strategies influence some of these cost components.
Amazon Bedrock has two pricing models: On-Demand (used in the preceding example scenario) and Provisioned Throughput.
With the On-Demand model, an LLM has a maximum requests (questions) per minute (RPM) and tokens per minute (TPM) limit. The RPM and TPM are typically different for each LLM. For more information, see Quotas for Amazon Bedrock.
In the extra large use case, with 7 million questions per month, assuming 10 hours per day and 22 business days per month, it translates to 532 questions per minute (532 RPM). This is well below the maximum limit of 1,000 RPM for Anthropic’s Claude 3 Haiku.
With 2,720 average tokens per question and 532 requests per minute, the TPM is 2,720 x 532 = 1,447,040, which is well below the maximum limit of 2,000,000 TPM for Anthropic’s Claude 3 Haiku.
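The following back-of-the-envelope sketch reproduces these figures using the post's assumptions (7,020,000 questions per month, 22 business days, 10 hours per day, and 2,720 average tokens per question).

```python
import math

# Directional RPM/TPM check for the extra large scenario.
questions_per_month = 7_020_000
active_minutes = 22 * 10 * 60                 # 22 business days x 10 hours x 60 minutes
avg_tokens_per_question = 2_720

rpm = math.ceil(questions_per_month / active_minutes)   # 532 requests per minute
tpm = rpm * avg_tokens_per_question                      # 1,447,040 tokens per minute
print(f"RPM: {rpm}, TPM: {tpm:,}")

# Compare against the model's On-Demand quotas (1,000 RPM and 2,000,000 TPM for
# Anthropic's Claude 3 Haiku in this post) to decide whether On-Demand is
# sufficient or Provisioned Throughput is needed.
```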
However, assume that user questions grow by 50%. The RPM, the TPM, or both might then cross the thresholds. In such cases, where the generative AI application needs to cross the On-Demand RPM and TPM thresholds, you should consider the Amazon Bedrock Provisioned Throughput model.
With Amazon Bedrock Provisioned Throughput, cost is charged per model unit. Model units are dedicated for the duration you commit to, such as an hourly, 1-month, or 6-month commitment.
Each model unit offers a certain capacity of maximum tokens per minute. Therefore, the number of model units (and the costs) are determined by the input and output TPM.
With Amazon Bedrock Provisioned Throughput, you incur charges per model unit whether you use it or not. Therefore, the Provisioned Throughput model is relatively more expensive than the On-Demand model.
Consider the following cost-optimization tips:
In this section, we discuss various factors that can influence costs.
Costs grow as the number of questions grows with the On-Demand model, as can be seen in the following figure for annual costs (based on the table discussed earlier).
The main sources of input tokens to the LLM are the system prompt, user prompt, context from the vector database (knowledge base), and context from QnA history, as illustrated in the following figure.
As the size of each component grows, the number of input tokens to the LLM grows, and so do the costs.
Generally, user prompts are relatively small. For example, in the user prompt “What are the performance and cost optimization strategies for Amazon DynamoDB?”, assuming four characters per token, there are approximately 20 tokens.
System prompts can be large (and therefore more costly), especially for few-shot prompts where multiple examples are provided to get LLM responses with better tone and style. If each example in the system prompt uses 100 tokens and there are three examples, that's 300 tokens, which is considerably larger than the actual user prompt.
Context from the knowledge base tends to be the largest. For example, when the documents are chunked and text embeddings are generated for each chunk, assume that the chunk size is 2,000 characters. Assume that the generative AI application sends three chunks relevant to the user prompt to the LLM. This is 6,000 characters. Assuming four characters per token, this translates to 1,500 tokens. This is much higher compared to a typical user prompt or system prompt.
Context from QnA history can also be high. Assume an average of 20 tokens in the user prompt and 100 tokens in LLM response. Assume that the generative AI application sends a history of three question-answer pairs along with each question. This translates to (20 tokens per question + 100 tokens per response) x 3 question-answer pairs = 360 tokens.
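Putting these components together, a rough per-question input-token budget under the assumptions in this section looks like the following sketch.

```python
# Directional input-token budget per question (4 characters per token assumed).
system_prompt_tokens = 3 * 100          # three examples at ~100 tokens each
user_prompt_tokens = 20                 # short user question
kb_context_tokens = 3 * (2000 // 4)     # three 2,000-character chunks = 1,500 tokens
history_tokens = 3 * (20 + 100)         # three prior question-answer pairs

input_tokens = (system_prompt_tokens + user_prompt_tokens
                + kb_context_tokens + history_tokens)
print(input_tokens)                     # ~2,180 input tokens per question
# Note: the scenario table earlier in this post uses its own average-token
# assumptions; this is only a component-level illustration.
```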
Consider the following cost-optimization tips:
The response from the LLM will depend on the user prompt. In general, the pricing for output tokens is three to five times higher than the pricing for input tokens.
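To see what this ratio means per question, the following sketch uses illustrative On-Demand prices of roughly $0.25 per million input tokens and $1.25 per million output tokens (in the range published for Anthropic's Claude 3 Haiku at the time of writing; check current pricing), together with the token counts from the previous sketch.

```python
# Illustrative per-question cost split; prices are assumptions, not a quote.
input_tokens, output_tokens = 2_180, 100
input_price_per_million, output_price_per_million = 0.25, 1.25   # USD, illustrative

input_cost = input_tokens / 1e6 * input_price_per_million
output_cost = output_tokens / 1e6 * output_price_per_million
print(f"input: ${input_cost:.6f}, output: ${output_cost:.6f} per question")

# Even though each output token costs several times more than an input token,
# input tokens usually dominate RAG costs because the retrieved context is large.
```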
Consider the following cost-optimization tips:
As explained previously, in a RAG application, the data is chunked, and text embeddings are generated and stored in a vector database (knowledge base). The text embeddings are generated by invoking the Amazon Bedrock API with an embeddings model, such as Amazon Titan Text Embeddings V2. This is independent of the Amazon Bedrock model you choose for inferencing, such as Anthropic's Claude 3 Haiku or other LLMs.
The pricing to generate text embeddings is based on the number of input tokens. The greater the data, the greater the input tokens, and therefore the higher the costs.
For example, with 25 GB of data, assuming four characters per token, input tokens total 6,711 million. With the Amazon Bedrock On-Demand costs for Amazon Titan Text Embeddings V2 as $0.02 per million tokens, the cost of generating embeddings is $134.22.
However, On-Demand has an RPM limit of 2,000 for Amazon Titan Text Embeddings V2. With 2,000 RPM, it will take 112 hours to embed 25 GB of data. Because this is a one-time job of embedding data, this might be acceptable in most scenarios.
For a monthly change rate and new data volume of 5% (1.25 GB per month), the time required will be approximately 6 hours.
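The following sketch reproduces these directional numbers using the post's assumptions: 4 characters per token, 2,000-character chunks (about 500 tokens each, one chunk per request), $0.02 per million tokens for Amazon Titan Text Embeddings V2, and the 2,000 RPM On-Demand quota.

```python
# Directional one-time embedding cost and duration for 25 GB of text.
data_gb = 25
chars = data_gb * 1024**3
tokens = chars / 4                       # ~6,711 million tokens
cost = tokens / 1e6 * 0.02               # ~$134 at $0.02 per million tokens

chunks = tokens / 500                    # ~13.4 million embedding requests
hours = chunks / 2_000 / 60              # ~112 hours at 2,000 requests per minute
print(f"tokens: {tokens/1e6:,.0f}M, cost: ${cost:,.2f}, time: {hours:.0f} hours")
```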
In rare situations where the actual text data runs into terabytes, Provisioned Throughput will be needed to generate text embeddings. For example, generating text embeddings for 500 GB in 3, 6, or 9 days will cost approximately $60,000, $33,000, or $24,000, respectively, as a one-time charge using Provisioned Throughput.
Typically, the actual text inside a file is 5–10 times smaller than the file size reported by Amazon S3 or a file system. Therefore, when you see 100 GB size for all your files that need to be vectorized, there is a high probability that the actual text inside the files will be 2–20 GB.
One way to estimate the text size inside files is to extract the text programmatically and compare its size with the size on disk, as in the following sketch.
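The sketch below handles plain-text files only so it stays self-contained; for PDFs or Office documents you would plug in a text-extraction library. The ./docs path is a placeholder.

```python
import os

def estimate_text_size(root: str) -> tuple:
    # Compare on-disk size with the size of the extracted text.
    disk_bytes, text_bytes = 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            disk_bytes += os.path.getsize(path)
            if name.lower().endswith((".txt", ".md", ".csv")):
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text_bytes += len(f.read().encode("utf-8"))
    return disk_bytes, text_bytes

disk, text = estimate_text_size("./docs")   # placeholder path
print(f"on-disk: {disk/1e9:.2f} GB, extracted text: {text/1e9:.2f} GB, "
      f"estimated tokens: {text/4:,.0f}")   # ~4 characters per token
```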
AWS offers many vector databases, such as OpenSearch Service, Aurora, Amazon RDS, and MemoryDB. As explained earlier in this post, the vector database plays a critical role in grounding responses in your enterprise data, whose vector embeddings it stores.
The following are some of the factors that influence the costs of a vector database. For the sake of brevity, we consider an OpenSearch Service provisioned cluster as the vector database.
Other factors, such as the number of retrievals from the vector database that you pass as context to the LLM, can influence input tokens and therefore costs. But in general, the preceding factors are the most important cost drivers.
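For reference, a minimal OpenSearch Service k-NN sketch (using the opensearch-py client, with a hypothetical index name and domain endpoint, and the 1,024-dimension output of Amazon Titan Text Embeddings V2) for storing and retrieving chunks might look like this.

```python
from opensearchpy import OpenSearch

# Hypothetical endpoint; add authentication (for example, SigV4) as required.
client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}], use_ssl=True)

# Create a k-NN index: one knn_vector field sized to the embedding dimension,
# plus the chunk text as metadata to return as LLM context.
client.indices.create(
    index="kb-chunks",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            "embedding": {"type": "knn_vector", "dimension": 1024,
                          "method": {"name": "hnsw", "space_type": "l2", "engine": "faiss"}},
            "chunk_text": {"type": "text"},
        }},
    },
)

# Retrieve the top 3 chunks closest to the question embedding.
question_embedding = [0.0] * 1024   # placeholder; use the embedding from Amazon Bedrock
results = client.search(
    index="kb-chunks",
    body={"size": 3,
          "query": {"knn": {"embedding": {"vector": question_embedding, "k": 3}}}},
)
context_chunks = [hit["_source"]["chunk_text"] for hit in results["hits"]["hits"]]
```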
Let’s assume your generative AI virtual assistant is supposed to answer questions related to your products for your customers on your website. How will you avoid users asking off-topic questions such as science, religion, geography, politics, or puzzles? How do you avoid responding to user questions on hate, violence, or race? And how can you detect and redact PII in both questions and responses?
The Amazon Bedrock ApplyGuardrail API can help you solve these problems. Guardrails offer multiple policies such as content filters, denied topics, contextual grounding checks, and sensitive information filters (PII). You can selectively apply these filters to all or a specific portion of data such as user prompt, system prompt, knowledge base context, and LLM responses.
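A minimal sketch of an ApplyGuardrail call, assuming a guardrail has already been created (the guardrail ID and version below are placeholders), might look like the following; here the guardrail is applied only to the model output.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
llm_response_text = "..."   # the model response from the earlier Converse call

response = bedrock.apply_guardrail(
    guardrailIdentifier="my-guardrail-id",   # placeholder
    guardrailVersion="1",                    # placeholder
    source="OUTPUT",                         # assess only the LLM response here
    content=[{"text": {"text": llm_response_text}}],
)

if response["action"] == "GUARDRAIL_INTERVENED":
    # Use the guardrail's redacted or blocked output instead of the raw response.
    final_text = response["outputs"][0]["text"]
else:
    final_text = llm_response_text
```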
Applying all filters to all data will increase costs. Therefore, you should evaluate carefully which filter you want to apply on what portion of data. For example, if you want PII to be detected or redacted from the LLM response, for 2 million questions per month, approximate costs (based on output tokens mentioned earlier in this post) would be $200 per month. In addition, if your security team wants to detect or redact PII for user questions as well, the total Amazon Bedrock Guardrails costs will be $400 per month.
As explained earlier in how RAG works, your data is chunked, embeddings are generated for those chunks, and the chunks and embeddings are stored in a vector database. These chunks of data are retrieved later and passed as context along with user questions to the LLM to generate a grounded and relevant response.
The following are different chunking strategies, each of which can influence costs:
The following table is a relative cost comparison for various chunking strategies.
Chunking Strategy | Standard | Semantic | Hierarchical |
Relative Inference Costs | Low | Medium | High |
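As a point of reference, the following sketch shows standard (fixed-size) chunking with the post's 2,000-character chunks and a small overlap; semantic and hierarchical chunking instead split on embedding similarity or maintain parent/child chunks, which is why their relative costs in the preceding table are higher.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list:
    # Standard chunking: fixed-size character windows with a small overlap so
    # that sentences straddling a boundary still appear intact in one chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```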
In this post, we discussed various factors that could impact costs for your generative AI application. This is a rapidly evolving space, and costs for the components we mentioned could change in the future. Consider the costs in this post as a snapshot in time that is based on assumptions and is directionally accurate. If you have any questions, reach out to your AWS account team.
In Part 2, we discuss how to calculate business value and the factors that impact business value.