Enterprises are seeking to quickly unlock the potential of generative AI by providing access to foundation models (FMs) to different lines of business (LOBs). IT teams are responsible for helping the LOB innovate with speed and agility while providing centralized governance and observability. For example, they may need to track the usage of FMs across teams, chargeback costs and provide visibility to the relevant cost center in the LOB. Additionally, they may need to regulate access to different models per team. For example, if only specific FMs may be approved for use.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
A software as a service (SaaS) layer for foundation models can provide a simple and consistent interface for end-users, while maintaining centralized governance of access and consumption. API gateways can provide loose coupling between model consumers and the model endpoint service, and flexibility to adapt to changing model, architectures, and invocation methods.
In this post, we show you how to build an internal SaaS layer to access foundation models with Amazon Bedrock in a multi-tenant (team) architecture. We specifically focus on usage and cost tracking per tenant and also controls such as usage throttling per tenant. We describe how the solution and Amazon Bedrock consumption plans map to the general SaaS journey framework. The code for the solution and an AWS Cloud Development Kit (AWS CDK) template is available in the GitHub repository.
An AI platform administrator needs to provide standardized and easy access to FMs to multiple development teams.
The following are some of the challenges to provide governed access to foundation models:
In this solution, we refer to a multi-tenant approach. A tenant here can range from an individual user, a specific project, team, or even an entire department. As we discuss the approach, we use the term team, because it’s the most common. We use API keys to restrict and monitor API access for teams. Each team is assigned an API key for access to the FMs. There can be different user authentication and authorization mechanisms deployed in an organization. For simplicity, we do not include these in this solution. You may also integrate existing identity providers with this solution.
The following diagram summarizes the solution architecture and key components. Teams (tenants) assigned to separate cost centers consume Amazon Bedrock FMs via an API service. To track consumption and cost per team, the solution logs data for each individual invocation, including the model invoked, number of tokens for text generation models, and image dimensions for multi-modal models. In addition, it aggregates the invocations per model and costs by each team.
You can deploy the solution in your own account using the AWS CDK. AWS CDK is an open source software development framework to model and provision your cloud application resources using familiar programming languages. The AWS CDK code is available in the GitHub repository.
In the following sections, we discuss the key components of the solution in more detail.
The workflow to capture FM usage per team consists of the following steps (as numbered in the preceding diagram):
model_id
query parameter and the user prompt in the request body.bedrock_invoke_model
) that’s responsible for logging team usage information in Amazon CloudWatch and invoking the Amazon Bedrock model.A different flow aggregates the usage information, then calculates and saves the on-demand costs per team on a daily basis. By having a separate flow, we ensure that cost tracking doesn’t impact the latency and throughput of the model invocation flow. The workflow steps are as follows:
bedrock_cost_tracking
) daily.team_id
and model_id
in Amazon Simple Storage Service (Amazon S3) in CSV format.To query and visualize the data stored in Amazon S3, you have different options, including S3 Select, and Amazon Athena and Amazon QuickSight.
A usage plan specifies who can access one or more deployed APIs and optionally sets the target request rate to start throttling requests. The plan uses API keys to identify API clients who can access the associated API for each key. You can use API Gateway usage plans to throttle requests that exceed predefined thresholds. You can also use API keys and quota limits, which enable you to set the maximum number of requests per API key each team is permitted to issue within a specified time interval. This is in addition to Amazon Bedrock service quotas that are assigned only at the account level.
Before you deploy the solution, make sure you have the following:
Follow the instructions in the README file of the GitHub repository to configure and deploy the AWS CDK stack.
The stack deploys the following resources:
invoke_model
list_foundation_models
cost_tracking
For providing access to new teams, you can either share the same API key across different teams and track the model consumptions by providing a different team_id
for the API invocation, or create dedicated API keys used for accessing Amazon Bedrock resources by following the instructions provided in the README.
The stack deploys the following resources:
For more information about API Gateway throttling and burst configurations, refer to Throttle API requests for better throughput.
After you deploy the stack, you can see that the new API key for team-2
is created as well.
The platform administrator can allow access to specific foundation models by editing the IAM policy associated to the Lambda function invoke_model
. The
IAM permissions are defined in the file setup/stack_constructs/iam.py. See the following code:
After you have deployed the solution, you can invoke the service directly from your code. The following
is an example in Python for consuming the invoke_model
API for text generation through a POST request:
Output: Amazon Bedrock is an internal technology platform developed by Amazon to run and operate many of their services and products. Some key things about Bedrock …
The following is another example in Python for consuming the invoke_model
API for embeddings generation through a POST request:
model_id = "amazon.titan-embed-text-v1" #the model id for the Amazon Titan Embeddings Text model
prompt = "What is Amazon Bedrock?"
response = requests.post(
f"{api_url}/invoke_model?model_id={model_id}",
json={"inputs": prompt, "parameters": model_kwargs},
headers={
"x-api-key": api_key, #key for querying the API
"team_id": team_id #unique tenant identifier,
"embeddings": "true" #boolean value for the embeddings model
}
)
text = response.json()[0]["embedding"]
Output: 0.91796875, 0.45117188, 0.52734375, -0.18652344, 0.06982422, 0.65234375, -0.13085938, 0.056884766, 0.092285156, 0.06982422, 1.03125, 0.8515625, 0.16308594, 0.079589844, -0.033935547, 0.796875, -0.15429688, -0.29882812, -0.25585938, 0.45703125, 0.044921875, 0.34570312 …
The following is an example in Python for consuming the invoke_model
API for text generation through a POST request with an access denied response:
<Response [500]> “Traceback (most recent call last):n File ”/var/task/index.py”, line 213, in lambda_handlern response = _invoke_text(bedrock_client, model_id, body, model_kwargs)n File ”/var/task/index.py”, line 146, in _invoke_textn raise en File ”/var/task/index.py”, line 131, in _invoke_textn response = bedrock_client.invoke_model(n File ”/opt/python/botocore/client.py”, line 535, in _api_calln return self._make_api_call(operation_name, kwargs)n File ”/opt/python/botocore/client.py”, line 980, in _make_api_calln raise error_class(parsed_response, operation_name)nbotocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the InvokeModel operation: Your account is not authorized to invoke this API operation.n”
When invoking Amazon Bedrock models with on-demand pricing, the total cost is calculated as the sum of the input and output costs. Input costs are based on the number of input tokens sent to the model, and output costs are based on the tokens generated. The prices are per 1,000 input tokens and per 1,000 output tokens. For more details and specific model prices, refer to Amazon Bedrock Pricing.
Let’s look at an example where two teams, team1 and team2, access Amazon Bedrock through the solution in this post. The usage and cost data saved in Amazon S3 in a single day is shown in the following table.
The columns input_tokens
and output_tokens
store the total input and output tokens across model invocations per model and per team, respectively, for a given day.
The columns input_cost
and output_cost
store the respective costs per model and per team. These are calculated using the following formulas:
input_cost = input_token_count * model_pricing["input_cost"] / 1000
output_cost = output_token_count * model_pricing["output_cost"] / 1000
team_id | model_id | input_tokens | output_tokens | invocations | input_cost | output_cost |
Team1 | amazon.titan-tg1-large | 24000 | 2473 | 1000 | 0.0072 | 0.00099 |
Team1 | anthropic.claude-v2 | 2448 | 4800 | 24 | 0.02698 | 0.15686 |
Team2 | amazon.titan-tg1-large | 35000 | 52500 | 350 | 0.0105 | 0.021 |
Team2 | ai21.j2-grande-instruct | 4590 | 9000 | 45 | 0.05738 | 0.1125 |
Team2 | anthropic.claude-v2 | 1080 | 4400 | 20 | 0.0119 | 0.14379 |
Let’s understand what an end-to-end functional multi-tenant serverless SaaS environment might look like. The following is a reference architecture diagram.
This architecture diagram is a zoomed-out version of the previous architecture diagram explained earlier in the post, where the previous architecture diagram explains the details of one of the microservices mentioned (foundational model service). This diagram explains that, apart from foundational model service, you need to have other components as well in your multi-tenant SaaS platform to implement a functional and scalable platform.
Let’s go through the details of the architecture.
The tenant applications are the front end applications that interact with the environment. Here, we show multiple tenants accessing from different local or AWS environments. The front end applications can be extended to include a registration page for new tenants to register themselves and an admin console for administrators of the SaaS service layer. If the tenant applications require a custom logic to be implemented that needs interaction with the SaaS environment, they can implement the specifications of the application adaptor microservice. Example scenarios could be adding custom authorization logic while respecting the authorization specifications of the SaaS environment.
The following are shared services:
This service presents a set of specifications and APIs that a tenant may implement in order to integrate their custom logic to the SaaS environment. Based on how much custom integration is needed, this component can be optional for tenants.
The shared services store their data in a data store that can be a single shared Amazon DynamoDB table with a tenant partitioning key that associates DynamoDB items with individual tenants. The cost tracking shared service outputs the aggregated usage and cost tracking data to Amazon S3. Based on the use case, there can be an application-specific data store as well.
A multi-tenant SaaS environment can have a lot more components. For more information, refer to Building a Multi-Tenant SaaS Solution Using AWS Serverless Services.
SaaS frameworks typically outline two deployment models: pool and silo. For the pool model, all tenants access FMs from a shared environment with common storage and compute infrastructure. In the silo model, each tenant has its own set of dedicated resources. You can read about isolation models in the SaaS Tenant Isolation Strategies whitepaper.
The proposed solution can be adopted for both SaaS deployment models. In the pool approach, a centralized AWS environment hosts the API, storage, and compute resources. In silo mode, each team accesses APIs, storage, and compute resources in a dedicated AWS environment.
The solution also fits with the available consumption plans provided by Amazon Bedrock. AWS provides a choice of two consumptions plan for inference:
For more information about these options, refer to Amazon Bedrock Pricing.
The serverless SaaS reference solution described in this post can apply the Amazon Bedrock consumption plans to provide basic and premium tiering options to end-users. Basic could include On-Demand or Provisioned Throughput consumption of Amazon Bedrock and could include specific usage and budget limits. Tenant limits could be enabled by throttling requests based on requests, token sizes, or budget allocation. Premium tier tenants could have their own dedicated resources with provisioned throughput consumption of Amazon Bedrock. These tenants would typically be associated with production workloads that require high throughput and low latency access to Amazon Bedrock FMs.
In this post, we discussed how to build an internal SaaS platform to access foundation models with Amazon Bedrock in a multi-tenant setup with a focus on tracking costs and usage, and throttling limits for each tenant. Additional topics to explore include integrating existing authentication and authorization solutions in the organization, enhancing the API layer to include web sockets for bi-directional client server interactions, adding content filtering and other governance guardrails, designing multiple deployment tiers, integrating other microservices in the SaaS architecture, and many more.
The entire code for this solution is available in the GitHub repository.
For more information about SaaS-based frameworks, refer to SaaS Journey Framework: Building a New SaaS Solution on AWS.
The report The economic potential of generative AI: The next productivity frontier, published by McKinsey…
The new model shows open-source closing in on closed-source models, suggesting reduced chances of one…
Samsung’s celebrated flagship soundbar does just enough to beat out the rest of its Dolby…
Even highly realistic androids can cause unease when their facial expressions lack emotional consistency. Traditionally,…
These beard tools deliver a quality trim for all types of facial hair.
Artificial intelligence (AI) research, particularly in the machine learning (ML) domain, continues to increase the…