As organizations increasingly adopt AI capabilities across their applications, centralized management, security, and cost control of AI model access become essential to scaling AI solutions. The Generative AI Gateway on AWS guidance addresses these challenges with a unified gateway that supports multiple AI providers while offering comprehensive governance and monitoring capabilities.
The Generative AI Gateway is a reference architecture for enterprises looking to implement end-to-end generative AI solutions featuring multiple models, data-enriched responses, and agent capabilities in a self-hosted way. This guidance combines the broad model access of Amazon Bedrock, the unified developer experience of Amazon SageMaker AI, and the robust management capabilities of LiteLLM, all while supporting customer access to models from external model providers in a more secure and reliable manner.
LiteLLM is an open source project that addresses common challenges faced by customers deploying generative AI workloads. It simplifies multi-provider model access while standardizing production operational requirements, including cost tracking, observability, and prompt management. In this post, we introduce how the Multi-Provider Generative AI Gateway reference architecture provides guidance for deploying LiteLLM into an AWS environment for production generative AI workload management and governance.
Organizations building with generative AI face several complex challenges as they scale their AI initiatives, including managing access across multiple model providers, controlling costs, enforcing security and governance policies, and maintaining observability.
This guidance addresses these common customer challenges by providing a centralized gateway that abstracts the complexity of multiple AI providers behind a single, managed interface.
Built on AWS services and the open source LiteLLM project, this solution lets organizations integrate with AI providers while maintaining centralized control, security, and observability.
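To illustrate the single-interface idea, the following sketch calls two different providers through the same gateway endpoint using the OpenAI Python SDK, which works because the LiteLLM proxy exposes an OpenAI-compatible API. The gateway URL, virtual key, and model aliases are placeholders; actual values depend on how you deploy and configure the gateway.

```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of a single provider.
# The base URL and virtual key below are placeholders for your deployment.
client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="sk-litellm-virtual-key",
)

# The same client can reach different providers; the gateway resolves each
# model name to whichever backend it has been configured to route to.
for model in ["bedrock-claude", "gpt-4o-mini"]:  # example aliases defined in the gateway
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize the benefits of an AI gateway."}],
    )
    print(model, "->", response.choices[0].message.content[:80])
```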
The Multi-Provider Generative AI Gateway supports multiple deployment patterns to meet diverse organizational needs:
Amazon ECS deployment
For teams preferring containerized applications with managed infrastructure, the ECS deployment provides serverless container orchestration with automatic scaling and integrated load balancing.
Amazon EKS deployment
Organizations with existing Kubernetes expertise can use the EKS deployment option, which provides full control over container orchestration while benefiting from a managed Kubernetes control plane. Customers can deploy a new cluster or leverage existing clusters for deployment.
The reference architecture provided for these deployment options may require additional security testing based on your organization's specific security requirements. Conduct additional security testing and review as necessary before deploying to production.
The Multi-Provider Generative AI Gateway supports multiple network architecture options:
Global public-facing deployment
For AI services with global user bases, combine the gateway with Amazon CloudFront and Amazon Route 53. This configuration provides globally distributed, low-latency access through CloudFront edge locations, along with DDoS protection from AWS Shield Standard.
Regional direct access
For single-Region deployments prioritizing low latency and cost optimization, direct access to the Application Load Balancer (ALB) removes the CloudFront layer while maintaining security through properly configured security groups and network ACLs.
Private internal access
Organizations requiring complete isolation can deploy the gateway within a private VPC without internet exposure. This configuration makes sure that AI model access remains within your secure network perimeter, with ALB security groups restricting traffic to authorized private subnet CIDRs only.
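As a rough illustration of that restriction, the following sketch uses boto3 to allow inbound HTTPS to the ALB security group only from a private subnet CIDR. The security group ID and CIDR are placeholders, and in practice the guidance's infrastructure-as-code templates may create these rules for you.

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder values; substitute the ALB security group created by your deployment
# and the private CIDR ranges that should be allowed to reach the gateway.
ALB_SECURITY_GROUP_ID = "sg-0123456789abcdef0"
ALLOWED_PRIVATE_CIDR = "10.0.1.0/24"

# Allow HTTPS only from the authorized private subnet; no 0.0.0.0/0 rule is added,
# so the gateway stays unreachable from the public internet.
ec2.authorize_security_group_ingress(
    GroupId=ALB_SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": ALLOWED_PRIVATE_CIDR, "Description": "Internal gateway clients"}],
        }
    ],
)
```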
The Multi-Provider Generative AI Gateway is built to enable robust AI governance standards from a straightforward administrative interface. In addition to policy-based configuration and access management, users can configure advanced capabilities like load-balancing and prompt caching.
The Generative AI Gateway includes a web-based administrative interface in LiteLLM that supports comprehensive management of LLM usage across your organization.
Key capabilities include:
User and team management: Configure access controls at granular levels, from individual users to entire teams, with role-based permissions that align with your organizational structure.
API key management: Centrally manage and rotate API keys for the connected AI providers while maintaining audit trails of key usage and access patterns (a key-creation sketch follows this list).
Budget controls and alerting: Set spending limits across providers, teams, and individual users with automated alerts when thresholds are approached or exceeded.
Comprehensive cost controls: Costs are driven by both AWS infrastructure and LLM provider usage. Configuring this solution to meet cost requirements remains the customer's responsibility; the existing cost settings provide a starting point for additional guidance.
Supports multiple model providers: Compatible with the Boto3, OpenAI, and LangGraph SDKs, allowing customers to use the best model for each workload regardless of the provider.
Support for Amazon Bedrock Guardrails: Customers can leverage guardrails created on Amazon Bedrock Guardrails for their generative AI workloads, regardless of the model provider.
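As a concrete, hypothetical example of key and budget management, the following sketch calls the LiteLLM proxy's key-generation endpoint to mint a virtual key with a spending cap. The endpoint path and field names follow LiteLLM's documented management API, but the gateway URL, admin key, and values are placeholders, and exact fields may vary by LiteLLM version.

```python
import requests

GATEWAY_URL = "https://your-gateway.example.com"   # placeholder
ADMIN_KEY = "sk-litellm-admin-key"                 # placeholder master/admin key

# Create a virtual key for a team with a monthly budget; the gateway tracks
# spend against this key and rejects requests once the budget is exhausted.
response = requests.post(
    f"{GATEWAY_URL}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={
        "key_alias": "data-science-team",
        "max_budget": 100.0,          # USD budget for this key
        "budget_duration": "30d",     # reset the budget every 30 days
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["key"])
```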
Common considerations around model deployment include model and prompt resiliency; these factors determine how failures are handled when responding to a prompt or accessing data stores.
Load balancing and failover: The gateway implements sophisticated routing logic that distributes requests across multiple model deployments and automatically fails over to backup providers when issues are detected (a routing sketch follows this list).
Retry logic: Built-in retry mechanisms with exponential back-off facilitate reliable service delivery even when individual providers experience transient issues.
Prompt caching: Intelligent caching helps reduce costs by avoiding duplicate requests to expensive AI models while maintaining response accuracy.
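The sketch below shows how this routing behavior looks when expressed with LiteLLM's Router in Python; in the deployed gateway the equivalent settings live in the proxy configuration. The model IDs, aliases, and parameter names here are illustrative and may differ by LiteLLM version.

```python
from litellm import Router

# Two deployments share the alias "chat"; requests are load balanced across them,
# with retries and a fallback alias used when a provider has issues.
router = Router(
    model_list=[
        {
            "model_name": "chat",
            "litellm_params": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
        },
        {
            "model_name": "chat",
            "litellm_params": {"model": "gpt-4o-mini", "api_key": "YOUR_OPENAI_KEY"},
        },
        {
            "model_name": "chat-backup",
            "litellm_params": {"model": "bedrock/amazon.titan-text-express-v1"},
        },
    ],
    num_retries=2,                          # retry transient provider errors
    fallbacks=[{"chat": ["chat-backup"]}],  # fail over to the backup alias
)

response = router.completion(
    model="chat",
    messages=[{"role": "user", "content": "Hello from the gateway"}],
)
print(response.choices[0].message.content)
```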
Model deployment architecture can range from simple to highly complex. The Multi-Provider Generative AI Gateway features the advanced policy management tools needed to maintain a strong governance posture.
Rate limiting: Configure sophisticated rate limiting policies that can vary by user, API key, model type, or time of day to facilitate fair resource allocation and help prevent abuse (see the sketch after this list).
Model access controls: Restrict access to specific AI models based on user roles, making sure that sensitive or expensive models are only accessible to authorized personnel.
Custom routing rules: Implement business logic that routes requests to specific providers based on criteria such as request type, user location, or cost optimization requirements.
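The following hypothetical sketch layers rate limits and model restrictions onto a virtual key through the same key-generation endpoint used earlier; field names such as rpm_limit and tpm_limit follow LiteLLM's management API, but verify them against the version you deploy.

```python
import requests

GATEWAY_URL = "https://your-gateway.example.com"   # placeholder
ADMIN_KEY = "sk-litellm-admin-key"                 # placeholder master/admin key

# Issue a key that can only call an approved model alias, capped at
# 60 requests and 100k tokens per minute.
response = requests.post(
    f"{GATEWAY_URL}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={
        "key_alias": "reporting-app",
        "models": ["chat"],        # restrict the key to approved model aliases
        "rpm_limit": 60,           # requests per minute
        "tpm_limit": 100_000,      # tokens per minute
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["key"])
```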
As AI workloads grow to include more components, so too do observability needs. The Multi-Provider Generative AI Gateway architecture integrates with Amazon CloudWatch, and this integration lets users configure a range of monitoring and observability solutions, including open source tools such as Langfuse.
Gateway interactions are automatically logged to CloudWatch, providing detailed insight into request activity, errors, and usage patterns across users, models, and providers.
The administrative interface provides real-time log viewing capabilities so administrators can quickly diagnose and resolve usage issues without needing to access CloudWatch directly.
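For administrators who do want to query CloudWatch directly, a sketch like the following can pull recent gateway log events with boto3. The log group name is a placeholder, since the actual group depends on how the guidance provisions logging in your account.

```python
import time
import boto3

logs = boto3.client("logs")

# Placeholder log group; use the group created by your gateway deployment.
LOG_GROUP = "/ecs/litellm-gateway"

# Fetch error-level events from the last hour of gateway logs.
events = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    startTime=int((time.time() - 3600) * 1000),  # milliseconds since epoch
    filterPattern="ERROR",
)
for event in events["events"]:
    print(event["timestamp"], event["message"][:120])
```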
Amazon SageMaker helps enhance the Multi-Provider Generative AI Gateway guidance by providing a comprehensive machine learning platform that integrates seamlessly with the gateway's architecture. By using SageMaker managed infrastructure for model training, deployment, and hosting, organizations can develop custom foundation models or fine-tune existing ones and access them through the gateway alongside models from other providers. This integration removes the need for separate infrastructure management while facilitating consistent governance across both custom and third-party models. SageMaker AI model hosting capabilities expand the gateway's model access to include self-hosted models, as well as those available on Amazon Bedrock, OpenAI, and other providers.
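As an illustration of how a self-hosted model can sit behind the same interface, the sketch below calls a SageMaker endpoint through LiteLLM's sagemaker provider prefix. The endpoint name is hypothetical, and in the deployed gateway this mapping would typically be registered in the gateway's model configuration rather than called directly from client code.

```python
import litellm

# Hypothetical SageMaker endpoint hosting a fine-tuned model; AWS credentials
# are resolved from the environment the same way boto3 resolves them.
response = litellm.completion(
    model="sagemaker/my-custom-llm-endpoint",
    messages=[{"role": "user", "content": "Describe this product in one sentence."}],
)
print(response.choices[0].message.content)
```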
This reference architecture builds upon our contributions to the LiteLLM open source project, enhancing its capabilities for enterprise deployment on AWS. Our enhancements include improved error handling, enhanced security features, and optimized performance for cloud-native deployments.
The Multi-Provider Generative AI Gateway reference architecture is available today through our GitHub repository.
The code repository describes several flexible deployment options to get started.
Use CloudFront to provide a globally distributed, low-latency access point for your generative AI services. The CloudFront edge locations deliver content quickly to users around the world, while AWS Shield Standard helps protect against DDoS attacks. This is the recommended configuration for public-facing AI services with a global user base.
For a more branded experience, you can configure the gateway to use your own custom domain name, while still benefiting from the performance and security features of CloudFront. This option is ideal if you want to maintain consistency with your company’s online presence.
Customers who prioritize low latency over global distribution can opt for a direct-to-ALB deployment, without the CloudFront layer. This simplified architecture can offer cost savings, though it requires extra consideration for web application firewall protection.
For a high level of security, you can deploy the gateway entirely within a private VPC, isolated from the public internet. This configuration is well-suited for processing sensitive data or deploying internal-facing generative AI services. Access is restricted to trusted networks like VPN, Direct Connect, VPC peering, or AWS Transit Gateway.
Ready to simplify your multi-provider AI infrastructure? Access the complete solution package to explore an interactive learning experience with step-by-step guidance through the deployment and management process.
The Multi-Provider Generative AI Gateway is a solution guidance intended to help customers get started working on generative AI solutions in a well-architected manner, while taking advantage of the AWS environment of services and complementary open source packages. Customers can work with models from Amazon Bedrock, Amazon SageMaker JumpStart, or third-party model providers. Operations and management of workloads are conducted via the LiteLLM management interface, and customers can choose to host on ECS or EKS based on their preference.
In addition, we have published a sample that integrates the gateway into an agentic customer service application. The agentic system is orchestrated using LangGraph and deployed on Amazon Bedrock AgentCore. LLM calls are routed through the gateway, providing the flexibility to test agents with different models, whether hosted on AWS or with another provider.
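To give a flavor of that routing, the following sketch points a LangChain chat model at the gateway's OpenAI-compatible endpoint, which is how a LangGraph agent node would typically call it. The URL, key, and model alias are placeholders, and the published sample may wire this up differently.

```python
from langchain_openai import ChatOpenAI

# Placeholder gateway endpoint and virtual key; a LangGraph node can use this
# model object like any other chat model, while the gateway handles routing.
llm = ChatOpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="sk-litellm-virtual-key",
    model="bedrock-claude",  # gateway-defined alias; swap to test other providers
)

print(llm.invoke("Summarize the customer's last support ticket.").content)
```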
This guidance is just one part of a mature generative AI foundation on AWS. For deeper reading on the components of a generative AI system on AWS, see Architect a mature generative AI foundation on AWS.