Bedrock SAW
As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Manually investigating and resolving problems can be time-consuming and error-prone, especially when dealing with intricate systems. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, which is a collection of curated AWS Systems Manager self-service automation runbooks. These runbooks are created by AWS Support Engineering with best practices learned from solving customer issues. They enable AWS customers to troubleshoot, diagnose, and remediate common issues with their AWS resources.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
In this post, we explore how to use the power of Amazon Bedrock Agents and AWS Support Automation Workflows to create an intelligent agent capable of troubleshooting issues with AWS resources.
Although the solution is versatile and can be adapted to use a variety of AWS Support Automation Workflows, we focus on a specific example: troubleshooting an Amazon Elastic Kubernetes Service (Amazon EKS) worker node that failed to join a cluster. The following diagram provides a high-level overview of troubleshooting agents with Amazon Bedrock.
Our solution is built around the following key components that work together to provide a seamless and efficient troubleshooting experience:
The following steps outline the workflow of our solution:
During the reasoning phase of the agent, the user is able to view the reasoning steps.
Let’s take a closer look at a common issue we mentioned earlier and how our agent can assist in troubleshooting it.
When an EKS worker node fails to join an EKS cluster, our Amazon Bedrock agent can be invoked with the relevant information: cluster name and worker node ID. The agent will execute the corresponding AWS Support Automation Workflow, which will perform checks like verifying the worker node’s IAM role permissions and verifying the necessary network connectivity.
The automation workflow will run all the checks. Then Amazon Bedrock agent will ingest the troubleshooting, explain the root cause of the issue to the user, and suggest remediation steps based on the AWSSupport-TroubleshootEKSWorkerNode
output, such as updating the worker node’s IAM role or resolving network configuration issues, enabling them to take the necessary actions to resolve the problem.
When you create an action group in Amazon Bedrock, you must define the parameters that the agent needs to invoke from the user. You can also define API operations that the agent can invoke using these parameters. To define the API operations, we will create an OpenAPI schema in JSON:
"Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post": {
"properties": {
"cluster_name": {
"type": "string",
"title": "Cluster Name",
"description": "The name of the EKS cluster"
},
"worker_id": {
"type": "string",
"title": "Worker Id",
"description": "The ID of the worker node"
}
},
"type": "object",
"required": [
"cluster_name",
"worker_id"
],
"title": "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"
}
The schema consists of the following components:
troubleshoot-eks-worker_node
POST endpoint.The OpenAPI schema defines the structure of the request body. To learn more, see Define OpenAPI schemas for your agent’s action groups in Amazon Bedrock and OpenAPI specification.
Now let’s explore the Lambda function code:
@app.post("/troubleshoot-eks-worker-node")
@tracer.capture_method
def troubleshoot_eks_worker_node(
cluster_name: Annotated[str, Body(description="The name of the EKS cluster")],
worker_id: Annotated[str, Body(description="The ID of the worker node")]
) -> dict:
"""
Troubleshoot EKS worker node that failed to join the cluster.
Args:
cluster_name (str): The name of the EKS cluster.
worker_id (str): The ID of the worker node.
Returns:
dict: The output of the Automation execution.
"""
return execute_automation(
automation_name='AWSSupport-TroubleshootEKSWorkerNode',
parameters={
'ClusterName': [cluster_name],
'WorkerID': [worker_id]
},
execution_mode='TroubleshootWorkerNode'
)
The code consists of the following components
/troubleshoot-eks-worker-node
endpoint. The description parameter provides a brief explanation of what this endpoint does.cluster_name
is a string type and is expected to be passed in the request body. The Body decorator is used to indicate that this parameter should be extracted from the request body. The description parameter provides a brief explanation of what this parameter represents.worker_id
is a string type and is expected to be passed in the request body.To link a new SAW runbook in the Lambda function, you can follow the same template.
Make sure you have the following prerequisites:
Complete the following steps to deploy the solution:
$ git clone https://github.com/aws-samples/sample-bedrock-agent-for-troubleshooting-aws-resources.git
$ cd bedrock-agent-for-troubleshooting-aws-resources
$ npm install
$ export AWS_PROFILE=PROFILE_NAME
$ cdk bootstrap
$ cdk deploy --all
Navigate to the Amazon Bedrock Agents console in your Region and find your deployed agent. You will find the agent ID in the cdk deploy
command output.
You can now interact with the agent and test troubleshooting a worker node not joining an EKS cluster. The following are some example questions:
The following screenshot shows the console view of the agent.
The agent understood the question and mapped it with the right action group. It also spotted that the parameters needed are missing in the user prompt. It came back with a follow-up question to require the Amazon Elastic Compute Cloud (Amazon EC2) instance ID and EKS cluster name.
We can see the agent’s thought process in the trace step 1. The agent assesses the next step as ready to call the right Lambda function and right API path.
With the results coming back from the runbook, the agent now reviews the troubleshooting outcome. It goes through the information and will start writing the solution where it provides the instructions for the user to follow.
In the answer provided, the agent was able to spot all the issues and transform that into solution steps. We can also see the agent mentioning the right information like IAM policy and the required tag.
When implementing Amazon Bedrock Agents, there are no additional charges for resource construction. However, costs are incurred for embedding model and text model invocations on Amazon Bedrock, with charges based on the pricing of each FM used. In this use case, you will also incur costs for Lambda invocations.
To avoid incurring future charges, delete the created resources by the AWS CDK. From the root of your repository folder, run the following command:
$ npm run cdk destroy --all
Amazon Bedrock Agents and AWS Support Automation Workflows are powerful tools that, when combined, can revolutionize AWS resource troubleshooting. In this post, we explored a serverless application built with the AWS CDK that demonstrates how these technologies can be integrated to create an intelligent troubleshooting agent. By defining action groups within the Amazon Bedrock agent and associating them with specific scenarios and automation workflows, we’ve developed a highly efficient process for diagnosing and resolving issues such as Amazon EKS worker node failures.
Our solution showcases the potential for automating complex troubleshooting tasks, saving time and streamlining operations. Powered by Anthropic’s Claude 3.5 Sonnet, the agent demonstrates improved understanding and responding in languages other than English, such as French, Japanese, and Spanish, making it accessible to global teams while maintaining its technical accuracy and effectiveness. The intelligent agent quickly identifies root causes and provides actionable insights, while automatically executing relevant AWS Support Automation Workflows. This approach not only minimizes downtime, but also scales effectively to accommodate various AWS services and use cases, making it a versatile foundation for organizations looking to enhance their AWS infrastructure management.
Explore the AWS Support Automation Workflow for additional use cases and consider using this solution as a starting point for building more comprehensive troubleshooting agents tailored to your organization’s needs. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.
Happy coding!
The authors thank all the reviewers for their valuable feedback.
It's actually been out for a few days but since I haven't found any discussion…
Empathy and trust are not optional. They are essential for scaling change and encouraging innovation,…
The US concentrated its attack on Fordow, an enrichment plant built hundreds of feet underground.…
AI is revolutionizing the job landscape, prompting nations worldwide to prepare their workforces for dramatic…
Here's v2 of a project I started a few days ago. This will probably be…
We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance…