As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face increasing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.
At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using a continuous integration and deployment (CI/CD) pipeline, a multi-agent workflow can help an operations team streamline the management of EKS clusters. The workflow manager agent can integrate with individual agents that interface with individual observability signals and a CI/CD workflow to orchestrate and perform tasks based on the user's prompt.
In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents—deriving insights from K8sGPT and performing actions through the ArgoCD framework—you can build a comprehensive automation workflow that identifies, analyzes, and resolves cluster issues with minimal human intervention.
The architecture consists of the following core components:
The following diagram illustrates the solution architecture.
You need to have the following prerequisites in place:
At the time of writing, the multi-agent collaboration capability for Amazon Bedrock is available in the us-east-1 AWS Region. Hence, this solution is restricted to running in the us-east-1 Region only.
We start with installing and configuring the K8sGPT operator and ArgoCD controller on the EKS cluster.
The K8sGPT operator will help with enabling AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.
ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.
The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This powerful integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.
kubectl create ns helm-guestbook
kubectl create ns k8sgpt-operator-system
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
  --namespace k8sgpt-operator-system
kubectl get pods -n k8sgpt-operator-system
NAME READY STATUS RESTARTS AGE
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd 2/2 Running 0 1d
After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) will have the large language model (LLM) configuration that will aid in AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends to help in AI-powered analysis. For this post, we use Amazon Bedrock as the backend and Anthropic’s Claude V3 as the LLM.
eksctl create podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --service-account-name k8sgpt --role-name k8sgpt-app-eks-pod-identity-role --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess --region $AWS_REGION
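If you want to confirm the association before moving on, eksctl can list the pod identity associations for the cluster (a quick optional check; the cluster name matches the command above):
eksctl get podidentityassociation --cluster PetSite --namespace k8sgpt-operator-system --region $AWS_REGION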
cat << EOF > k8sgpt.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: anthropic.claude-v3
    backend: amazonbedrock
    region: us-east-1
    credentials:
      secretRef:
        name: k8sgpt-secret
        namespace: k8sgpt-operator-system
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.48
EOF
kubectl apply -f k8sgpt.yaml
kubectl get pods -n k8sgpt-operator-system
NAME READY STATUS RESTARTS AGE
k8sgpt-bedrock-5b655cbb9b-sn897 1/1 Running 9 (22d ago) 22d
release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd 2/2 Running 3 (10h ago) 22d
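With the k8sgpt-bedrock pod running, the operator periodically scans the cluster and records its findings as Result custom resources (results.core.k8sgpt.ai) in the operator namespace. As an optional sanity check, you can list and inspect them (result names will vary per cluster, and the list may be empty until an issue is detected):
# List the analysis results produced by the K8sGPT operator
kubectl get results.core.k8sgpt.ai -n k8sgpt-operator-system
# Inspect one result for the detected issue and its AI-generated explanation
kubectl describe results.core.k8sgpt.ai <result-name> -n k8sgpt-operator-system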
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
kubectl create namespace argocd
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace
kubectl get pods -n argocd
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 43d
argocd-applicationset-controller-5c787df94f-7jpvp 1/1 Running 0 43d
argocd-dex-server-55d5769f46-58dwx 1/1 Running 0 43d
argocd-notifications-controller-7ccbd7fb6-9pptz 1/1 Running 0 43d
argocd-redis-587d59bbc-rndkp 1/1 Running 0 43d
argocd-repo-server-76f6c7686b-rhjkg 1/1 Running 0 43d
argocd-server-64fcc786c-bd2t8 1/1 Running 0 43d
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get svc argocd-server -n argocd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-server LoadBalancer 10.100.168.229 a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com 80:32334/TCP,443:32261/TCP 43d
export argocdpassword=$(kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d)
echo ArgoCD admin password - $argocdpassword
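Optionally, if you have the argocd CLI installed locally, you can use the load balancer hostname and the password retrieved above to verify that you can authenticate to ArgoCD (a verification sketch; not required for the rest of the walkthrough):
# Look up the ArgoCD server hostname and log in as admin
export ARGOCD_SERVER=$(kubectl get svc argocd-server -n argocd -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
argocd login $ARGOCD_SERVER --username admin --password "$argocdpassword" --insecure
argocd app list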
aws secretsmanager create-secret \
    --name argocdcreds \
    --description "Credentials for argocd" \
    --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
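To confirm that the credentials were stored correctly, you can read the secret back (note that this prints the password to your terminal):
aws secretsmanager get-secret-value --secret-id argocdcreds --query SecretString --output text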
cat << EOF > argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: helm-guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/awsvikram/argocd-example-apps
    targetRevision: HEAD
    path: helm-guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: helm-guestbook
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
kubectl apply -f argocd-application.yaml
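After a short while, ArgoCD should report the application as synced and healthy. A quick way to confirm this from the command line is to check the Application resource and the pods it deployed:
kubectl get applications -n argocd
kubectl get pods -n helm-guestbook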
kubectl -n k8sgpt-operator-system rollout restart deploy
deployment.apps/k8sgpt-bedrock restarted
deployment.apps/k8sgpt-operator-controller-manager restarted
We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. Deploying the CloudFormation template creates several resources (costs will be incurred for the AWS resources used).
Use the following parameters for the CloudFormation template:
kubectl get service argocd-server -n argocd -ojsonpath="{.status.loadBalancer.ingress[0].hostname}"
The stack creates the following AWS Lambda functions:
<Stack name>-LambdaK8sGPTAgent-<auto-generated>
<Stack name>-RestartRollBackApplicationArgoCD-<auto-generated>
<Stack name>-ArgocdIncreaseMemory-<auto-generated>
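If you want to confirm these functions after the stack finishes deploying, one option is to list the Lambda resources the stack created (the stack name EKS-Troubleshooter is the one used later in this post; substitute your own if it differs):
aws cloudformation describe-stack-resources \
    --stack-name EKS-Troubleshooter \
    --query "StackResources[?ResourceType=='AWS::Lambda::Function'].PhysicalResourceId" \
    --output table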
The stack creates the following Amazon Bedrock agents:
ArgoCDAgent, with the following action groups: argocd-rollback, argocd-restart, and argocd-memory-management
K8sGPTAgent, with the following action group: k8s-cluster-operations
CollaboratorAgent, with the ArgoCDAgent and K8sGPTAgent agents associated to it
The stack outputs the following:
K8sGPTAgentAliasId, the ID of the K8sGPT Amazon Bedrock agent alias
ArgoCDAgentAliasId, the ID of the ArgoCD Amazon Bedrock agent alias
CollaboratorAgentAliasId, the ID of the collaborator Amazon Bedrock agent alias
To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy
to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.
export CFN_STACK_NAME=EKS-Troubleshooter
export EKS_CLUSTER=PetSite
export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
aws eks create-access-entry \
    --cluster-name $EKS_CLUSTER \
    --principal-arn $K8SGPT_LAMBDA_ROLE
aws eks associate-access-policy \
    --cluster-name $EKS_CLUSTER \
    --principal-arn $K8SGPT_LAMBDA_ROLE \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope type=cluster
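You can verify that the access entry exists and that the access policy is associated with the Lambda function's execution role:
aws eks list-access-entries --cluster-name $EKS_CLUSTER
aws eks list-associated-access-policies \
    --cluster-name $EKS_CLUSTER \
    --principal-arn $K8SGPT_LAMBDA_ROLE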
Now we can test the solution. We explore the following two scenarios:
In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”
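If you prefer to send the prompt programmatically rather than through the Amazon Bedrock console, an invocation might look like the following sketch. The agent ID placeholder is hypothetical (replace it with your collaborator agent's ID from the Amazon Bedrock console); the alias ID comes from the CollaboratorAgentAliasId stack output:
# Retrieve the collaborator agent alias ID from the CloudFormation stack outputs
export COLLABORATOR_ALIAS_ID=$(aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME \
    --query "Stacks[0].Outputs[?OutputKey=='CollaboratorAgentAliasId'].OutputValue" --output text)
# Send the troubleshooting prompt to the collaborator agent; the response is written to response.json
aws bedrock-agent-runtime invoke-agent \
    --agent-id <collaborator-agent-id> \
    --agent-alias-id $COLLABORATOR_ALIAS_ID \
    --session-id eks-troubleshooting-demo \
    --input-text "We got a down alert for the memory-demo app. Help us with the root cause of the issue." \
    response.json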
The agent not only stated the root cause, but went one step further to potentially fix the error, which in this case is increasing memory resources to the application.
For this scenario, we continue from the previous prompt. We feel the application wasn’t provided enough memory, and it should be increased to permanently fix the issue. We can also tell the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.
Let’s now proceed to increase the memory, as shown in the following screenshot.
The agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. The same can be confirmed in the ArgoCD UI.
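To double-check from the cluster side, you can inspect the memory limits on the affected pods after the sync completes. The namespace below is an assumption (adjust it to wherever the memory-demo application runs in your cluster):
# Print each pod's name and its first container's memory limit
kubectl get pods -n memory-demo \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.memory}{"\n"}{end}'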
If you decide to stop using the solution, complete the following steps:
By orchestrating multiple Amazon Bedrock agents, we’ve demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. This integration of K8sGPT analysis and ArgoCD deployment automation showcases the powerful possibilities when combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it’s important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.
As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automation workflows to meet your organization’s specific needs.
To learn more about Amazon Bedrock, refer to the following resources: