ML 19630 1
Protein researchers face a time-consuming challenge: manually searching through thousands of peptide sequences to find structurally similar candidates is slow, error-prone, and requires deep domain expertise to interpret results. Building a protein research copilot can transform how researchers search for structurally similar peptides across large datasets — enabling natural language queries, automated embedding generation, and AI-powered result summarization in a single conversational interface.
This post shows you how to build a conversational protein research assistant that combines three capabilities:
The system uses the Strands Agents SDK to orchestrate three specialized tools within one agent, deploys to Amazon Bedrock AgentCore for production serving, and stores peptide embeddings in Amazon Aurora PostgreSQL-Compatible Edition with pgvector.
By the end of this post, you will have built an end-to-end agent application that demonstrates how to:
To follow along with this post, you need:
bedrock-agentcore-starter-toolkit installed (pip install bedrock-agentcore-starter-toolkit).The copilot follows a tool-use pattern where a single Strands agent orchestrates three specialized tools to handle the complete research workflow. When a researcher submits a natural language query, the agent parses it into structured parameters, searches for similar peptides using protein embeddings, and summarizes the results with scientific context.
The following diagram illustrates the architecture:
This architecture has five components:
@tool decorator.This single-runtime, multi-tool design keeps the deployment simple while maintaining clear separation of concerns. Each tool encapsulates a distinct capability, and the orchestrator agent decides when and how to invoke them based on the user’s query.
The core of the similarity search is ESM-C 300M, a protein language model from EvolutionaryScale (Built with ESM) that produces 960-dimensional embeddings capturing structural and functional properties of amino acid sequences. Two peptides with similar biological function produce embeddings that are close in vector space, enabling similarity search without requiring sequence alignment.
ESM-C 300M is deployed as an Amazon SageMaker AI serverless endpoint, which scales to zero when idle and incurs no cost between invocations. The model weights are bundled into the deployment artifact to avoid downloading from HuggingFace at inference time — critical for serverless endpoints where cold start latency matters.
The inference handler constructs the model architecture directly and loads pre-packaged weights:
The predict_fn handler takes a protein sequence, encodes it, and returns the mean-pooled embedding:
The endpoint is deployed as a serverless configuration with 6144 MB memory and a max concurrency of 5, using the PyTorch 2.6.0 CPU inference container. The model packaging script downloads weights once via from_pretrained, saves the state dict, and bundles it with the inference code into a model.tar.gz with the required code/ directory structure for SageMaker AI.
Peptide embeddings are stored in Amazon Aurora PostgreSQL-Compatible Edition Serverless v2 with the pgvector extension. The database schema is straightforward:
The properties JSONB column stores biological metadata — species, source organism, source molecule, epitope positions — enabling combined vector and metadata filtering. For example, a query like “Find peptides similar to LPAIVREAI from dengue virus” triggers both a cosine similarity search on the embedding column and a filter on properties->>'species'.
The data loading pipeline reads from the IEDB virus epitope dataset, generates embeddings for each peptide sequence via the SageMaker AI endpoint, and inserts them into the database using the Amazon RDS Data API. The initial load samples 1,000 linear peptides:
Database access goes through the Amazon Relational Database Service (Amazon RDS) Data API, which means the agent runtime does not need direct network connectivity to the database — it communicates over HTTPS, simplifying the networking requirements for AgentCore deployment.
The Strands Agents SDK provides a clean abstraction for building tool-using agents. Each tool is a Python function decorated with @tool, and the agent automatically generates tool descriptions for the LLM from the function’s docstring and type hints.
The parser tool delegates to a dedicated Strands agent that acts as a structured output extractor:
The searcher tool combines SageMaker AI embedding generation with pgvector similarity search:
The summarizer tool uses another dedicated Strands agent for scientific analysis:
The orchestrator ties everything together. It receives the user’s query and decides which tools to call and in what order:
This design uses the “agents-as-tools” pattern: the parser and summarizer are themselves Strands agents, but they are wrapped in @tool decorators and exposed to the orchestrator as callable tools. The orchestrator does not know or care that these tools internally use LLMs — it calls them as functions. This keeps the orchestration logic clean while allowing each tool to leverage LLM capabilities where needed.
Amazon Bedrock AgentCore provides a managed runtime for hosting AI agents. The agent code runs in a containerized environment built and deployed via AWS CodeBuild — no local Docker installation is required.
The AgentCore runtime expects an entrypoint function that receives a payload and context:
The entrypoint captures tool outputs in a shared dictionary so that the response includes structured data (parsed query, search results table, summary text) instead of the agent’s final text output alone. This structured response is what the Streamlit frontend uses to render tables and expandable sections.
The deployment uses AWS CloudFormation for all infrastructure. The VPC stack creates private subnets with NAT gateways and VPC endpoints for Amazon Bedrock, Amazon RDS Data API, and AWS Secrets Manager — helping to ensure the agent runtime can reach all required services without traversing the public internet.
Amazon Aurora PostgreSQL-Compatible Edition Serverless v2 database will be required with automatic scaling from 0.5 to 4 ACUs (1–8 GB RAM). An AWS Lambda-backed custom resource initializes the pgvector extension and creates the peptides table during stack creation:
The solution requires the following components, deployed in order:
Warning: Complete the deployment steps in order. Skipping steps may result in deployment failures.
The frontend is a lightweight Streamlit application that communicates with the AgentCore runtime via the bedrock-agentcore boto3 client. It runs on AWS Fargate with a minimal container image that includes only streamlit, pandas, and boto3 — no ML libraries.
The UI displays results in three sections: the parsed query parameters (expandable), a sortable table of similar peptides with cosine distances and metadata, and the AI-generated scientific summary. Users can download results as CSV for further analysis.
The following screenshot shows the search query and the results.
Before deploying this solution to production, keep the following design and operational trade-offs in mind:
Cold start latency. The SageMaker AI serverless endpoint takes 2–3 minutes on the first invocation after an idle period while the container initializes and loads model weights. Subsequent invocations within the keep-alive window complete in seconds. For latency-sensitive workloads, consider a provisioned endpoint or setting a higher provisioned concurrency on the serverless configuration.
Embedding model choice. We use ESM-C 300M for its balance of embedding quality and inference speed on CPU. For higher accuracy on structural similarity tasks, ESM-C 600M or ESM2 models offer larger embedding dimensions at the cost of increased memory and latency. The 960-dimensional embeddings from ESM-C 300M provide strong performance for peptide similarity search in testing.
Scaling the dataset. The initial load uses 1,000 sampled peptides from the IEDB dataset. For production use with larger datasets, consider batch-loading embeddings, increasing the IVFFlat index lists parameter proportionally, and scaling Aurora ACUs accordingly. The Amazon RDS Data API has a 1 MB response size limit, so queries returning large result sets may need pagination.
Cost. The serverless components (SageMaker AI serverless endpoint, Aurora Serverless v2, AgentCore runtime) scale to near-zero when idle, making this architecture cost-effective for research workloads with intermittent usage patterns. The primary ongoing costs during active use are Bedrock LLM inference (three calls per query: parser, orchestrator, summarizer) and SageMaker AI endpoint invocations.
To avoid ongoing charges, delete the resources in the following order:
Warning: Delete resources in reverse order to avoid dependency errors.
This post showed you how to build a protein research copilot that combines protein language model embeddings with LLM-powered analysis in a single conversational interface.
What traditionally requires a researcher to manually query sequence databases, run alignment tools, and interpret results across multiple applications — a process that can take hours per search — is reduced to a single natural language query that returns ranked, summarized results in under a minute (or 2–3 minutes on cold start). This consolidation of parsing, embedding-based search, and scientific summarization into one conversational workflow can significantly accelerate the early stages of peptide research and candidate screening.
The Strands Agents SDK’s tool-use pattern provides a clean way to compose specialized capabilities — parsing, searching, summarizing — into a coherent workflow, while Amazon Bedrock AgentCore handles the operational complexity of hosting and scaling the agent.
The same architecture generalizes beyond peptide research. Domains where researchers need to search over specialized embeddings, filter by structured metadata, and synthesize results — genomics, drug design, materials science — can benefit from this pattern of combining domain-specific embedding models with LLM orchestration. The key design decisions that make this practical are: bundling model weights to avoid cold-start downloads, using the Amazon RDS Data API to simplify networking, and automating the deployment with infrastructure as code.
As next steps, consider exploring larger ESM models for higher embedding accuracy, adding support for batch queries, or extending the metadata schema to include additional biological annotations from the IEDB dataset.
Vita R, Blazeska N, Marrama D; IEDB Curation Team Members; Duesing S, Bennett J, Greenbaum J, De Almeida Mendes M, Mahita J, Wheeler DK, Cantrell JR, Overton JA, Natale DA, Sette A, Peters B. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids Res. 2025 Jan 6;53(D1):D436-D443. doi: 10.1093/nar/gkae1092. PMID: 39558162; PMCID: PMC11701597.
ESM Team. “ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning.” EvolutionaryScale, 2024. https://evolutionaryscale.ai/blog/esm-cambrian
Hey everyone, We're the team behind Krea, and today we're launching Krea 2, our new…
The current era of Generative AI seems to primarily focus on chat interfaces and prompts,…
Protecting sensitive data used with AI is a critical part of our commitment to providing…
It's one of the best times to snag yourself a Dyson device, whether it's a…
Spiking neural networks (SNNs) are artificial intelligence (AI) models inspired by how biological neurons communicate…
the joker stairs but it's a waterfall now 🌊 wide shots land clean, close-ups are…