Recently, we’ve been witnessing the rapid development and evolution of generative AI applications, with observability and evaluation emerging as critical aspects for developers, data scientists, and stakeholders. Observability refers to the ability to understand the internal state and behavior of a system by analyzing its outputs, logs, and metrics. Evaluation, on the other hand, involves assessing the quality and relevance of the generated outputs, enabling continual improvement.
Comprehensive observability and evaluation are essential for troubleshooting, identifying bottlenecks, optimizing applications, and providing relevant, high-quality responses. Observability empowers you to proactively monitor and analyze your generative AI applications, and evaluation helps you collect feedback, refine models, and enhance output quality.
In the context of Amazon Bedrock, observability and evaluation become even more crucial. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. As the complexity and scale of these applications grow, providing comprehensive observability and robust evaluation mechanisms are essential for maintaining high performance, quality, and user satisfaction.
We have built a custom observability solution that Amazon Bedrock users can quickly implement using just a few key building blocks and existing logs using FMs, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. This solution uses decorators in your application code to capture and log metadata such as input prompts, output results, run time, and custom metadata, offering enhanced security, ease of use, flexibility, and integration with native AWS services.
Notably, the solution supports comprehensive Retrieval Augmented Generation (RAG) evaluation so you can assess the quality and relevance of generated responses, identify areas for improvement, and refine the knowledge base or model accordingly.
In this post, we set up the custom solution for observability and evaluation of Amazon Bedrock applications. Through code examples and step-by-step guidance, we demonstrate how you can seamlessly integrate this solution into your Amazon Bedrock application, unlocking a new level of visibility, control, and continual improvement for your generative AI applications.
By the end of this post, you will:
To implement the observability solution discussed in this post, you need the following prerequisites:
The observability solution for Amazon Bedrock empowers users to track and analyze interactions with FMs, knowledge bases, guardrails, and agents using decorators in their source code. Key highlights of the solution include:
Here’s a high-level overview of the observability solution architecture:
The following steps explain how the solution works:
@bedrock_logs.watch
to save the logcall_type
variableThis architecture provides comprehensive logging, efficient data processing, and powerful analytics capabilities for your Amazon Bedrock applications.
To help you get started with the observability solution, we have provided example notebooks in the attached GitHub repository, covering knowledge bases, evaluation, and agents for Amazon Bedrock. These notebooks demonstrate how to integrate the solution into your Amazon Bedrock application and showcase various use cases and features including feedback collected from users or quality assurance (QA) teams.
The repository contains well-documented notebooks that cover topics such as:
To get started with the example notebooks, follow these steps:
These notebooks provide a hands-on learning experience and serve as a starting point for integrating our solution into your generative AI applications. Feel free to explore, modify, and adapt the code examples to suit your specific requirements.
The solution offers a range of powerful features to streamline observability and evaluation for your generative AI applications on Amazon Bedrock:
InvokeModel
, batch inference, knowledge bases, agents, and guardrails, providing a unified solution for your generative AI applicationsThis concise list highlights the key features you can use to gain insights, optimize performance, and drive continual improvement for your generative AI applications on Amazon Bedrock. For a detailed breakdown of the features and implementation specifics, refer to the comprehensive documentation in the GitHub repository.
The solution is designed to be modular and flexible so you can customize it according to your specific requirements. Although the implementation is straightforward, following best practices is crucial for the scalability, security, and maintainability of your observability infrastructure.
This solution includes an AWS CloudFormation template that streamlines the deployment of required AWS resources, providing consistent and repeatable deployments across environments. The CloudFormation template provisions resources such as Amazon Data Firehose delivery streams, AWS Lambda functions, Amazon S3 buckets, and AWS Glue crawlers and databases.
The solution uses the decorator pattern to integrate observability logging into your application functions seamlessly. The @bedrock_logs.watch
decorator wraps your functions, automatically logging inputs, outputs, and metadata to Amazon Kinesis Firehose. Here’s an example of how to use the decorator:
The solution supports human-in-the-loop evaluation so you can incorporate human feedback into the performance evaluation of your generative AI application. You can involve end users, experts, or QA teams in the evaluation process, providing insights to enhance output quality and relevance. Here’s an example of how you can implement human-in-the-loop evaluation:
By using the run_id
and observation_id
generated, you can associate human feedback with specific model responses or sessions. This feedback can then be analyzed and used to refine the knowledge base, fine-tune models, or identify areas for improvement.
It’s recommended to follow these best practices:
call_type
) for your observability data based on different workflows or application components. This enables easier analysis and processing of specific data subsets.feedback_variables=True
when initializing BedrockLogs
to generate run_id
and observation_id
. These IDs can be used to join logically partitioned datasets, associating feedback data with corresponding model responses.capture_input
and capture_output
parameters to selectively log function inputs or outputs or exclude sensitive information or large data structures that might not be relevant for observability.KnowledgeBasesEvaluations
By following these best practices and using the features of the solution, you can set up comprehensive observability and evaluation for your generative AI applications to gain valuable insights, identify areas for improvement, and enhance the overall user experience.
In the next post in this three-part series, we dive deeper into observability and evaluation for RAG and agent-based generative AI applications, providing in-depth insights and guidance.
To avoid incurring costs and maintain a clean AWS account, you can remove the associated resources by deleting the AWS CloudFormation stack you created for this walkthrough. You can follow the steps provided in the Deleting a stack on the AWS CloudFormation console documentation to delete the resources created for this solution.
This comprehensive solution empowers you to seamlessly integrate comprehensive observability into your generative AI applications in Amazon Bedrock. Key benefits include streamlined integration, selective logging, custom metadata tracking, and comprehensive evaluation capabilities, including RAG evaluation. Use AWS services such as Athena to analyze observability data, drive continual improvement, and connect with your favorite dashboard tool to visualize the data.
This post focused is on Amazon Bedrock, but it can be extended to broader machine learning operations (MLOps) workflows or integrated with other AWS services such as AWS Lambda or Amazon SageMaker. We encourage you to explore this solution and integrate it into your workflows. Access the source code and documentation in our GitHub repository and start your integration journey. Embrace the power of observability and unlock new heights for your generative AI applications.
TL;DR A conversation with 4o about the potential demise of companies like Anthropic. As artificial…
Whether a company begins with a proof-of-concept or live deployment, they should start small, test…
Digital tools are not always superior. Here are some WIRED-tested agendas and notebooks to keep…
Machine learning (ML) models are built upon data.
Editor’s note: This is the second post in a series that explores a range of…
David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*,…