
Automated Reasoning checks rewriting chatbot reference implementation

Today, we are publishing a new open source sample chatbot that shows how to use feedback from Automated Reasoning checks to iterate on the generated content, ask clarifying questions, and prove the correctness of an answer.

The chatbot implementation also produces an audit log that includes mathematically verifiable explanations of an answer’s validity, and a user interface that shows developers the iterative rewriting process happening behind the scenes. Automated Reasoning checks use logical deduction to automatically demonstrate that a statement is correct. Unlike large language models, Automated Reasoning tools do not guess or predict accuracy; they rely on mathematical proofs to verify compliance with policies. This blog post dives deeper into the implementation architecture of the Automated Reasoning checks rewriting chatbot.

Improve accuracy and transparency with Automated Reasoning checks

LLMs can sometimes generate responses that sound convincing but contain factual errors—a phenomenon known as hallucination. Automated Reasoning checks validate a user’s question and an LLM-generated answer, giving rewriting feedback that points out ambiguous statements, assertions that are too broad, and factually incorrect claims based on ground truth knowledge encoded in Automated Reasoning policies.

A chatbot that uses Automated Reasoning checks to iterate on its answers before presenting them to users improves accuracy, because it can make precise statements that explicitly answer users’ yes/no questions without leaving room for ambiguity. It also improves transparency, because it can provide mathematically verifiable proofs of why its statements are correct, making generative AI applications auditable and explainable even in regulated environments.

Now that you understand the benefits, let’s explore how you can implement this in your own applications.

Chatbot reference implementation

The chatbot is a Flask application that exposes APIs to submit questions and check the status of an answer. To show the inner workings of the system, the APIs also let you retrieve information about the status of each iteration, the feedback from Automated Reasoning checks, and the rewriting prompt sent to the LLM.
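As a minimal sketch of how a client could drive these APIs, the snippet below submits a question and polls for the result. The endpoint paths, field names, and response shape are illustrative assumptions; the actual Flask routes are defined in the repository.

```python
import time
import requests

BASE_URL = "http://localhost:8080"  # default address used later in this post

# Submit a question; the endpoint path is a hypothetical placeholder.
resp = requests.post(
    f"{BASE_URL}/api/threads",
    json={"question": "How much does S3 storage cost?"},
)
thread_id = resp.json()["thread_id"]

# Poll the thread status until the rewriting loop completes or pauses for user input.
while True:
    status = requests.get(f"{BASE_URL}/api/threads/{thread_id}").json()
    if status["state"] in ("COMPLETE", "AWAITING_USER_INPUT"):
        break
    time.sleep(2)

# Each stored iteration exposes the Automated Reasoning findings and the rewrite prompt.
for iteration in status.get("iterations", []):
    print(iteration["findings"], iteration["rewrite_prompt"])
```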

You can use the frontend Node.js application to configure an LLM from Amazon Bedrock to generate answers, select an Automated Reasoning policy for validation, and set the maximum number of iterations used to correct an answer. Selecting a chat thread in the user interface opens a debug panel on the right that displays each content iteration and its validation output.

Figure 1 – Chat interface with debug panel

Once Automated Reasoning checks determine that a response is valid, the verifiable explanation for its validity is displayed.

Figure 2 – Automated Reasoning checks validity proof

How the iterative rewriting loop works

The open source reference implementation automatically helps improve chatbot answers by iterating on the feedback from Automated Reasoning checks and rewriting the response. When asked to validate a chatbot question and answer (Q&A), Automated Reasoning checks return a list of findings. Each finding represents an independent logical statement identified in the input Q&A. For example, for the Q&A “How much does S3 storage cost? In US East (N. Virginia), S3 costs $0.023/GB for the first 50 TB; in Asia Pacific (Sydney), S3 costs $0.025/GB for the first 50 TB,” Automated Reasoning checks would produce two findings: one that validates that the S3 price in us-east-1 is $0.023/GB, and one that validates the price in ap-southeast-2.

When parsing a finding for a Q&A, Automated Reasoning checks separate the input into a list of factual premises and claims made against those premises. A premise can be a factual statement in the user question, like “I’m an S3 user in Virginia,” or an assumption laid out in the answer, like “For requests sent to us-east-1…” A claim represents a statement being verified. In our S3 pricing example from the previous paragraph, the Region would be a premise, and the price point would be a claim.

Each finding includes a validation result (VALID, INVALID, SATISFIABLE, TRANSLATION_AMBIGUOUS, IMPOSSIBLE) as well as the feedback necessary to rewrite the answer so that it is VALID. The feedback changes depending on the validation result. For example, ambiguous findings include two interpretations of the input text, while satisfiable findings include two scenarios that show how the claims could be true in some cases and false in others. You can see the possible finding types in our API documentation.
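To make this structure concrete, here is a minimal sketch of how a finding could be modeled in Python. The field names are illustrative assumptions for this post, not the exact shape of the ApplyGuardrail response.

```python
from dataclasses import dataclass, field
from enum import Enum

class Result(Enum):
    VALID = "VALID"
    INVALID = "INVALID"
    SATISFIABLE = "SATISFIABLE"
    TRANSLATION_AMBIGUOUS = "TRANSLATION_AMBIGUOUS"
    IMPOSSIBLE = "IMPOSSIBLE"

@dataclass
class Finding:
    result: Result
    premises: list[str] = field(default_factory=list)   # facts taken from the question or answer
    claims: list[str] = field(default_factory=list)     # statements being verified against the policy
    feedback: list[str] = field(default_factory=list)   # rules, scenarios, or interpretations returned by the checks

# The S3 pricing example from this section would yield two findings, for instance:
us_east_finding = Finding(
    result=Result.VALID,
    premises=["Region is US East (N. Virginia)"],
    claims=["S3 costs $0.023/GB for the first 50 TB"],
)
```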

With this context out of the way, we can dive deeper into how the reference implementation works:

Initial response and validation

When the user submits a question through the UI, the application first calls the configured Bedrock LLM to generate an answer, then calls the ApplyGuardrail API to validate the Q&A.

Using the output from Automated Reasoning checks in the ApplyGuardrail response, the application enters a loop where each iteration checks the Automated Reasoning checks feedback, performs an action like asking the LLM to rewrite an answer based on the feedback, and then calls ApplyGuardrail to validate the updated content again.
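The snippet below is a minimal sketch of that first generate-then-validate step using boto3. The model ID, guardrail identifier, and version are placeholders, and the exact structure of the Automated Reasoning findings inside the assessments may differ from the simplified printout shown here.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder: any Bedrock model configured in the UI
GUARDRAIL_ID = "your-guardrail-id"                       # placeholder: guardrail with an Automated Reasoning policy
GUARDRAIL_VERSION = "DRAFT"

question = "How much does S3 storage cost in US East (N. Virginia)?"

# 1. Generate the initial answer with the configured LLM.
answer = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": question}]}],
)["output"]["message"]["content"][0]["text"]

# 2. Validate the Q&A pair: the question is marked as the query and the
#    LLM answer as the content to guard.
validation = bedrock.apply_guardrail(
    guardrailIdentifier=GUARDRAIL_ID,
    guardrailVersion=GUARDRAIL_VERSION,
    source="OUTPUT",
    content=[
        {"text": {"text": question, "qualifiers": ["query"]}},
        {"text": {"text": answer, "qualifiers": ["guard_content"]}},
    ],
)

# The assessments carry the Automated Reasoning findings that drive the rewriting loop.
for assessment in validation.get("assessments", []):
    print(assessment)
```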

The rewriting loop (The heart of the system)

After the initial validation, the system uses the output from the Automated Reasoning checks to decide the next step. First, it sorts the findings based on their priority – addressing the most important first: TRANSLATION_AMBIGUOUS, IMPOSSIBLE, INVALID, SATISFIABLE, VALID. Then, it selects the highest priority finding and addresses it with the logic below. Since VALID is last in the prioritized list, the system will only accept something as VALID after addressing the other findings.

  • For TRANSLATION_AMBIGUOUS findings, the Automated Reasoning checks return two interpretations of the input text. For SATISFIABLE findings, the Automated Reasoning checks return two scenarios that prove and disprove the claims. Using the feedback, the application asks the LLM to decide whether to rewrite the answer to clarify ambiguities or to ask the user follow-up questions to gather additional information. For example, the SATISFIABLE feedback may say that the price of $0.023 is valid only if the Region is US East (N. Virginia). The LLM can use this information to ask which Region the application runs in. When the LLM decides to ask follow-up questions, the loop pauses and waits for the user to answer the questions; the LLM then regenerates the answer based on the clarifications and the loop restarts.
  • For IMPOSSIBLE findings, the Automated Reasoning checks return a list of the rules that contradict the premises – accepted facts in the input content. Using the feedback, the application asks the LLM to rewrite the answer to avoid logical inconsistencies.
  • For INVALID findings, the Automated Reasoning checks return the rules from the Automated Reasoning policy that make the claims invalid based on the premises and policy rules. Using the feedback, the application asks the LLM to rewrite its answer so that it is consistent with the rules.
  • For VALID findings, the application exits the loop and returns the answer to the user.

After each answer rewrite, the system sends the Q&A to the ApplyGuardrail API for validation; the next iteration of the loop starts with the feedback from this call. Each iteration stores the findings and prompts with full context in the thread data structure, creating an audit trail of how the system arrived at the definitive answer.
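Below is a condensed sketch of that control flow. The helper callables (validate_qa, rewrite_with_llm, ask_user) and the dictionary shape of each finding are hypothetical stand-ins for the actual classes in the repository.

```python
# Priority order used to pick which finding to address first;
# VALID is last so the loop only exits once every other finding type is resolved.
PRIORITY = ["TRANSLATION_AMBIGUOUS", "IMPOSSIBLE", "INVALID", "SATISFIABLE", "VALID"]

def rewriting_loop(question, answer, max_iterations, validate_qa, rewrite_with_llm, ask_user):
    """Sketch of the rewriting loop; the callables are hypothetical stand-ins."""
    for _ in range(max_iterations):
        findings = validate_qa(question, answer)          # ApplyGuardrail call
        findings.sort(key=lambda f: PRIORITY.index(f["result"]))
        top = findings[0]

        if top["result"] == "VALID":
            return answer                                 # every finding is valid: exit the loop

        if top["result"] in ("TRANSLATION_AMBIGUOUS", "SATISFIABLE"):
            # The LLM decides whether to rewrite or to ask the user for clarification.
            decision = rewrite_with_llm(question, answer, top["feedback"], allow_questions=True)
            if decision["action"] == "ASK_QUESTIONS":
                clarifications = ask_user(decision["questions"])   # pauses the loop for user input
                answer = rewrite_with_llm(question, answer, clarifications)["answer"]
            else:
                answer = decision["answer"]
        else:
            # IMPOSSIBLE or INVALID: rewrite using the contradicting or violated policy rules.
            answer = rewrite_with_llm(question, answer, top["feedback"])["answer"]

    return answer  # max iterations reached; the audit log records this outcome
```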

Getting Started with the Automated Reasoning checks rewriting chatbot

To try our reference implementation, the first step is to create an Automated Reasoning policy:

  1. Navigate to Amazon Bedrock in the AWS Management Console in one of the supported Regions in the United States or Europe.
  2. From the left navigation, open the Automated Reasoning page in the Build category.
  3. Using the dropdown menu of the Create policy button, choose Create sample policy.
  4. Enter a name for the policy and then choose Create policy at the bottom of the page.

Once you have created a policy, you can proceed to download and run the reference implementation:

  1. Clone the Amazon Bedrock Samples repository.
  2. Follow the instructions in the README file to install dependencies, build the frontend, and start the application.
  3. Using your preferred browser, navigate to http://localhost:8080 and start testing.

Backend implementation details

If you’re planning to adapt this implementation for production use, this section goes over the key components in the backend architecture. You will find these components in the backend directory of the repository.

  • ThreadManager: Orchestrates conversation lifecycle management. It handles the creation, retrieval, and status tracking of conversation threads, maintaining proper state throughout the rewriting process. The ThreadManager implements thread-safe operations using a lock to help prevent race conditions when multiple operations attempt to modify the same conversation simultaneously. It also tracks threads awaiting user input and can identify stale threads that have exceeded a configurable timeout.
  • ThreadProcessor: Handles the rewriting loop using a state machine pattern for clear, maintainable control flow. The processor manages state transitions between phases like GENERATE_INITIAL, VALIDATE, CHECK_QUESTIONS, HANDLE_RESULT, and REWRITING_LOOP, progressing the conversation correctly through each stage.
  • ValidationService: Integrates with Amazon Bedrock Guardrails. This service takes each LLM-generated response and submits it for validation using the ApplyGuardrail API. It handles the communication with AWS, manages retry logic with exponential backoff for transient failures, and parses the validation results into structured findings.
  • LLMResponseParser: Interprets the LLM’s intentions during the rewriting loop. When the system asks the LLM to fix an invalid response, the model must decide whether to attempt a rewrite (REWRITE), ask clarifying questions (ASK_QUESTIONS), or declare the task impossible due to contradictory premises (IMPOSSIBLE). The parser examines the LLM’s response for specific markers like “DECISION:”, “ANSWER:”, and “QUESTION:”, extracting structured information from natural language output. It handles markdown formatting gracefully and enforces a limit on the number of questions (maximum 5); a simplified parsing sketch follows this list.
  • AuditLogger: Writes structured JSON logs to a dedicated audit log file, recording two key event types: VALID_RESPONSE when a response passes validation, and MAX_ITERATIONS_REACHED when the system exhausts the set number of retry attempts. Each audit entry captures the timestamp, thread ID, prompt, response, model ID, and validation findings. The logger also extracts and records Q&A exchanges from clarification iterations, including whether the user answered or skipped the questions.
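As referenced above, here is a minimal sketch of the kind of marker-based parsing the LLMResponseParser performs. The regular expressions, default decision, and question limit are illustrative assumptions, not the exact implementation.

```python
import re

MAX_QUESTIONS = 5  # the parser enforces a cap on clarifying questions

def parse_llm_decision(text: str) -> dict:
    """Extract the decision, rewritten answer, and follow-up questions from LLM output."""
    # Strip simple markdown emphasis so markers like **DECISION:** still match.
    cleaned = re.sub(r"[*_`]", "", text)

    decision_match = re.search(r"DECISION:\s*(REWRITE|ASK_QUESTIONS|IMPOSSIBLE)", cleaned)
    answer_match = re.search(r"ANSWER:\s*(.+?)(?=\n[A-Z]+:|\Z)", cleaned, re.DOTALL)
    questions = re.findall(r"QUESTION:\s*(.+)", cleaned)

    return {
        "decision": decision_match.group(1) if decision_match else "REWRITE",
        "answer": answer_match.group(1).strip() if answer_match else None,
        "questions": questions[:MAX_QUESTIONS],
    }
```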

Together, these components help create a robust foundation for building trustworthy AI applications that combine the flexibility of large language models with the rigor of mathematical verification.

For detailed guidance on implementing Automated Reasoning checks in production, refer to the Amazon Bedrock documentation.


About the authors

Stefano Buliani

Stefano is a Product Manager on the Automated Reasoning team at AWS. With over 10 years at AWS, he has worked on serverless technologies, including open source projects like Serverless Java Container, and has helped customers deploy hundreds of applications to production.
