This post was co-written with Saurabh Gupta and Todd Colby from Pushpay.
Pushpay is a market-leading digital giving and engagement platform designed to help churches and faith-based organizations drive community engagement, manage donations, and strengthen generosity through efficient fundraising processes. Pushpay’s church management system provides church administrators and ministry leaders with insight-driven reporting, donor development dashboards, and automation of financial workflows.
Using the power of generative AI, Pushpay developed an innovative agentic AI search feature built for the unique needs of ministries. The approach uses natural language processing so ministry staff can ask questions in plain English and generate real-time, actionable insights from their community data. The AI search feature addresses a critical challenge faced by ministry leaders: the need for quick access to community insights without requiring technical expertise. For example, ministry leaders can enter “show me people who are members in a group, but haven’t given this year” or “show me people who are not engaged in my church,” and use the results to take meaningful action to better support individuals in their community. Most community leaders are time-constrained and lack technical backgrounds; they can use this solution to obtain meaningful data about their congregations in seconds using natural language queries.
By empowering ministry staff with faster access to community insights, the AI search feature supports Pushpay’s mission to encourage generosity and connection between churches and their community members. Early adopters report that the solution has shortened their time to insights from minutes to seconds. To achieve this result, the Pushpay team built the feature using agentic AI capabilities on Amazon Web Services (AWS) while implementing robust quality assurance measures and establishing a rapid, iterative feedback loop for continuous improvement.
In this post, we walk you through Pushpay’s journey in building this solution and explore how Pushpay used Amazon Bedrock to create a custom generative AI evaluation framework for continuous quality assurance and establishing rapid iteration feedback loops on AWS.
Solution architecture
The solution consists of several key components that work together to deliver an enhanced search experience. The following figure shows the solution architecture diagram and the overall workflow.
Figure 1: AI Search Solution Architecture
To create the AI search feature, Pushpay developed a first iteration of the AI search agent. The solution implements a single agent configured with a carefully tuned system prompt that defines the agent’s role, provides instructions, and explains how the user interface works, including a detailed description of each filter tool and its sub-settings. The system prompt is cached using Amazon Bedrock prompt caching to reduce token cost and latency. The agent uses the system prompt to invoke a large language model (LLM) on Amazon Bedrock, which generates a JSON document that Pushpay’s application uses to apply filters and present query results to users.
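To make this pattern concrete, the following is a minimal sketch of how a cached system prompt and filter-JSON generation could be wired together with the Amazon Bedrock Converse API. The model ID, prompt text, and filter names are illustrative assumptions, not Pushpay’s actual implementation.

```python
import json
import boto3

# Hypothetical sketch: call a Bedrock model with a cached system prompt and parse the
# JSON filter document from the response. Model ID, prompt, and schema are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

SYSTEM_PROMPT = (
    "You are a search assistant for a church management application. "
    "Translate the user's question into a JSON document of filters. "
    "Available filters include group_membership, giving_last_year, engagement_level, ... "
    "Respond with the JSON document only."
)

def build_filters(user_query: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any model that supports prompt caching
        system=[
            {"text": SYSTEM_PROMPT},
            # The cache point marks the static prefix so Bedrock can reuse it across
            # requests, reducing input-token cost and latency.
            {"cachePoint": {"type": "default"}},
        ],
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    # Assumes the model follows the instruction to return bare JSON.
    return json.loads(response["output"]["message"]["content"][0]["text"])

print(build_filters("show me people who are members in a group, but haven't given this year"))
```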
However, this first iteration quickly revealed limitations. Although it achieved a 60-70% success rate on basic business queries, the team hit an accuracy plateau. Evaluating the agent was a manual and tedious process, and tuning the system prompt beyond this accuracy threshold proved challenging given the diverse spectrum of user queries and the application’s coverage of over 100 distinct configurable filters. These issues became critical blockers on the team’s path to production.
Figure 2: AI Search First Solution
To address the challenges of measuring and improving agent accuracy, the team implemented a generative AI evaluation framework integrated into the existing architecture, shown in the following figure. This framework consists of four key components that work together to provide comprehensive performance insights and enable data-driven improvements.
Figure 3: Introducing the GenAI Evaluation Framework
The accuracy dashboard: Pinpointing weaknesses by domain
Because user queries are categorized into domains, the dashboard displays accuracy metrics and query volumes at the domain level, with statistical confidence visualized using a 95% Wilson score interval. By using categories, the team can pinpoint the AI agent’s weaknesses by domain. In the following example, the “activity” domain shows significantly lower accuracy than the other categories.
Figure 4: Pinpointing Agent Weaknesses by Domain
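As an illustration of the underlying calculation, here is a small sketch of the 95% Wilson score interval that can be used to visualize confidence for each domain’s accuracy; the sample counts below are made up.

```python
from math import sqrt

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a per-domain accuracy estimate."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# A low-volume domain yields a wide interval, signaling that its accuracy estimate
# is less trustworthy than a high-volume domain with the same point estimate.
print(wilson_interval(successes=18, total=20))    # roughly (0.70, 0.97)
print(wilson_interval(successes=180, total=200))  # roughly (0.85, 0.93)
```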
Additionally, a performance dashboard, shown in the following figure, visualizes latency indicators at the domain category level, including latency distributions from p50 to p90 percentiles. In the following example, the activity domain exhibits notably higher latency than others.
Figure 5: Identifying Latency Bottlenecks by Domain
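The per-domain percentiles behind such a dashboard can be computed directly from evaluation traces. The following sketch uses an assumed trace record shape and made-up latencies, not Pushpay’s actual telemetry schema.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical evaluation-trace records; in practice these would come from the
# tracing pipeline, one record per evaluated query.
traces = [
    {"domain": "giving", "latency_ms": 820},
    {"domain": "giving", "latency_ms": 1150},
    {"domain": "giving", "latency_ms": 980},
    {"domain": "activity", "latency_ms": 2400},
    {"domain": "activity", "latency_ms": 3100},
    {"domain": "activity", "latency_ms": 2750},
]

by_domain = defaultdict(list)
for trace in traces:
    by_domain[trace["domain"]].append(trace["latency_ms"])

for domain, latencies in sorted(by_domain.items()):
    # quantiles(n=10) returns the nine deciles; index 4 is p50 and index 8 is p90.
    deciles = quantiles(latencies, n=10)
    print(f"{domain}: p50={deciles[4]:.0f} ms, p90={deciles[8]:.0f} ms")
```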
Strategic rollout through domain-level insights
Domain-based metrics revealed varying performance levels across semantic domains, providing crucial insight into agent effectiveness. Pushpay used this granular visibility to make strategic feature rollout decisions. By temporarily suppressing underperforming categories, such as activity queries, while they were being optimized, the system achieved 95% overall accuracy. With this approach, users experienced only the highest-performing features while the team refined the others to production standards.
Figure 6: Achieving 95% Accuracy with Domain-Level Feature Rollout
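A simple way to picture this gating is a routing check against per-domain accuracy taken from the dashboard. The domain names, threshold, and fallback below are assumptions for illustration, not Pushpay’s implementation.

```python
# Hypothetical sketch of domain-level feature gating: only domain categories whose
# measured accuracy clears a launch threshold are routed to the AI agent; everything
# else falls back to the existing search experience.
ACCURACY_THRESHOLD = 0.90

domain_accuracy = {          # point estimates taken from the accuracy dashboard
    "giving": 0.96,
    "membership": 0.94,
    "activity": 0.71,        # underperforming; temporarily suppressed
}

enabled_domains = {d for d, acc in domain_accuracy.items() if acc >= ACCURACY_THRESHOLD}

def route_query(domain: str, query: str) -> str:
    """Route a classified query to the AI agent only if its domain is enabled."""
    if domain in enabled_domains:
        return f"AI agent handles: {query!r}"
    return f"Fallback search handles: {query!r}"

print(route_query("activity", "who attended an event last month?"))
print(route_query("giving", "who hasn't given this year?"))
```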
Strategic prioritization: Focusing on high-impact domains
To prioritize improvements systematically, Pushpay employed a 2×2 matrix framework that plots topics along two dimensions (shown in the following figure): business priority on the vertical axis and current performance or feasibility on the horizontal axis. This visualization places topics with both high business value and strong existing performance in the top-right quadrant. The team focused on these areas first because they required the least effort to lift accuracy from already-good levels to an exceptional 95% on the business-focused topics.
The implementation followed an iterative cycle: after each round of enhancements, the team re-analyzed the results to identify the next set of high-potential topics. This systematic, cyclical approach enabled continuous optimization while maintaining focus on business-critical areas.
Figure 7: Strategic Prioritization Framework for Domain Category Optimization
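One lightweight way to express this prioritization in code is to score each topic on both axes and bucket it into a quadrant. The topic names and scores here are invented for the example.

```python
# Illustrative 2x2 prioritization: bucket topics by business priority and current
# performance, and work the top-right quadrant (high value, already good) first.
topics = [
    {"name": "giving history",   "business_priority": 0.9, "performance": 0.88},
    {"name": "group membership", "business_priority": 0.8, "performance": 0.92},
    {"name": "activity",         "business_priority": 0.7, "performance": 0.65},
    {"name": "demographics",     "business_priority": 0.3, "performance": 0.90},
]

def quadrant(topic: dict, cutoff: float = 0.75) -> str:
    high_value = topic["business_priority"] >= cutoff
    high_perf = topic["performance"] >= cutoff
    if high_value and high_perf:
        return "top-right: optimize now (smallest lift to 95%)"
    if high_value:
        return "top-left: high value, needs heavier investment"
    if high_perf:
        return "bottom-right: already good, lower business value"
    return "bottom-left: defer"

for t in sorted(topics, key=lambda t: (t["business_priority"], t["performance"]), reverse=True):
    print(f'{t["name"]}: {quadrant(t)}')
```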
Dynamic prompt construction
The insights gained from the evaluation framework led to an architectural enhancement: the introduction of a dynamic prompt constructor. This component enabled rapid iterative improvements by allowing fine-grained control over which domain categories the agent could address. The structured field inventory, previously embedded in the system prompt, became a dynamic element that uses semantic search to construct a contextually relevant prompt for each user query. This approach tailors the prompt’s filter inventory based on three key contextual dimensions: query content, user persona, and tenant-specific requirements. The result is a more precise and efficient system that generates highly relevant responses while maintaining the flexibility needed for continuous optimization.
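The sketch below outlines one possible shape for such a dynamic prompt constructor, using Amazon Titan Text Embeddings to rank filter definitions by relevance to the query. The filter catalog, model IDs, and top-k choice are assumptions rather than Pushpay’s implementation, and in production the catalog embeddings would typically be precomputed and stored in a vector index rather than recomputed per query.

```python
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def embed(text: str) -> list[float]:
    """Embed a string with Amazon Titan Text Embeddings V2 (illustrative model choice)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical filter catalog, already narrowed to what this persona and tenant can use.
filter_catalog = [
    {"name": "group_membership", "description": "Whether a person belongs to a group"},
    {"name": "giving_last_year", "description": "Total gifts given in the last 12 months"},
    {"name": "event_attendance", "description": "Events a person has checked in to"},
    # ... remaining filters available to this persona and tenant
]

def build_system_prompt(user_query: str, base_prompt: str, top_k: int = 5) -> str:
    """Append only the filters most relevant to this query to the base system prompt."""
    query_vec = embed(user_query)
    ranked = sorted(
        filter_catalog,
        key=lambda f: cosine(query_vec, embed(f["description"])),  # precompute in practice
        reverse=True,
    )
    inventory = "\n".join(f'- {f["name"]}: {f["description"]}' for f in ranked[:top_k])
    return f"{base_prompt}\n\nRelevant filters:\n{inventory}"
```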
The generative AI evaluation framework became the cornerstone of Pushpay’s AI feature development, delivering measurable value across three dimensions.
The following are key takeaways from Pushpay’s experience that you can use in your own AI agent journey.
1/ Build with production in mind from day one
Building agentic AI systems is straightforward, but scaling them to production is challenging. Developers should adopt a scaling mindset during the proof-of-concept phase, not after it. Implementing robust tracing and evaluation frameworks early provides a clear pathway from experimentation to production, and teams can identify and address accuracy issues systematically before they become blockers.
2/ Take advantage of the advanced features of Amazon Bedrock
Amazon Bedrock prompt caching significantly reduces token costs and latency by caching frequently used system prompts. For agents with large, stable system prompts, this feature is essential for production-grade performance.
3/ Think beyond aggregate metrics
Aggregate accuracy scores can sometimes mask critical performance variations. By evaluating agent performance at the domain category level, Pushpay uncovered weaknesses beyond what a single accuracy metric can capture. This granular approach enables targeted optimization and informed rollout decisions, making sure users only experience high-performing features while others are refined.
4/ Data security and responsible AI
When developing agentic AI systems, address information protection and LLM security from the outset, following the AWS Shared Responsibility Model, because security requirements fundamentally shape the architectural design. Pushpay’s customers are churches and faith-based organizations that are stewards of sensitive information, including pastoral care conversations, financial giving patterns, family struggles, prayer requests, and more. In this implementation, Pushpay set a clear approach to incorporating AI ethically within its product ecosystem, maintaining strict security standards so that church data and personally identifiable information (PII) remain within its secure partnership ecosystem. Data is shared only with secure and appropriate data protections applied and is never used to train external models. To learn more about Pushpay’s standards for incorporating AI within their products, visit the Pushpay Knowledge Center for a more in-depth review of company standards.
Pushpay’s journey from a 60–70% accuracy prototype to a 95% accurate, production-ready AI agent demonstrates that building reliable agentic AI systems requires more than sophisticated prompts: it demands a scientific, data-driven approach to evaluation and optimization. The key breakthrough wasn’t in the AI technology itself, but in implementing a comprehensive evaluation framework, built on a strong observability foundation, that provided granular visibility into agent performance across different domains. This systematic approach enabled rapid iteration, strategic rollout decisions, and continuous improvement.
Ready to build your own production-ready AI agent?