Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on the knowledge of a large language model (LLM). Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to.
Enterprise data is often distributed across different sources, such as documents in Amazon Simple Storage Service (Amazon S3) buckets, database engines, websites, and more. In this post, we demonstrate how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.
For this example, we use two data sources (websites). The first data source is an employee onboarding guide from a fictitious company, which requires basic authentication. We demonstrate how to set up authentication for the Web Crawler. The second data source is the official documentation for Amazon Q Business. For this data source, we demonstrate how to apply advanced settings, such as regular expressions, to instruct the Web Crawler to crawl only pages and links related to Amazon Q Business, ignoring pages related to other AWS services.
The Amazon Q Web Crawler connector makes it possible to crawl websites that use HTTPS and index their contents so you can build a generative artificial intelligence (AI) experience for your users based on the indexed data. This connector relies on the Selenium Web Crawler Package and a Chromium driver. The connector is fully managed and updates to these components are applied automatically without your intervention.
This connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, and each connector has its own properties and entities that it considers documents. In the context of the Web Crawler connector, a document is the contents of a single page or a single attachment. Separately, an index is commonly described as a corpus of documents; think of it as the place where you add and sync your documents for Amazon Q Business to use for generating answers to user requests.
Each document has its own attributes, also known as metadata. Metadata can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes. For example, there might be use cases where you want to give more relevance to results from a specific category, department, or creation date.
Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index. To learn more, see Mapping document attributes in Amazon Q Business.
For a better understanding of what is indexed by the Web Crawler connector, we present a list of metadata indexed from webpages and attachments.
The following table lists webpage metadata indexed by the Amazon Q Web Crawler connector.
Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type |
Category | category | _category | String |
URL | sourceUrl | _source_uri | String |
Title | title | _document_title | String |
Meta Tags | metaTags | wc_meta_tags | String List |
File Size | htmlSize | wc_html_size | Long (numeric) |
The following table lists attachments metadata indexed by the Amazon Q Web Crawler connector.
Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type |
Category | category | _category | String |
URL | sourceUrl | _source_uri | String |
File Name | fileName | wc_file_name | String |
File Type | fileType | wc_file_type | String |
File Size | fileSize | wc_file_size | Long (numeric) |
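To make the two tables above easier to work with, the following minimal sketch restates them as Python dictionaries and translates crawled metadata into index field names. The function name `to_index_fields` and the sample metadata are hypothetical and exist only for illustration; the connector performs this mapping for you.

```python
# Data source field -> reserved index field, copied from the tables above.
WEBPAGE_FIELD_MAP = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "title": "_document_title",
    "metaTags": "wc_meta_tags",
    "htmlSize": "wc_html_size",
}

ATTACHMENT_FIELD_MAP = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "fileName": "wc_file_name",
    "fileType": "wc_file_type",
    "fileSize": "wc_file_size",
}


def to_index_fields(metadata: dict, field_map: dict) -> dict:
    """Rename known data source fields to index fields; drop unknown keys."""
    return {field_map[key]: value for key, value in metadata.items() if key in field_map}


# Hypothetical metadata for a crawled webpage (illustrative values only).
sample_page = {"title": "Welcome", "htmlSize": 2048, "unknown": "ignored"}
print(to_index_fields(sample_page, WEBPAGE_FIELD_MAP))
```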
When configuring the data source for your website, you can use URLs or sitemaps, which can be defined either manually or using a text file stored in Amazon S3.
To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards:
Unlike other data source connectors, the Amazon Q Web Crawler connector doesn’t support access control list (ACL) crawling or identity crawling.
Lastly, you have a range of options for configuring how and what data is synchronized. For example, you can choose to synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. Additionally, you can use regular expressions to filter which URLs to include or exclude in the crawling process.
On a high level, this solution consists of an Amazon Q Business application that utilizes two data sources: a website hosting documents related to an employee onboarding guide, and the Amazon Q Business official documentation website. This solution demonstrates how to configure both websites as data sources for the Amazon Q Business application. The following steps will be performed:
You can follow along using one or both data sources provided in this post or try your own URLs.
To follow along with this demo, you should have the following prerequisites:
Deploying this CloudFormation template is optional, but we recommend using it so you can learn more about how the Web Crawler connector works with websites that require authentication.
We start by deploying a CloudFormation template. This template will create a simple static website secured with basic authentication.
https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16532/template-website.yml
onboarding-website-for-q-business-sample
The deployment process will take a few minutes to complete. You can move to the next section of this post while it’s in process. Keep this tab open, because you’ll need to refer to the Outputs tab later.
Before you start creating Amazon Q Business applications, you are required to enable and configure an IAM Identity Center instance. This step is mandatory because Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business applications. If you don’t have an IAM Identity Center instance set up when trying to create your first application, you will see the option to create one, as shown in the following screenshot.
If you already have an IAM Identity Center instance set up, you’re ready to start creating your first application by following these steps:
my-q-business-app
Specify 1 for Number of units. One unit can index 20,000 documents (a document in this context is either a single page of content or a single attachment).

After you complete the steps in the previous section, you should see the Connect data sources page, as shown in the following screenshot.
If you closed the tab by accident, you can get to this page by navigating to the Amazon Q Business console, choosing your application name, and then choosing Add data source.
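As a rough aid for sizing the index, the following sketch estimates how many units a given document count would need, using the figure of 20,000 documents per unit stated earlier. The function name `units_needed` and the minimum of one unit are assumptions made for illustration, not documented service behavior.

```python
import math

# From the post: one index unit can hold 20,000 documents.
DOCUMENTS_PER_UNIT = 20_000


def units_needed(document_count: int) -> int:
    """Estimate index units for a document count (assumes at least one unit)."""
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    return max(1, math.ceil(document_count / DOCUMENTS_PER_UNIT))


# Hypothetical site with 45,000 pages and attachments.
print(units_needed(45_000))
```

For the small websites used in this walkthrough, a single unit is enough.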
Let’s create the data source for the Amazon Q Business documentation website:
q-business-documentation
https://docs.aws.amazon.com/amazonq/
Starting point URLs can be added directly in this UI (up to 10), or you could use a file hosted in Amazon S3 to list up to 100 starting point URLs. Likewise, sitemap URLs can be added in this UI (up to three), or you could add up to three sitemap XML files hosted in Amazon S3.
We refer to source URLs as starting point URLs; later in this post, you’ll have the opportunity to define what gets crawled, for example, domains and subdomains that the webpages might link to. It’s worth mentioning that the Web Crawler connector can only work with HTTPS.
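Before entering starting point URLs, it can help to check them against the constraints just described: HTTPS only, up to 10 URLs in the console, and up to 100 in a file hosted in Amazon S3. The following is a minimal local check, not part of the connector; the function name `validate_starting_urls` and its `from_s3_file` flag are hypothetical.

```python
from urllib.parse import urlparse

# Limits stated in the post for starting point URLs.
MAX_URLS_IN_CONSOLE = 10
MAX_URLS_IN_S3_FILE = 100


def validate_starting_urls(urls, from_s3_file=False):
    """Return a list of human-readable problems; an empty list means no issues found."""
    limit = MAX_URLS_IN_S3_FILE if from_s3_file else MAX_URLS_IN_CONSOLE
    problems = []
    if len(urls) > limit:
        problems.append(f"{len(urls)} URLs exceed the limit of {limit}")
    for url in urls:
        parsed = urlparse(url)
        # The Web Crawler connector only works with HTTPS.
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"not an HTTPS URL: {url}")
    return problems


print(validate_starting_urls(["https://docs.aws.amazon.com/amazonq/"]))
print(validate_starting_urls(["http://example.com/"]))
```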
If you open the Amazon Q official documentation, you’ll see that there are links to Amazon Q Developer documentation and other AWS services. Because we’re only interested in crawling Amazon Q Business, we need to instruct the crawler to focus only on relevant links and pages related to Amazon Q Business. To achieve this, we use regular expressions to define exactly what URLs the crawler should crawl.
^https://docs.aws.amazon.com/amazonq/$
^https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/.*.html$
^https://docs.aws.amazon.com/amazonq/latest/business-use-dg/.*.html$
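To see what these inclusion patterns accept, you can try them locally before running a sync. The following sketch uses Python's `re` module, which may differ in small ways from the regular expression engine used by the connector; the sample URLs and the function name `is_included` are hypothetical. Note that the dots in the patterns are unescaped, so they match any character; a stricter version would write `\.` for literal dots.

```python
import re

# Inclusion patterns copied verbatim from the post.
INCLUDE_PATTERNS = [
    r"^https://docs.aws.amazon.com/amazonq/$",
    r"^https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/.*.html$",
    r"^https://docs.aws.amazon.com/amazonq/latest/business-use-dg/.*.html$",
]


def is_included(url: str, patterns=INCLUDE_PATTERNS) -> bool:
    """Return True if the URL matches at least one inclusion pattern."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical sample URLs used only to exercise the patterns.
samples = [
    "https://docs.aws.amazon.com/amazonq/",
    "https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html",
    "https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html",
]
for url in samples:
    print(url, "->", "crawl" if is_included(url) else "skip")
```

With these patterns, the Amazon Q landing page and pages under the Amazon Q Business guides are accepted, while the Amazon Q Developer page is skipped.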
Choosing this option means you must manually run the sync operation; this option is suitable given the simplicity of this example. For production workloads, you’ll want to define a schedule tailored to your needs, for example, hourly, daily, or weekly, or you could define your own schedule using a cron expression.
The default values in the Field mappings section can’t be changed at this point. They can only be modified after the application and retriever have been created.
After the data source is created, you will be shown the same interface you saw at the beginning of this section, with the note that one Web Crawler data source has been added. Keep this tab open, because you’ll create a second data source for the employee onboarding guide in the next section.
Complete the following steps to create your second data source:
Although unlikely, if the URL isn’t working, it might be because Amazon CloudFront hasn’t finished replicating the website. In that case, you should wait a couple of minutes and try again.
You should now be able to browse the employee onboarding guide. Take a few minutes to get familiar with the contents of the website, because you’ll be asking your Amazon Q Business application questions about this content in a later step.
onboarding-guide
These credentials will be stored as a secret in AWS Secrets Manager.
Depending on the type of authentication you use, you’ll need certain fields present in your secret, as shown in the following table.
Authentication Type | Fields present in secret |
Form based | username, password, userNameFieldXpath, passwordFieldXpath, passwordButtonXpath, loginPageUrl |
NTLM | username, password |
Basic auth | username, password |
No Authentication | NA |
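The table above can be expressed as a small lookup that checks whether a secret contains the fields required for a given authentication type. This is a sketch under assumptions: it assumes the secret is stored as a JSON object of key/value pairs, and the short keys (`form`, `ntlm`, `basic`, `none`), the function name `build_secret_string`, and the placeholder credentials are hypothetical. In this walkthrough the console creates the secret for you.

```python
import json

# Required secret fields per authentication type, taken from the table above.
# The short dictionary keys are labels chosen for this sketch.
REQUIRED_SECRET_FIELDS = {
    "form": [
        "username",
        "password",
        "userNameFieldXpath",
        "passwordFieldXpath",
        "passwordButtonXpath",
        "loginPageUrl",
    ],
    "ntlm": ["username", "password"],
    "basic": ["username", "password"],
    "none": [],
}


def build_secret_string(auth_type: str, **fields) -> str:
    """Return a JSON string with the required fields, or raise if any are missing."""
    required = REQUIRED_SECRET_FIELDS[auth_type]
    missing = [name for name in required if not fields.get(name)]
    if missing:
        raise ValueError(f"missing fields for {auth_type} authentication: {missing}")
    return json.dumps({name: fields[name] for name in required})


# Placeholder credentials for illustration only; never hard-code real ones.
print(build_secret_string("basic", username="example-user", password="example-password"))
```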
After changes are applied, the Connect data sources page shows two Web Crawler data sources have been added.
We have added our two data sources. In the next section, we add groups and users to our Amazon Q Business application.
Complete the following steps to add groups and users:
If you’ve completed the prerequisite of setting up IAM Identity Center, you’ve likely added at least one user. Although it’s not mandatory, we recommend creating multiple users and groups. This will enable you to fully explore and understand all the features of Amazon Q Business beyond what’s covered in this post.
If you haven’t added any users to your Identity Center directory, you can create them here by choosing Add new users. However, you’ll need to complete additional steps, such as setting up their passwords on the IAM Identity Center console. To fully benefit from this tutorial, we recommend having active users and groups by the time you reach this step.
If you added a group, you’ll see it on the Groups tab. If you added a user, you’ll see it on the Users tab.
The next step is choosing a subscription for your groups or users.
This is a good time to get familiar with the Amazon Q Business subscription tiers and pricing. For this example, we use Q Business Pro, but you could also use a Q Business Lite subscription.
A web experience is the chat interface that your users will utilize to ask questions and perform tasks.
After the application is created successfully, you’ll be redirected to the Amazon Q Business console, where you can see your new application. Your application is ready, but the data sources haven’t synced any data yet. We’ll do that in the next steps.
You will see the Current sync state for both data sources as Syncing. This process might take several minutes.
After the data sources are synced, you will see their Last sync status as Completed.
You’re now ready to test your application! Keep this page open because you’ll need it for next steps.
At this point, you have created an Amazon Q Business application, added two data sources using the Amazon Q Web Crawler connector, added users to the application, and synchronized all data sources.
The next step is to go through the full user experience: logging in to the application and running a few queries to test it.
You’ll be redirected to the AWS access portal URL, which is set up by IAM Identity Center.
You’re now on your Amazon Q Business app and ready to start asking questions!
For this example, we start by asking questions related to the employee onboarding website.
Amazon Q Business uses the onboarding guide data source you created earlier. If you choose Sources, you’ll see a list of in-text source citations in the form of a numbered list.
Now we ask questions related to the Amazon Q Business documentation.
Try it out with your own prompts!
In this section, we discuss several common issues and how to troubleshoot:
Although not covered in this post, we recommend exploring document enrichment. This functionality allows you to manipulate and enrich document attributes prior to being added to an index. The following are a couple of ideas for advanced applications of document enrichment:
After you finish testing the solution, clean up the resources you created as part of it to avoid incurring extra costs.
Let’s start by deleting the Amazon Q Business application.
You might be asked to complete an optional survey on your reasons for application deletion. You can select multiple reasons (or none), then choose Submit.
The next step is to delete the CloudFormation stack responsible for deploying the employee onboarding website we used as a data source.
The stack deletion might take a few minutes. When the deletion is complete, you’ll see the stack has been removed from your list of stacks.
Optionally, if you enabled IAM Identity Center only for this tutorial and want to delete your IAM Identity Center instance, follow these steps:
The Amazon Q Business Web Crawler allows you to connect websites to your Amazon Q Business applications. This connector supports multiple forms of authentication (if required by your website) and can run sync jobs on a defined schedule.
To learn more about Amazon Q Business and its features, refer to the Amazon Q Business Developer Guide. For a comprehensive list of what can be done with this connector, refer to Connecting Web Crawler to Amazon Q Business.