Enterprises are facing challenges in accessing their data assets scattered across various sources because of increasing complexities in managing vast amount of data. Traditional search methods often fail to provide comprehensive and contextual results, particularly for unstructured data or complex queries.
Search solutions in modern big data management must facilitate efficient and accurate search of enterprise data assets that can adapt to the arrival of new assets. Customers want to search through all of the data and applications across their organization, and they want to see the provenance information for all of the documents retrieved. The application needs to search through the catalog and show the metadata information related to all of the data assets that are relevant to the search context. To accomplish all of these goals, the solution should include the following features:
In this post, we present a generative AI-powered semantic search solution that empowers business users to quickly and accurately find relevant data assets across various enterprise data sources. In this solution, we integrate large language models (LLMs) hosted on Amazon Bedrock backed by a knowledge base that is derived from a knowledge graph built on Amazon Neptune to create a powerful search paradigm that enables natural language-based questions to integrate search across documents stored in Amazon Simple Storage Service (Amazon S3), data lake tables hosted on the AWS Glue Data Catalog, and enterprise assets in Amazon DataZone.
Foundation models (FMs) on Amazon Bedrock provide powerful generative models for text and language tasks. However, FMs lack domain-specific knowledge and reasoning capabilities. Knowledge graphs available on Neptune provide a means to represent interconnected facts and entities with inferencing and reasoning abilities for domains. Equipping FMs with structured reasoning abilities using domain-specific knowledge graphs harnesses the best of both approaches. This allows FMs to retain their inductive abilities while grounding their language understanding and generation in well-structured domain knowledge and logical reasoning. In the context of enterprise data asset search powered by a metadata catalog hosted on services such Amazon DataZone, AWS Glue, and other third-party catalogs, knowledge graphs can help integrate this linked data and also enable a scalable search paradigm that integrates metadata that evolves over time.
The solution integrates with your existing data catalogs and repositories, creating a unified, scalable semantic layer across the entire data landscape. When users ask questions in plain English, the search is not just for keywords; it comprehends the query’s intent and context, relating it to relevant tables, documents, and datasets across your organization. This semantic understanding enables more accurate, contextual, and insightful search results, making the entire company’s data as accessible and simple to search as using a consumer search engine, but with the depth and specificity your business demands. This significantly enhances decision-making, efficiency, and innovation throughout your organization by unlocking the full potential of your data assets. The following video shows the sample working solution.
Using graph data processing and the integration of natural language-based search on embedded graphs, these hybrid systems can unlock powerful insights from complex data structures.
The solution presented in this post consists of an ingestion pipeline and a search application UI that the user can submit queries to in natural language while searching for data assets.
The following diagram illustrates the end-to-end architecture, consisting of the metadata API layer, ingestion pipeline, embedding generation workflow, and frontend UI.
The ingestion pipeline (3) ingests metadata (1) from services (2), including Amazon DataZone, AWS Glue, and Amazon Athena, to a Neptune database after converting the JSON response from the service APIs into an RDF triple format. The RDF is converted into text and loaded into an S3 bucket, which is accessed by Amazon Bedrock (4) as the source of the knowledge base. You can extend this solution to include metadata from third-party cataloging solutions as well. The end-users access the application, which is hosted on Amazon CloudFront (5).
A state machine in AWS Step Functions defines the workflow of the ingestion process by invoking AWS Lambda functions, as illustrated in the following figure.
The functions perform the following actions:
For more details about RDF data format, refer to the W3C documentation.
Amazon Bedrock Knowledge Bases is configured to use the preceding S3 bucket as a data source to create a knowledge base. Amazon Bedrock Knowledge Bases creates vector embeddings from the text files using the Amazon Titan Text Embeddings v2 model.
A Streamlit application is hosted in Amazon Elastic Container Service (Amazon ECS) as a task, which provides a chatbot UI for users to submit queries against the knowledge base in Amazon Bedrock.
The following are prerequisites to deploy the solution:
A sample dataset is needed for testing the functionalities of the solution. In your AWS account, prepare a table using Amazon DataZone and Athena completing Step 1 through Step 8 in Amazon DataZone QuickStart with AWS Glue data. This will create a table and capture its metadata in the Data Catalog and Amazon DataZone.
To test how the solution is combining metadata from different data catalogs, create another table only in the Data Catalog, not in Amazon DataZone. On the Athena console, open the query editor and run the following query to create a new table:
Complete the following steps to deploy the application:
Parameter Name | Description | Default Value |
EnvironmentName | Unique name to distinguish different web applications in the same AWS account (min length 1 and max length 4). | dev |
S3DataPrefixKB | S3 object prefix where the knowledge base source documents (metadata files) should be stored. | knowledge_base |
Cpu | CPU configuration of the ECS task. | 512 |
Memory | Memory configuration of the ECS task. | 1024 |
ContainerPort | Port for the ECS task host and container. | 80 |
DesiredTaskCount | Number of desired ECS task count. | 1 |
MinContainers | Minimum containers for auto scaling. Should be less than or equal to DesiredTaskCount. | 1 |
MaxContainers | Maximum containers for auto scaling. Should be greater than or equal to DesiredTaskCount. | 3 |
AutoScalingTargetValue | CPU utilization target percentage for ECS task auto scaling. | 80 |
The CloudFormation stack creates the required resources to launch the application by invoking a series of nested stacks. It deploys the following resources in your AWS account:
After the CloudFormation stack is deployed, a Step Functions workflow will run automatically that orchestrates the metadata extract, transform, and load (ETL) job, and stores the final results in Amazon S3. View the execution status and details of the workflow by fetching the state machine Amazon Resource Name (ARN) from the CloudFormation stack. If AWS Lake Formation is enabled for the AWS Glue databases and tables in the account, complete the following steps after the CloudFormation stack is deployed to update the permission and extract the metadata details from AWS Glue and update the metadata details to load to the knowledge base:
You can search for the application stack name <MainStackName>-deploy-<EnvironmentName> (for example, mm-enterprise-search-deploy-dev) on the AWS CloudFormation console. Locate the web application URL in the stack outputs (CloudfrontURL). Launch the web application by choosing the URL link.
You can access the application from a web browser using the domain name of the Amazon CloudFront distribution created in the deployment steps. Log in using a user credential that exists in the Amazon Cognito user pool.
Now you can submit a query using a text input. The AWS account used in this example contains sample tables related to sales and marketing. We ask the question, “How to query sales data?” The answer includes metadata on the table mkt_sls_table that was created in the previous steps.
We ask another question: “How to get customer names from sales data?” In the previous steps, we created the raw_customer table, which wasn’t published as a data asset in Amazon DataZone. The table only exists in the Data Catalog. The application returns an answer that combines metadata from Amazon DataZone and AWS Glue.
This powerful solution opens up exciting possibilities for enterprise data discovery and insights. We encourage you to deploy it in your own environment and experiment with different types of queries across your data assets. Try combining information from multiple sources, asking complex questions, and see how the semantic understanding improves your search experience.
The total cost of running this setup is less than $10 per day. However, we recommend deleting the CloudFormation stack after use because the deployed resources incur costs. Deleting the main stack also deletes all the nested stacks except the VPC because of dependency. You also need to delete the VPC from the Amazon VPC console.
In this post, we presented a comprehensive and extendable multimodal search solution of enterprise data assets. The integration of LLMs and knowledge graphs shows that by combining the strengths of these technologies, organizations can unlock new levels of data discovery, reasoning, and insight generation, ultimately driving innovation and progress across a wide range of domains.
To learn more about LLM and knowledge graph use cases, refer to the following resources:
Understanding what's happening behind large language models (LLMs) is essential in today's machine learning landscape.
AI accelerationists have won as a consequence of the election, potentially sidelining those advocating for…
L'Oréal's first professional hair dryer combines infrared light, wind, and heat to drastically reduce your…
TL;DR A conversation with 4o about the potential demise of companies like Anthropic. As artificial…
Whether a company begins with a proof-of-concept or live deployment, they should start small, test…
Digital tools are not always superior. Here are some WIRED-tested agendas and notebooks to keep…