This blog post is co-written with Renuka Kumar and Thomas Matthew from Cisco.
Enterprise data by its very nature spans diverse data domains, such as security, finance, product, and HR. Data across these domains is often maintained across disparate data environments (such as Amazon Aurora, Oracle, and Teradata), with each managing hundreds or perhaps thousands of tables to represent and persist business data. These tables house complex domain-specific schemas, with instances of nested tables and multi-dimensional data that require complex database queries and domain-specific knowledge for data retrieval.
Recent advances in generative AI have led to the rapid evolution of natural language to SQL (NL2SQL) technology, which uses pre-trained large language models (LLMs) and natural language to generate database queries in the moment. Although this technology promises simplicity and ease of use for data access, converting natural language queries to complex database queries with accuracy and at enterprise scale has remained a significant challenge. For enterprise data, a major difficulty stems from the common case of database tables having embedded structures that require specific knowledge or highly nuanced processing (for example, an embedded XML formatted string). As a result, NL2SQL solutions for enterprise data are often incomplete or inaccurate.
This post describes a pattern that AWS and Cisco teams have developed and deployed that is viable at scale and addresses a broad set of challenging enterprise use cases. The methodology allows for the use of simpler, and therefore more cost-effective and lower latency, generative models by reducing the processing required for SQL generation.
Generative accuracy is paramount for NL2SQL use cases; inaccurate SQL queries might result in a sensitive enterprise data leak, or lead to inaccurate results impacting critical business decisions. Enterprise-scale data presents specific challenges for NL2SQL, including the following:

- Complex, domain-specific schemas spanning hundreds or thousands of tables across disparate data environments
- Embedded structures (such as XML-formatted strings) and nested or multi-dimensional data that require specialized, nuanced processing
- Queries that require domain-specific knowledge to select, join, and filter the right tables
The solution described in this post provides a set of optimizations that solve the aforementioned challenges while reducing the amount of work that has to be performed by an LLM for generating accurate output. This work extends upon the post Generating value from enterprise data: Best practices for Text2SQL and generative AI. That post has many useful recommendations for generating high-quality SQL, and the guidelines outlined might be sufficient for your needs, depending on the inherent complexity of the database schemas.
To achieve generative accuracy for complex scenarios, the solution breaks down NL2SQL generation into a sequence of focused steps and sub-problems, narrowing the generative focus to the appropriate data domain. Using data abstractions for complex joins and data structure, this approach enables the use of smaller and more affordable LLMs for the task. This approach results in reduced prompt size and complexity for inference, reduced response latency, and improved accuracy, while enabling the use of off-the-shelf pre-trained models.
The solution workflow narrows down the overall schema space into the data domain targeted by the user’s query. Each data domain corresponds to the set of database data structures (tables, views, and so on) that are commonly used together to answer a set of related user queries, for an application or business domain. The solution uses the data domain to construct prompt inputs for the generative LLM.
This pattern consists of the following elements:

- Resolving identifiers: Extract named resources from the user query and map them to database identifiers before SQL generation.
- Handling many or paired identifiers: Stage resolved identifiers in a temporary data structure so the generated SQL joins with it rather than inlining each ID.
- Abstracting complex data structures: Define temporary views and tables that hide complex joins and nested structures from the generative model.
Resolving identifiers involves extracting the named resources, as named entities, from the user's query and mapping the values to unique IDs appropriate for the target data source prior to NL2SQL generation. This can be implemented using natural language processing (NLP) or LLMs to apply named entity recognition (NER) capabilities to drive the resolution process. This optional step has the most value when there are many named resources and the lookup process is complex. For instance, in a user query such as "In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?", the named resources are 'allyson felix', 'isabelle werth', and 'nedo nadi'. This step allows for rapid and precise feedback to the user when a resource can't be resolved to an identifier (for example, due to ambiguity).
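The following is a minimal sketch of this extraction step, assuming a simple known-entity lookup in place of a full NER model or LLM; the function name and entity list are illustrative:

```python
# Hypothetical sketch: extract named resources from a user query.
# A production implementation would use an NLP NER model or an LLM;
# a simple known-entity scan is used here for illustration.
KNOWN_ATHLETES = {"isabelle werth", "nedo nadi", "allyson felix"}

def extract_named_resources(user_query: str) -> list[str]:
    """Return the named resources found in the query, lowercased."""
    query = user_query.lower()
    return sorted(name for name in KNOWN_ATHLETES if name in query)

resources = extract_named_resources(
    "In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?"
)
# resources == ['allyson felix', 'isabelle werth', 'nedo nadi']
```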
This optional process handles user queries with challenging sets of identifiers, such as those that come in pairs (for example, ID-type and ID-value) or queries with many identifiers, and offloads that burden from the LLM. Rather than having the generative LLM insert each unique ID into the SQL directly, the identifiers are made available by defining a temporary data structure (such as a temporary table) and a set of corresponding insert statements. The LLM is prompted with few-shot examples to generate SQL for the user query by joining with the temporary data structure, rather than attempting to inject the identifiers directly. This results in a simpler and more consistent query pattern whether there are one, many, or pairs of identifiers.
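The following sketch illustrates the idea; the table name, column, and resolved IDs are hypothetical:

```python
def build_id_preamble(resolved_ids: dict[str, int]) -> str:
    """Stage resolved identifiers in a temporary table via SQL statements.

    The generated query can then join against temp_ids instead of
    embedding each identifier inline.
    """
    stmts = ["CREATE TEMPORARY TABLE temp_ids (id INTEGER);"]
    stmts += [
        f"INSERT INTO temp_ids (id) VALUES ({id_});"
        for id_ in resolved_ids.values()
    ]
    return "\n".join(stmts)

# Hypothetical IDs for two of the athletes in our running example
preamble = build_id_preamble({"allyson felix": 116031, "isabelle werth": 94821})
```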
This step is aimed at simplifying complex data structures into a form that can be understood by the language model without having to decipher complex inter-data relationships. Complex data structures might appear as nested tables or lists within a table column, for instance.
We can define temporary data structures (such as views and tables) that abstract complex multi-table joins, nested structures, and more. These higher-level abstractions provide simplified data structures for query generation and execution. The top-level definitions of these abstractions are included as part of the prompt context for query generation, and the full definitions are provided to the SQL execution engine, along with the generated query. The resulting queries from this process can use simple set operations (such as IN, as opposed to complex joins) that LLMs are well trained on, thereby alleviating the need for nested joins and filters over complex data structures.
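For example, a temporary view might flatten a multi-table join once, so that the generated query reduces to a simple IN filter. The view and table names below are assumptions for illustration, not the repository's actual schema:

```python
# Illustrative preamble: a temporary view that flattens a multi-table join.
sql_preamble = """
CREATE TEMPORARY VIEW athlete_games AS
SELECT p.person_id, p.full_name, g.games_name
FROM person p
JOIN games_competitor gc ON gc.person_id = p.person_id
JOIN games g ON g.id = gc.games_id;
"""

# With the abstraction in place, the LLM only needs to generate a flat query:
generated_sql = """
SELECT DISTINCT full_name, games_name
FROM athlete_games
WHERE person_id IN (SELECT id FROM temp_ids);
"""
```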
Several of the optimizations noted earlier require making some of the specifics of the data domain explicit. Fortunately, this only has to be done when schemas and use cases are onboarded or updated. The benefit is higher generative accuracy, reduced generative latency and cost, and the ability to support arbitrarily complex query requirements.
To capture the semantics of a data domain, the following elements are defined:

- The schema definitions for the tables and views in the domain
- SQL for creating the domain-specific temporary structures (such as views and tables) that abstract complex joins and nested data
- Join hints and domain-specific rules for query generation
- Few-shot examples of user queries and their corresponding SQL
Accordingly, the prompt for generating the SQL is dynamic and constructed based on the data domain of the input question, with a set of specific definitions of data structure and rules appropriate for the input query. We refer to this set of elements as the data domain context. The purpose of the data domain context is to provide the necessary prompt metadata for the generative LLM. Examples of this, and the methods described in the previous sections, are included in the GitHub repository. There is one context for each data domain, as illustrated in the following figure.
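As a concrete sketch, a domain context might be represented as a structure like the following; the field names are illustrative rather than the repository's exact format:

```python
# Illustrative shape of a data domain context (field names are assumptions).
olympics_context = {
    "domain": "olympics",
    # Top-level schema definitions shown to the LLM (full DDL goes to the engine)
    "schema": "CREATE TABLE games (...); CREATE TABLE person (...);",
    "preamble_template": "CREATE TEMPORARY TABLE temp_ids (id INTEGER);",
    "join_hints": "games_competitor links person to games via person_id and games_id.",
    "rules": ["Join identifiers through temp_ids rather than inlining them."],
    "few_shot_examples": [
        {
            "question": "In what games did <athletes> compete?",
            "sql": "SELECT ... WHERE person_id IN (SELECT id FROM temp_ids);",
        },
    ],
}
```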
This section describes the execution flow of the solution. An example implementation of this pattern is available in the GitHub repository. Access the repository to follow along with the code.
To illustrate the execution flow, we use an example database with data about Olympics statistics and another with a company's employee vacation schedule. We follow the execution flow for the Olympics statistics domain using the user query "In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?" to show the inputs and outputs of the steps in the execution flow, as illustrated in the following figure.
The first step of the NL2SQL flow is to preprocess the request. The main objective of this step is to classify the user query into a domain. As explained earlier, this narrows down the scope of the problem to the appropriate data domain for SQL generation. Additionally, this step identifies and extracts the referenced named resources in the user query. These are then used to call the identity service in the next step to get the database identifiers for these named resources.
Using the example mentioned earlier, the inputs and outputs of this step are as follows:
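(The exact format is shown in the repository; the following is representative.)

```python
# Representative input and output of the preprocessing step.
user_query = "In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?"

preprocessing_output = {
    "domain": "olympics",
    "named_resources": ["allyson felix", "isabelle werth", "nedo nadi"],
}
```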
This step processes the named resource strings extracted in the previous step and resolves them to identifiers that can be used in database queries. As mentioned earlier, the named resources (for example, "group22", "user123", and "I") are looked up using solution-specific means, such as database lookups or an ID service.
The following code shows the execution of this step in our running example:
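(The repository contains the full implementation; this is a simplified stand-in with hypothetical IDs.)

```python
# Simplified identity resolution: in practice this is a database lookup
# or an ID service call; the mapping below uses hypothetical IDs.
def resolve_identifiers(named_resources: list[str]) -> dict[str, int]:
    lookup = {"allyson felix": 116031, "isabelle werth": 94821, "nedo nadi": 54563}
    unresolved = [n for n in named_resources if n not in lookup]
    if unresolved:
        # Fast, precise feedback to the user on unresolvable resources
        raise ValueError(f"Could not resolve: {unresolved}")
    return {name: lookup[name] for name in named_resources}

resolved = resolve_identifiers(["allyson felix", "isabelle werth", "nedo nadi"])
```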
This step is pivotal in this pattern. Having obtained the domain and the named resources along with their looked-up IDs, we use the corresponding context for that domain to generate the following:

- The prompt for the SQL-generating LLM
- A SQL preamble script that creates the required domain-specific temporary structures
To create the prompt for the LLM, this step assembles the system prompt, the user prompt, and the received user query, along with the domain-specific schema definition (including any newly created temporary tables and join hints) and the few-shot examples for the domain. Other than the user query, which is received as an input, these components are based on the values provided in the context for that domain.
A SQL script for creating the required domain-specific temporary structures (such as views and tables) is constructed from the information in the context. The domain-specific schema in the LLM prompt, the join hints, and the few-shot examples are all aligned with the schema that results from running this script. In our example, this step produces a dictionary with two keys, llm_prompt and sql_preamble; the full output can be seen in the Jupyter notebook.
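(A simplified sketch of this assembly follows; the helper and field names are illustrative, and the notebook shows the exact code.)

```python
# Assemble the LLM prompt and the SQL preamble from the domain context.
def generate_prompt(domain_context: dict, user_query: str,
                    resolved: dict[str, int]) -> dict:
    id_inserts = "\n".join(
        f"INSERT INTO temp_ids (id) VALUES ({id_});" for id_ in resolved.values()
    )
    llm_prompt = (
        f"{domain_context['schema']}\n"
        f"Join hints: {domain_context['join_hints']}\n"
        f"Examples: {domain_context['few_shot_examples']}\n"
        f"Question: {user_query}"
    )
    sql_preamble = f"{domain_context['preamble_template']}\n{id_inserts}"
    return {"llm_prompt": llm_prompt, "sql_preamble": sql_preamble}
```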
Now that the prompt has been prepared along with any information necessary to provide the proper context to the LLM, we provide that information to the SQL-generating LLM in this step. The goal is to have the LLM output SQL with the correct join structure, filters, and columns. See the following code:
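(A minimal sketch using the Amazon Bedrock Converse API follows; the model ID and inference settings are illustrative.)

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_sql(llm_prompt: str) -> str:
    """Invoke the SQL-generating LLM on Amazon Bedrock."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": llm_prompt}]}],
        inferenceConfig={"temperature": 0.0},  # deterministic SQL generation
    )
    return response["output"]["message"]["content"][0]["text"]
```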
After the SQL query is generated by the LLM, we can send it off to the next step. At this step, the SQL preamble and the generated SQL are merged to create a complete SQL script for execution. The complete SQL script is then executed against the data store, a response is fetched, and then the response is passed back to the client or end-user. See the following code:
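(A minimal sketch follows, assuming a SQLite demo database; any engine that supports temporary structures works the same way.)

```python
import sqlite3

def execute(sql_preamble: str, generated_sql: str, db_path: str = "olympics.db"):
    """Run the preamble (temporary structures plus ID inserts), then the query."""
    with sqlite3.connect(db_path) as conn:
        conn.executescript(sql_preamble)
        return conn.execute(generated_sql).fetchall()
```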
Overall, our tests have shown several benefits, such as:

- High SQL generation accuracy using smaller, off-the-shelf models
- Reduced prompt size and complexity, and correspondingly lower inference latency and cost
- The ability to support arbitrarily complex query and schema requirements
We attribute the solution's success with these capable but lightweight models (compared to a Meta Llama 70B variant or Anthropic's Claude Sonnet) to the points noted earlier, with the reduced LLM task complexity being the driving force. The implementation code demonstrates how this is achieved. Overall, by using the optimizations outlined in this post, natural language SQL generation for enterprise data is far more feasible than it would be otherwise.
In this section, we illustrate how you might implement the architecture on AWS. The end-user sends their natural language queries to the NL2SQL solution using a REST API. Amazon API Gateway is used to provision the REST API, which can be secured by Amazon Cognito. The API is linked to an AWS Lambda function, which implements and orchestrates the processing steps described earlier in a serverless manner, using a programming language of the user's choice (such as Python). In this example implementation, LLM calls are made through Amazon Bedrock using Anthropic's Claude 3 Haiku.
Briefly, the processing steps are as follows:

1. Preprocess the request to classify the user query into a data domain and extract the named resources.
2. Resolve the named resources to database identifiers.
3. Assemble the LLM prompt and the SQL preamble from the domain context.
4. Generate the SQL query using the LLM on Amazon Bedrock.
5. Execute the complete SQL script against the data store and return the results.
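A minimal sketch of a Lambda handler tying these steps together follows; preprocess and load_domain_context are hypothetical helpers, and the other functions reference the earlier sketches:

```python
# Hypothetical Lambda handler orchestrating the pipeline (error handling omitted).
def lambda_handler(event, context):
    user_query = event["query"]
    domain, named_resources = preprocess(user_query)      # classify + extract
    resolved = resolve_identifiers(named_resources)       # map to database IDs
    domain_context = load_domain_context(domain)          # fetch domain metadata
    prompt = generate_prompt(domain_context, user_query, resolved)
    sql = generate_sql(prompt["llm_prompt"])              # LLM via Amazon Bedrock
    rows = execute(prompt["sql_preamble"], sql)           # run against the data store
    return {"statusCode": 200, "body": {"results": rows}}
```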
The final result is obtained by running the preceding pipeline on Lambda. When the workflow is complete, the result is provided as a response to the REST API request.
The following diagram illustrates the solution architecture.
In this post, the AWS and Cisco teams unveiled a new methodical approach that addresses the challenges of enterprise-grade SQL generation. The teams were able to reduce the complexity of the NL2SQL process while delivering higher accuracy and better overall performance.
Though we’ve walked you through an example use case focused on answering questions about Olympic athletes, this versatile pattern can be seamlessly adapted to a wide range of business applications and use cases. The demo code is available in the GitHub repository. We invite you to leave any questions and feedback in the comments.
Renuka Kumar is a Senior Engineering Technical Lead at Cisco, where she has architected and led the development of Cisco's Cloud Security BU's AI/ML capabilities over the last 2 years, including launching first-to-market innovations in this space. She has over 20 years of experience in several cutting-edge domains, with over a decade in security and privacy. She holds a PhD in Computer Science and Engineering from the University of Michigan.
Toby Fotherby is a Senior AI and ML Specialist Solutions Architect at AWS, helping customers use the latest advances in AI/ML and generative AI to scale their innovations. He has over a decade of cross-industry expertise leading strategic initiatives and master’s degrees in AI and Data Science. Toby also leads a program training the next generation of AI Solutions Architects.
Shweta Keshavanarayana is a Senior Customer Solutions Manager at AWS. She works with AWS Strategic Customers and helps them in their cloud migration and modernization journey. Shweta is passionate about solving complex customer challenges using creative solutions. She holds an undergraduate degree in Computer Science & Engineering. Beyond her professional life, she volunteers as a team manager for her sons’ U9 cricket team, while also mentoring women in tech and serving the local community.