Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge. This generative AI task is called text-to-SQL, which generates SQL queries from natural language processing (NLP) and converts text into semantically correct SQL. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.

With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user’s needs into the SQL that’s generated. Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the larger adoption of centralized analytics solutions like data lakes and warehouses, complexity rises with different table names and other metadata that is required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.

After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how it is straightforward to incorporate different data sources to your SQL queries.

Solution overview

There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.

We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.

The multi-step component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is sent for syntax errors. We use Athena error messages to enrich our prompt for the LLM for more accurate and effective corrections in the generated SQL.

You can consider the error messages occasionally coming from Athena like feedback. The cost implications of an error correction step are negligible compared to the value delivered. You can even include these corrective steps as supervised reinforced learning examples to fine-tune your LLMs. However, we did not cover this flow in our post for simplicity purposes.

Note that there is always inherent risk of having inaccuracies, which naturally comes with generative AI solutions. Even if Athena error messages are highly effective to mitigate this risk, you can add more controls and views, such as human feedback or example queries for fine-tuning, to further minimize such risks.

Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.

The following diagram illustrates the solution architecture.

The solution architecture and the process flow is shown.

Figure 1. The solution architecture and process flow.

The process flow includes the following steps:

  1. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  2. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an Amazon OpenSearch Serverless vector store, which serves as our knowledge base in our RAG framework.

At this stage, the process is ready to receive the query in natural language. Steps 7–9 represent a correction loop, if applicable.

  1. The user enters their query in natural language. You can use any web application to provide the chat UI. Therefore, we did not cover the UI details in our post.
  2. The solution applies a RAG framework via similarity search, which adds the extra context from the metadata from the vector database. This table is used for finding the correct table, database, and attributes.
  3. The query is merged with the context and sent to Anthropic Claude v2.1 on Amazon Bedrock.
  4. The model gets the generated SQL query and connects to Athena to validate the syntax.
  5. If Athena provides an error message that mentions the syntax is incorrect, the model uses the error text from Athena’s response.
  6. The new prompt adds Athena’s response.
  7. The model creates the corrected SQL and continues the process. This iteration can be performed multiple times.
  8. Finally, we run the SQL using Athena and generate output. Here, the output is presented to the user. For the sake of architectural simplicity, we did not show this step.

Prerequisites

For this post, you should complete the following prerequisites:

  1. Have an AWS account.
  2. Install the AWS Command Line Interface (AWS CLI).
  3. Set up the SDK for Python (Boto3).
  4. Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
  5. Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an OpenSearch Serverless vector store.

Implement the solution

You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:

  1. Create the knowledge base in OpenSearch Service for the RAG framework:
    def add_documnets(self,index_name: str,file_name:str):
    
    documents = JSONLoader(file_path=file_name, jq_schema='.', text_content=False, json_lines=False).load()
    docs = OpenSearchVectorSearch.from_documents(embedding=self.embeddings, opensearch_url=self.opensearch_domain_endpoint, http_auth=self.http_auth, documents=documents, index_name=index_name, engine="faiss")
    index_exists = self.check_if_index_exists(index_name,aws_region,opensearch_domain_endpoint,http_auth)
    if not index_exists :
    logger.info(f'index :{index_name} is not existing ')
    sys.exit(-1)
    else:
    logger.info(f'index :{index_name} Got created')

  2. Build the prompt (final_question) by combining the user input in natural language (user_query), the relevant metadata from the vector store (vector_search_match), and our instructions (details):
    def userinput(user_query):
    logger.info(f'Searching metadata from vector store')
    
    # vector_search_match=rqst.getEmbeddding(user_query)
    vector_search_match = rqst.getOpenSearchEmbedding(index_name,user_query)
    
    # print(vector_search_match)
    details = "It is important that the SQL query complies with Athena syntax. 
    During join if column name are same please use alias ex llm.customer_id 
    in select statement. It is also important to respect the type of columns: 
    if a column is string, the value should be enclosed in quotes. 
    If you are writing CTEs then include all the required columns. 
    While concatenating a non string column, make sure cast the column to string. 
    For date columns comparing to string , please cast the string input."
    final_question = "nnHuman:"+details + vector_search_match + user_query+ "nnAssistant:"
    answer = rqst.generate_sql(final_question)
    return answer

  3. Invoke Amazon Bedrock for the LLM (Claude v2) and prompt it to generate the SQL query. In the following code, it makes multiple attempts in order to illustrate the self-correction step:x
    try:
    logger.info(f'we are in Try block to generate the sql and count is :{attempt + 1}')
    generated_sql = self.llm.predict(prompt)
    query_str = generated_sql.split("```")[1]
    query_str = " ".join(query_str.split("n")).strip()
    sql_query = query_str[3:] if query_str.startswith("sql") else query_str
    
    # return sql_query
    syntaxcheckmsg=rqstath.syntax_checker(sql_query)
    if syntaxcheckmsg=='Passed':
    logger.info(f'syntax checked for query passed in attempt number :{attempt + 1}')
    return sql_query

  4. If any issues are received with the generated SQL query ({sqlgenerated}) from the Athena response ({syntaxcheckmsg}), the new prompt (prompt) is generated based on the response and the model tries again to generate the new SQL:
    else:
    prompt = f"""{prompt} 
    This is syntax error: {syntaxcheckmsg}.
    To correct this, please generate an alternative SQL query which will correct the syntax error. The updated query should take care of all the syntax issues encountered. Follow the instructions mentioned above to remediate the error.
    Update the below SQL query to resolve the issue:
    {sqlgenerated}
    Make sure the updated SQL query aligns with the requirements provided in the initial question."""
    prompts.append(prompt)

  5. After the SQL is generated, the Athena client is invoked to run and generate the output:
    query_execution = self.athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration=result_config,
    QueryExecutionContext=query_execution_context, )
    execution_id = query_execution["QueryExecutionId"]

Test the solution

In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.

To test our text-to-SQL, we use two datasets available from IMDB. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following Spark SQL snippet to create tables in AWS Glue. For this example, we use title_ratings and title:

source_title_ratings3_path = 's3://llm-athena-output/input_data/title.ratings.tsv'
target_title_s3_path='s3://llm-athena-output/output_data/imdb_stg/title_ratings'
source_titleratingdf=spark.read.csv(source_title_ratings3_path,sep="t",header=True)
source_titleratingdf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title_ratings')

Store data in Amazon S3 and metadata in AWS Glue

In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.

For our first query, we provide the input “I am new to this. Can you help me see all the tables and columns in imdb schema?”

The following is the generated query:

WITH tables AS (
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'imdb_stg'),
columns AS (
SELECT
c.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM information_schema.columns c
WHERE c.table_schema = 'imdb_stg')
SELECT
t.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM tables t
INNER JOIN columns c
ON t.table_name = c.table_name
ORDER BY
t.table_name,
c.ordinal_position
LIMIT 10;

The following screenshot and code show our output.

Table showing the attributes of the example IMBD dataset.

table_name column_name data_type
0 title titleid varchar
1 title ordering integer
2 title title varchar
3 title region varchar
4 title language varchar

For our second query, we ask “Show me all the title and details in US region whose rating is more than 9.5.”

The following is our generated query:

WITH us_titles AS (
SELECT t.title, t.region, tr.averageRating, tr.numVotes
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.region = 'US' AND cast(tr.averageRating as varchar) > '9.5'
)
SELECT title, region, averageRating, numVotes
FROM us_titles
LIMIT 100;

The response is as follows.

A table showing the results of the sql query

title region averageRating numVotes
0 The Way You Saw Me US 9.7 8
1 The Brother Side of the Wake US 9.6 20
2 Ignis Fatuus US 9.6 11
3 Love and Hip Hop Atlanta US 9.9 11
4 ronny/lily US 9.7 14781

For our third query, we enter “Great Response! Now show me all the original type titles having ratings more than 7.5 and not in the US region.”

The following query is generated:

WITH titles AS (
SELECT t.titleId,
t.title,
t.types,
t.isOriginalTitle,
cast(tr.averageRating as decimal(3,1)) as averageRating,
tr.numVotes,
t.region
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.isOriginalTitle = '1'
AND cast(tr.averageRating as decimal(3,1)) > 7.5
AND t.region != 'US')
SELECT *
FROM titles
LIMIT 100;

We get the following results.

A single row showing the result of the SQL query.

titleId title types isOriginalTitle averageRating numVotes region
0 tt0986264 Taare Zameen Par original 1 8.3 203760 XWW

Generate self-corrected SQL

This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can’t be resolved:

Status : {'State': 'FAILED', 'StateChangeReason': "COLUMN_NOT_FOUND: line 1:50: Column 'table_description' 
cannot be resolved or requester is not authorized to access requested resources",
'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 501000, tzinfo=tzlocal()),
'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 778000, tzinfo=tzlocal()),
'AthenaError': {'ErrorCategory': 2, 'ErrorType': 1006, 'Retryable': False, 'ErrorMessage': "COLUMN_NOT_FOUND: 
line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to 
access requested resources"}}
COLUMN_NOT_FOUND: line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to access requested resources
Try Count: 2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,Try Count: 2
we are in Try block to generate the sql and count is :2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,we are in Try block to generate the sql and count is :2
Executing: Explain WITH tables AS ( SELECT table_name FROM information_schema.tables WHERE table_schema = 'imdb_stg' ), columns AS ( SELECT c.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM information_schema.columns c WHERE c.table_schema = 'imdb_stg' ) SELECT t.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM tables t INNER JOIN columns c ON t.table_name = c.table_name ORDER BY t.table_name, c.ordinal_position LIMIT 10;
I am checking the syntax here
execution_id: 904857c3-b7ac-47d0-8e7e-6b9d0456099b
Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 29, 537000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 30, 183000, tzinfo=tzlocal())}
syntax checked for query passed in tries number :2

Using the solution with other data sources

To use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), and JDBC-compliant relational data sources such MySQL, and PostgreSQL under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution. For more information, refer to Query any data source with Amazon Athena’s new federated query.

Clean up

To clean up the resources, you can start by cleaning up your S3 bucket where the data resides. Unless your application invokes Amazon Bedrock, it will not incur any cost. For the sake of infrastructure management best practices, we recommend deleting the resources created in this demonstration.

Conclusion

In this post, we presented a solution that allows you to use NLP to generate complex SQL queries with a variety of resources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. Additionally, we used the metadata in the AWS Glue Data Catalog to consider the table names asked in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.

Amazon Bedrock is at the center of this solution. Amazon Bedrock can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.


About the Authors

Sanjeeb Panda is a Data and ML engineer at Amazon. With the background in AI/ML, Data Science and Big Data, Sanjeeb design and develop innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb Panda is an avid foodie and music enthusiast.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.