Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. In the process of working on their ML tasks, data scientists typically start their workflow by discovering relevant data sources and connecting to them. They then use SQL to explore, analyze, visualize, and integrate data from various sources before using it in their ML training and inference. Previously, data scientists often found themselves juggling multiple tools to support SQL in their workflow, which hindered productivity.
We’re excited to announce that JupyterLab notebooks in SageMaker Studio now come with built-in support for SQL. Data scientists can now:
In addition, administrators can securely manage connections to these data services, allowing data scientists to access authorized data without the need to manage credentials manually.
In this post, we guide you through setting up this feature in SageMaker Studio, and walk you through various capabilities of this feature. Then we show how you can enhance the in-notebook SQL experience using Text-to-SQL capabilities provided by advanced large language models (LLMs) to write complex SQL queries using natural language text as input. Finally, to enable a broader audience of users to generate SQL queries from natural language input in their notebooks, we show you how to deploy these Text-to-SQL models using Amazon SageMaker endpoints.
With SageMaker Studio JupyterLab notebook’s SQL integration, you can now connect to popular data sources like Snowflake, Athena, Amazon Redshift, and Amazon DataZone. This new feature enables you to perform various functions.
For example, you can visually explore data sources like databases, tables, and schemas directly from your JupyterLab ecosystem. If your notebook environments are running on SageMaker Distribution 1.6 or higher, look for a new widget on the left side of your JupyterLab interface. This addition enhances data accessibility and management within your development environment.
If you’re not currently on suggested SageMaker Distribution (1.5 or lower) or in a custom environment, refer to appendix for more information.
After you have set up connections (illustrated in the next section), you can list data connections, browse databases and tables, and inspect schemas.
The SageMaker Studio JupyterLab built-in SQL extension also enables you to run SQL queries directly from a notebook. Jupyter notebooks can differentiate between SQL and Python code using the %%sm_sql
magic command, which must be placed at the top of any cell that contains SQL code. This command signals to JupyterLab that the following instructions are SQL commands rather than Python code. The output of a query can be displayed directly within the notebook, facilitating seamless integration of SQL and Python workflows in your data analysis.
The output of a query can be displayed visually as HTML tables, as shown in the following screenshot.
They can also be written to a pandas DataFrame.
Make sure you have satisfied the following prerequisites in order to use the SageMaker Studio notebook SQL experience:
%load_ext amazon_sagemaker_sql_magic
to enable this feature. Additionally, you can run the %sm_sql?
command to view a comprehensive list of supported options for querying from a SQL cell. These options include setting a default query limit of 1,000, running a full extraction, and injecting query parameters, among others. This setup allows for flexible and efficient SQL data manipulation directly within your notebook environment.The built-in SQL browsing and execution capabilities of SageMaker Studio are enhanced by AWS Glue connections. An AWS Glue connection is an AWS Glue Data Catalog object that stores essential data such as login credentials, URI strings, and virtual private cloud (VPC) information for specific data stores. These connections are used by AWS Glue crawlers, jobs, and development endpoints to access various types of data stores. You can use these connections for both source and target data, and even reuse the same connection across multiple crawlers or extract, transform, and load (ETL) jobs.
To explore SQL data sources in the left pane of SageMaker Studio, you first need to create AWS Glue connection objects. These connections facilitate access to different data sources and allow you to explore their schematic data elements.
In the following sections, we walk through the process of creating SQL-specific AWS Glue connectors. This will enable you to access, view, and explore datasets across a variety of data stores. For more detailed information about AWS Glue connections, refer to Connecting to data.
The only way to bring data sources into SageMaker Studio is with AWS Glue connections. You need to create AWS Glue connections with specific connection types. As of this writing, the only supported mechanism of creating these connections is using the AWS Command Line Interface (AWS CLI).
When connecting to different data sources in AWS Glue, you must first create a JSON file that defines the connection properties—referred to as the connection definition file. This file is crucial for establishing an AWS Glue connection and should detail all the necessary configurations for accessing the data source. For security best practices, it’s recommended to use Secrets Manager to securely store sensitive information such as passwords. Meanwhile, other connection properties can be managed directly through AWS Glue connections. This approach makes sure that sensitive credentials are protected while still making the connection configuration accessible and manageable.
The following is an example of a connection definition JSON:
When setting up AWS Glue connections for your data sources, there are a few important guidelines to follow to provide both functionality and security:
PythonProperties
key, make sure all properties are stringified key-value pairs. It’s crucial to properly escape double-quotes by using the backslash () character where necessary. This helps maintain the correct format and avoid syntax errors in your JSON.PythonProperties
, it is advisable not to include sensitive details like passwords directly in these properties. Instead, use Secrets Manager for handling sensitive information. This approach secures your sensitive data by storing it in a controlled and encrypted environment, away from the main configuration files.After you include all the necessary fields in your connection definition JSON file, you’re ready to establish an AWS Glue connection for your data source using the AWS CLI and the following command:
This command initiates a new AWS Glue connection based on the specifications detailed in your JSON file. The following is a quick breakdown of the command components:
You should be able to create AWS Glue connections with the preceding AWS CLI command from your Studio JupyterLab terminal. On the File menu, choose New and Terminal.
If the create-connection
command runs successfully, you should see your data source listed in the SQL browser pane. If you don’t see your data source listed, choose Refresh to update the cache.
In this section, we focus on integrating a Snowflake data source with SageMaker Studio. Creating Snowflake accounts, databases, and warehouses falls outside the scope of this post. To get started with Snowflake, refer to the Snowflake user guide. In this post, we concentrate on creating a Snowflake definition JSON file and establishing a Snowflake data source connection using AWS Glue.
You can connect to your Snowflake account by either using a user ID and password or using private keys. To connect with a user ID and password, you need to securely store your credentials in Secrets Manager. As mentioned previously, although it’s possible to embed this information under PythonProperties, it is not recommended to store sensitive information in plain text format. Always make sure that sensitive data is handled securely to avoid potential security risks.
To store information in Secrets Manager, complete the following steps:
sm-sql-snowflake-secret
.As discussed earlier, AWS Glue connections are essential for accessing any connection from SageMaker Studio. You can find a list of all supported connection properties for Snowflake. The following is a sample connection definition JSON for Snowflake. Replace the placeholder values with the appropriate values before saving it to disk:
To create an AWS Glue connection object for the Snowflake data source, use the following command:
This command creates a new Snowflake data source connection in your SQL browser pane that’s browsable, and you can run SQL queries against it from your JupyterLab notebook cell.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service that simplifies and reduces the cost of analyzing all your data using standard SQL. The procedure for creating an Amazon Redshift connection closely mirrors that for a Snowflake connection.
Similar to the Snowflake setup, to connect to Amazon Redshift using a user ID and password, you need to securely store the secrets information in Secrets Manager. Complete the following steps:
sm-sql-redshift-secret
.By following these steps, you make sure your connection credentials are handled securely, using the robust security features of AWS to manage sensitive data effectively.
To set up a connection with Amazon Redshift using a JSON definition, fill in the necessary fields and save the following JSON configuration to disk:
To create an AWS Glue connection object for the Redshift data source, use the following AWS CLI command:
This command creates a connection in AWS Glue linked to your Redshift data source. If the command runs successfully, you will be able to see your Redshift data source within the SageMaker Studio JupyterLab notebook, ready for running SQL queries and performing data analysis.
Athena is a fully managed SQL query service from AWS that enables analysis of data stored in Amazon S3 using standard SQL. To set up an Athena connection as a data source in the JupyterLab notebook’s SQL browser, you need to create an Athena sample connection definition JSON. The following JSON structure configures the necessary details to connect to Athena, specifying the data catalog, the S3 staging directory, and the Region:
To create an AWS Glue connection object for the Athena data source, use the following AWS CLI command:
If the command is successful, you will be able to access Athena data catalog and tables directly from the SQL browser within your SageMaker Studio JupyterLab notebook.
If you have multiple data sources integrated into SageMaker Studio through the built-in SQL browser and the notebook SQL feature, you can quickly run queries and effortlessly switch between data source backends in subsequent cells within a notebook. This capability allows for seamless transitions between different databases or data sources during your analysis workflow.
You can run queries against a diverse collection of data source backends and bring the results directly into the Python space for further analysis or visualization. This is facilitated by the %%sm_sql
magic command available in SageMaker Studio notebooks. To output the results of your SQL query into a pandas DataFrame, there are two options:
%%sm_sql
command: The following diagram illustrates this workflow and showcases how you can effortlessly run queries across various sources in subsequent notebook cells, as well as train a SageMaker model using training jobs or directly within the notebook using local compute. Additionally, the diagram highlights how the built-in SQL integration of SageMaker Studio simplifies the processes of extraction and building directly within the familiar environment of a JupyterLab notebook cell.
SQL is a complex language that requires an understanding of databases, tables, syntaxes, and metadata. Today, generative artificial intelligence (AI) can enable you to write complex SQL queries without requiring in-depth SQL experience. The advancement of LLMs has significantly impacted natural language processing (NLP)-based SQL generation, allowing for the creation of precise SQL queries from natural language descriptions—a technique referred to as Text-to-SQL. However, it is essential to acknowledge the inherent differences between human language and SQL. Human language can sometimes be ambiguous or imprecise, whereas SQL is structured, explicit, and unambiguous. Bridging this gap and accurately converting natural language into SQL queries can present a formidable challenge. When provided with appropriate prompts, LLMs can help bridge this gap by understanding the intent behind the human language and generating accurate SQL queries accordingly.
With the release of the SageMaker Studio in-notebook SQL query feature, SageMaker Studio makes it straightforward to inspect databases and schemas, and author, run, and debug SQL queries without ever leaving the Jupyter notebook IDE. This section explores how the Text-to-SQL capabilities of advanced LLMs can facilitate the generation of SQL queries using natural language within Jupyter notebooks. We employ the cutting-edge Text-to-SQL model defog/sqlcoder-7b-2
in conjunction with Jupyter AI, a generative AI assistant specifically designed for Jupyter notebooks, to create complex SQL queries from natural language. By using this advanced model, we can effortlessly and efficiently create complex SQL queries using natural language, thereby enhancing our SQL experience within notebooks.
To begin prototyping, you need the following:
defog/sqlcoder-7b-2
model, a 7B parameter model, using an ml.g5.2xlarge instance is recommended. Alternatives such as defog/sqlcoder-70b-alph
a or defog/sqlcoder-34b-alpha
are also viable for natural language to SQL conversion, but larger instance types may be required for prototyping. Make sure you have the quota to launch a GPU-backed instance by navigating to the Service Quotas console, searching for SageMaker, and searching for Studio JupyterLab Apps running on <instance type>
.Launch a new GPU-backed JupyterLab Space from your SageMaker Studio. It’s recommended to create a new JupyterLab Space with at least 75 GB of Amazon Elastic Block Store (Amazon EBS) storage for a 7B parameter model.
AutoModelForCausalLM
class from huggingface/transformers to automatically download models and pin them to your local GPUs. The model weights will be stored in your local machine’s cache. See the following code: After the model has been fully downloaded and loaded into memory, you should observe an increase in GPU utilization on your local machine. This indicates that the model is actively using the GPU resources for computational tasks. You can verify this in your own JupyterLab space by running nvidia-smi
(for a one-time display) or nvidia-smi —loop=1
(to repeat every second) from your JupyterLab terminal.
Text-to-SQL models excel at understanding the intent and context of a user’s request, even when the language used is conversational or ambiguous. The process involves translating natural language inputs into the correct database schema elements, such as table names, column names, and conditions. However, an off-the-shelf Text-to-SQL model will not inherently know the structure of your data warehouse, the specific database schemas, or be able to accurately interpret the content of a table based solely on column names. To effectively use these models to generate practical and efficient SQL queries from natural language, it is necessary to adapt the SQL text-generation model to your specific warehouse database schema. This adaptation is facilitated through the use of LLM prompts. The following is a recommended prompt template for the defog/sqlcoder-7b-2 Text-to-SQL model, divided into four parts:
An example of the database schema and prompt used in this section is available in the GitHub Repo.
Prompt engineering is not just about forming questions or statements; it’s a nuanced art and science that significantly impacts the quality of interactions with an AI model. The way you craft a prompt can profoundly influence the nature and usefulness of the AI’s response. This skill is pivotal in maximizing the potential of AI interactions, especially in complex tasks requiring specialized understanding and detailed responses.
It’s important to have the option to quickly build and test a model’s response for a given prompt and optimize the prompt based on the response. JupyterLab notebooks provide the ability to receive instant model feedback from a model running on local compute and optimize the prompt and tune a model’s response further or change a model entirely. In this post, we use a SageMaker Studio JupyterLab notebook backed by ml.g5.2xlarge’s NVIDIA A10G 24 GB GPU to run Text-to-SQL model inference on the notebook and interactively build our model prompt until the model’s response is sufficiently tuned to provide responses that are directly executable in JupyterLab’s SQL cells. To run model inference and simultaneously stream model responses, we use a combination of model.generate
and TextIteratorStreamer
as defined in the following code:
The model’s output can be decorated with SageMaker SQL magic %%sm_sql ...
, which allows the JupyterLab notebook to identify the cell as a SQL cell.
At the end of the prototyping stage, we have selected our preferred Text-to-SQL LLM, an effective prompt format, and an appropriate instance type for hosting the model (either single-GPU or multi-GPU). SageMaker facilitates the scalable hosting of custom models through the use of SageMaker endpoints. These endpoints can be defined according to specific criteria, allowing for the deployment of LLMs as endpoints. This capability enables you to scale the solution to a wider audience, allowing users to generate SQL queries from natural language inputs using custom hosted LLMs. The following diagram illustrates this architecture.
To host your LLM as a SageMaker endpoint, you generate several artifacts.
The first artifact is model weights. SageMaker Deep Java Library (DJL) Serving containers allow you to set up configurations through a meta serving.properties file, which enables you to direct how models are sourced—either directly from the Hugging Face Hub or by downloading model artifacts from Amazon S3. If you specify model_id=defog/sqlcoder-7b-2
, DJL Serving will attempt to directly download this model from the Hugging Face Hub. However, you may incur networking ingress/egress charges each time the endpoint is deployed or elastically scaled. To avoid these charges and potentially speed up the download of model artifacts, it is recommended to skip using model_id
in serving.properties
and save model weights as S3 artifacts and only specify them with s3url=s3://path/to/model/bin
.
Saving a model (with its tokenizer) to disk and uploading it to Amazon S3 can be accomplished with just a few lines of code:
You also use a database prompt file. In this setup, the database prompt is composed of Task
, Instructions
, Database Schema
, and Answer sections
. For the current architecture, we allocate a separate prompt file for each database schema. However, there is flexibility to expand this setup to include multiple databases per prompt file, allowing the model to run composite joins across databases on the same server. During our prototyping stage, we save the database prompt as a text file named <Database-Glue-Connection-Name>.prompt
, where Database-Glue-Connection-Name
corresponds to the connection name visible in your JupyterLab environment. For instance, this post refers to a Snowflake connection named Airlines_Dataset
, so the database prompt file is named Airlines_Dataset.prompt
. This file is then stored on Amazon S3 and subsequently read and cached by our model serving logic.
Moreover, this architecture permits any authorized users of this endpoint to define, store, and generate natural language to SQL queries without the need for multiple redeployments of the model. We use the following example of a database prompt to demonstrate the Text-to-SQL functionality.
Next, you generate custom model service logic. In this section, you outline a custom inference logic named model.py. This script is designed to optimize the performance and integration of our Text-to-SQL services:
Additionally, we include a serving.properties
file, which acts as a global configuration file for models hosted using DJL serving. For more information, refer to Configurations and settings.
Lastly, you can also include a requirements.txt
file to define additional modules required for inference and package everything into a tarball for deployment.
See the following code:
Jupyter AI is an open source tool that brings generative AI to Jupyter notebooks, offering a robust and user-friendly platform for exploring generative AI models. It enhances productivity in JupyterLab and Jupyter notebooks by providing features like the %%ai magic for creating a generative AI playground inside notebooks, a native chat UI in JupyterLab for interacting with AI as a conversational assistant, and support for a wide array of LLMs from providers like Amazon Titan, AI21, Anthropic, Cohere, and Hugging Face or managed services like Amazon Bedrock and SageMaker endpoints. For this post, we use Jupyter AI’s out-of-the-box integration with SageMaker endpoints to bring the Text-to-SQL capability into JupyterLab notebooks. The Jupyter AI tool comes pre-installed in all SageMaker Studio JupyterLab Spaces backed by SageMaker Distribution images; end-users are not required to make any additional configurations to start using the Jupyter AI extension to integrate with a SageMaker hosted endpoint. In this section, we discuss the two ways to use the integrated Jupyter AI tool.
Jupyter AI’s %%ai
magic command allows you to transform your SageMaker Studio JupyterLab notebooks into a reproducible generative AI environment. To begin using AI magics, make sure you have loaded the jupyter_ai_magics extension to use %%ai
magic, and additionally load amazon_sagemaker_sql_magic
to use %%sm_sql
magic:
To run a call to your SageMaker endpoint from your notebook using the %%ai
magic command, provide the following parameters and structure the command as follows:
For example, the command in a Jupyter notebook cell might look like the following code:
Alternatively, you can interact with SageMaker endpoints through a built-in user interface, simplifying the process of generating queries or engaging in dialogue. Before beginning to chat with your SageMaker endpoint, configure the relevant settings in Jupyter AI for the SageMaker endpoint, as shown in the following screenshot.
SageMaker Studio now simplifies and streamlines the data scientist workflow by integrating SQL support into JupyterLab notebooks. This allows data scientists to focus on their tasks without the need to manage multiple tools. Furthermore, the new built-in SQL integration in SageMaker Studio enables data personas to effortlessly generate SQL queries using natural language text as input, thereby accelerating their workflow.
We encourage you to explore these features in SageMaker Studio. For more information, refer to Prepare data with SQL in Studio.
If you’re not using a SageMaker Distribution image or using Distribution images 1.5 or below, run the following commands to enable the SQL browsing feature inside your JupyterLab environment:
JupyterLab widgets allow for relocation. Depending on your preference, you can move widgets to the either side of JupyterLab widgets pane. If you prefer, you can move the direction of the SQL widget to the opposite side (right to left) of the sidebar with a simple right-click on the widget icon and choosing Switch Sidebar Side.
TL;DR A conversation with 4o about the potential demise of companies like Anthropic. As artificial…
Whether a company begins with a proof-of-concept or live deployment, they should start small, test…
Digital tools are not always superior. Here are some WIRED-tested agendas and notebooks to keep…
Machine learning (ML) models are built upon data.
Editor’s note: This is the second post in a series that explores a range of…
David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*,…