Last year, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.
Many RStudio on SageMaker users are also users of Amazon Redshift, a fully managed, petabyte-scale, massively parallel data warehouse for data storage and analytical workloads. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Users can also interact with data via ODBC, JDBC, or the Amazon Redshift Data API.
Using RStudio on SageMaker and Amazon Redshift together can help you efficiently perform analysis on large datasets in the cloud. However, working with data in the cloud can present challenges, such as the need to remove organizational data silos, maintain security and compliance, and reduce complexity by standardizing tooling. AWS offers tools such as RStudio on SageMaker and Amazon Redshift to help tackle these challenges.
In this blog post, we show you how to use both of these services together to efficiently perform analysis on massive datasets in the cloud while addressing the challenges mentioned above. This blog focuses on RStudio on Amazon SageMaker and the R language, with business analysts, data engineers, data scientists, and developers who use R and Amazon Redshift as the target audience.
If you’d like to use the traditional SageMaker Studio experience with Amazon Redshift, refer to Using the Amazon Redshift Data API to interact from an Amazon SageMaker Jupyter notebook.
In this post, we will run through the following steps:

1. Provision the required infrastructure (RStudio on SageMaker and Amazon Redshift Serverless) with an AWS CloudFormation stack.
2. Load the synthetic credit card transaction data into tables in Amazon Redshift.
3. Connect to the Amazon Redshift database from RStudio and explore the tables using the DBI, dbplyr, and dplyr packages.
4. Filter and clean the data in the database, then collect a modeling-ready dataset into local RAM.
5. Clean up the resources.
This process is depicted in the following solutions architecture:
Prior to getting started, ensure you have all requirements for setting up RStudio on Amazon SageMaker and Amazon Redshift Serverless, such as:
We will be using a CloudFormation stack to generate the required infrastructure.
Note: If you already have an RStudio domain and an Amazon Redshift cluster, you can skip this step.
Launching this stack creates the following resources:
This template is designed to work in a Region (for example, us-east-1 or us-west-2) with three Availability Zones, RStudio on SageMaker, and Amazon Redshift Serverless. Ensure your Region has access to those resources, or modify the templates accordingly.
Press the Launch Stack button to create the stack.
The template will generate five stacks.
Once the stack status is CREATE_COMPLETE, navigate to the Amazon Redshift Serverless console. This is a new capability that makes it easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying; there is no need to set up and manage clusters.
Note: The pattern demonstrated in this blog integrating Amazon Redshift and RStudio on Amazon SageMaker is the same regardless of the Amazon Redshift deployment pattern (serverless or traditional cluster).
The CloudFormation script created a database called `sagemaker`. Let’s populate this database with tables for the RStudio user to query. Create a SQL editor tab and be sure the `sagemaker` database is selected. We will be using the synthetic credit card transaction data to create tables in our database. This data is part of the SageMaker sample tabular datasets: s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions.
We are going to execute the following query in the query editor. This will generate three tables: cards, transactions, and users.
You can validate that the query ran successfully by seeing three tables within the left-hand pane of the query editor.
Once all of the tables are populated, navigate to SageMaker RStudio and start a new session with the RSession base image on an ml.m5.xlarge instance.
Once the session is launched, we will run this code to create a connection to our Amazon Redshift Serverless database.
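The snippet below is a minimal connection sketch rather than the exact code from the project repo. It uses the DBI and odbc packages and assumes an Amazon Redshift ODBC driver named "redshift" is available in the RSession image; the endpoint, user, and password values are placeholders that you would replace with your own Amazon Redshift Serverless workgroup details.

```r
library(DBI)
library(odbc)

# Placeholder connection details -- replace with your own Redshift Serverless
# workgroup endpoint and credentials. The driver name "redshift" is an
# assumption about the ODBC driver installed in the RSession image.
con <- dbConnect(
  odbc::odbc(),
  Driver   = "redshift",
  Server   = "default-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com",
  Port     = 5439,
  Database = "sagemaker",
  UID      = "admin",
  PWD      = Sys.getenv("REDSHIFT_PASSWORD")  # avoid hard-coding secrets
)
```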
In order to view the tables in the synthetic schema, you will need to grant access in Amazon Redshift via the query editor.
The RStudio Connections pane should show the `sagemaker` database with the `synthetic` schema and the tables `cards`, `transactions`, and `users`.
You can click the table icon next to the tables to view 1,000 records.
Note: We have created a pre-built R Markdown file containing all of the code blocks in this post, which can be found in the project GitHub repo.
Now let’s use the `DBI` package function `dbListTables()` to view existing tables.
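A minimal sketch, assuming the `con` connection object created earlier:

```r
# List the tables visible through the Redshift connection.
dbListTables(con)
```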
Use `dbGetQuery()` to pass a SQL query to the database.
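For example, the following query pulls a small sample of rows from the `synthetic.transactions` table created earlier:

```r
# Run SQL directly against Redshift and return the result as a data frame.
dbGetQuery(con, "SELECT * FROM synthetic.transactions LIMIT 10")
```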
We can also use the `dbplyr` and `dplyr` packages to execute queries in the database. Let’s `count()` how many transactions are in the transactions table. But first, we need to install these packages.
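A minimal installation and loading step might look like the following:

```r
# Install the database-backed dplyr tooling if it is not already available.
install.packages(c("dplyr", "dbplyr"))

library(dplyr)
library(dbplyr)
```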
Use the `tbl()` function while specifying the schema.
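For example, the sketch below creates lazy references to the three tables; `in_schema()` comes from `dbplyr`:

```r
# Lazy table references -- no data is pulled into R until we ask for it.
users_tbl        <- tbl(con, in_schema("synthetic", "users"))
cards_tbl        <- tbl(con, in_schema("synthetic", "cards"))
transactions_tbl <- tbl(con, in_schema("synthetic", "transactions"))
```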
Let’s run a count of the number of rows for each table.
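A sketch of the row counts, computed inside Redshift rather than in R:

```r
# count() is translated to SQL and executed in the database.
count(users_tbl)
count(cards_tbl)
count(transactions_tbl)
```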
So we have 2,000 users; 6,146 cards; and 24,386,900 transactions. We can also view the tables in the console by printing the lazy table reference, for example `transactions_tbl`.
We can also view what `dplyr` verbs are doing under the hood.
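For example, `show_query()` prints the SQL that `dbplyr` generates for a pipeline. This is a sketch; the `year` column name is an assumption based on the synthetic dataset.

```r
# Inspect the SQL generated for a dplyr pipeline instead of running it.
transactions_tbl %>%
  count(year) %>%
  show_query()
```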
Let’s visually explore the number of transactions by year.
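One way to do this, sketched below, is to aggregate in the database, collect the small summary, and plot it with `ggplot2` (the `year` column name is an assumption based on the dataset):

```r
library(ggplot2)

# Aggregate in Redshift, then bring only the summary rows into R.
transactions_by_year <- transactions_tbl %>%
  count(year) %>%
  collect()

ggplot(transactions_by_year, aes(x = year, y = n)) +
  geom_col() +
  labs(x = "Year", y = "Number of transactions")
```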
We can also summarize data in the database as follows:
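For instance, this sketch counts transactions by chip usage inside the database (the `use_chip` column name is an assumption based on the dataset):

```r
# group_by()/summarize() are translated to SQL and run in Redshift.
transactions_tbl %>%
  group_by(use_chip) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
```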
Suppose we want to view fraud using card information. We just need to join the tables and then group them by the attribute.
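A hedged sketch of such a join and grouping follows; the join keys and the `card_brand` and `is_fraud` column names are assumptions based on the synthetic dataset.

```r
# Join transactions to card details in the database, then count fraud
# occurrences per card brand.
transactions_tbl %>%
  inner_join(cards_tbl, by = c("user" = "user", "card" = "card_index")) %>%
  group_by(card_brand, is_fraud) %>%
  summarize(n = n()) %>%
  arrange(card_brand, desc(n))
```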
Now let’s prepare a dataset that could be used for machine learning. Let’s filter the transaction data to just include Discover credit cards while only keeping a subset of columns.
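A sketch of that filtering step, again with join keys and column names assumed from the dataset:

```r
# Keep only Discover card transactions and a handful of columns; this is
# still a lazy query that runs in Redshift.
discover_tbl <- transactions_tbl %>%
  inner_join(cards_tbl, by = c("user" = "user", "card" = "card_index")) %>%
  filter(card_brand == "Discover") %>%
  select(year, month, day, amount, use_chip, is_fraud)
```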
And now let’s do some cleaning using the following transformations:

- Convert `is_fraud` to a binary attribute
- Clean up `use_chip` and rename it to `type`

Now that we have filtered and cleaned our dataset, we are ready to collect this dataset into local RAM.
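A minimal sketch of those transformations followed by the collect step; the Yes/No coding of `is_fraud` and the contents of `use_chip` are assumptions based on the synthetic dataset.

```r
# mutate() still runs inside Redshift; collect() then pulls the cleaned
# result into local RAM as a regular data frame.
discover_df <- discover_tbl %>%
  mutate(
    is_fraud = ifelse(is_fraud == "Yes", 1L, 0L),  # binary attribute
    type     = tolower(use_chip)                   # cleaned, renamed copy
  ) %>%
  select(-use_chip) %>%
  collect()
```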
Now we have a working dataset to start creating features and fitting models. We will not cover those steps in this blog, but if you want to learn more about building models in RStudio on SageMaker refer to Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists.
To avoid incurring recurring costs, delete the root CloudFormation stack when you are finished. Also delete all EFS mounts created and any S3 buckets and objects created.
Data analysis and modeling can be challenging when working with large datasets in the cloud. Amazon Redshift is a popular data warehouse that can help users perform these tasks. RStudio, one of the most widely used integrated development environments (IDEs) for data analysis, is often used with the R language. In this blog post, we showed how to use Amazon Redshift and RStudio on SageMaker together to efficiently perform analysis on massive datasets. By using RStudio on SageMaker, users can take advantage of the fully managed infrastructure, access control, networking, and security capabilities of SageMaker, while also simplifying integration with Amazon Redshift. If you would like to learn more about using these two tools together, check out our other blog posts and resources. You can also try using RStudio on SageMaker and Amazon Redshift for yourself and see how they can help you with your data analysis and modeling tasks.
Please add your feedback to this blog, or create a pull request on the GitHub repo.