Amazon SageMaker Studio Lab is a free machine learning (ML) development environment based on open-source JupyterLab for anyone to learn and experiment with ML using AWS ML compute resources. It’s based on the same architecture and user interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.
When you begin working on ML initiatives, you need to perform exploratory data analysis (EDA) or data preparation before proceeding with model building. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes.
A key accelerator of feature preparation in Data Wrangler is the Data Quality and Insights Report. This report checks data quality and helps detect abnormalities in your data, so that you can perform the required data engineering to fix your dataset. You can use the Data Quality and Insights Report to perform an analysis of your data to gain insights into your dataset such as the number of missing values and number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention and help you identify the data preparation steps you need to perform.
Studio Lab users can benefit from Data Wrangler because data quality and feature engineering are critical for the predictive performance of your model. Data Wrangler helps with data quality and feature engineering by giving insights into data quality issues and easily enabling rapid feature iteration and engineering using a low-code UI.
In this post, we show you how to perform exploratory data analysis, prepare and transform data using Data Wrangler, and export the transformed and prepared data to Studio Lab to carry out model building.
The solution includes the following high-level steps:
The following diagram illustrates this workflow.
To use Data Wrangler and Studio Lab, you need the following prerequisites:
sagemaker-{region}-{account_id}
), or create your own S3 bucket.To get started, complete the following steps:
Let’s use the Data Quality and Insights Report to perform an analysis of the data that we imported into Data Wrangler. You can use the report to understand what steps you need to take to clean and process your data. This report provides information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.
You’re presented with a detailed report that you can review and download. The report includes several sections such as quick model, feature summary, feature correlation, and data insights. The following screenshots provide examples of these sections.
From the report, we can make the following observations:
State
column appears to be quite evenly distributed, so the data is balanced in terms of state population.Phone
column presents too many unique values to be of any practical use. Too many unique values make this column not useful. We can drop the Phone
column in our transformation.Mins
and Charge
are highly correlated. We can remove one of them.Based on our observations, we want to make the following transformations:
Phone
column because it has many unique values.Day Charge
from the pair with Day Mins
, Night Charge
from the pair with Night Mins
, and Intl Charge
from the pair with Intl Mins
.True
or False
in the Churn
column to be a numerical value of 1 or 0.Phone
, Day Charge
, Eve Charge
, Night Charge
, and Intl Charge
.Churn?
column.Churn?
column.Now True
and False
are converted to 1 and 0, respectively.
Now that we have a good understand of the data and have prepared and transformed the data for model building, we can move the data to Studio Lab for model building.
To start using the data in Studio Lab, complete the following steps:
churn
to confirm the dataset is correct.Now that you have the processed dataset in Studio Lab, you can carry out further steps required for model building.
You can perform all the steps in this post for EDA or data preparation within Data Wrangler and pay for the simple instance, jobs, and storage pricing based on usage or consumption. No upfront or licensing fees are required.
When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. To avoid losing work, save your data flow before shutting Data Wrangler down.
sagemaker-data-wrangler-1.0 app
.Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.
After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.
In this post, we saw how you can gain insights into your dataset, perform exploratory data analysis, prepare and transform data using Data Wrangler within Studio, and export the transformed and prepared data to Studio Lab and carry out model building and other steps.
With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.
Our next iteration of the FSF sets out stronger security protocols on the path to…
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this…
Generative AI has revolutionized technology through generating content and solving complex problems. To fully take…
At Google Cloud, we're deeply invested in making AI helpful to organizations everywhere — not…
Advanced Micro Devices reported revenue of $7.658 billion for the fourth quarter, up 24% from…