Exploratory data analysis (EDA) is a common task performed by business analysts to discover patterns, understand relationships, validate assumptions, and identify anomalies in their data. In machine learning (ML), it’s important to first understand the data and its relationships before getting into model building. Traditional ML development cycles can sometimes take months and require advanced data science and ML engineering skills, whereas no-code ML solutions can help companies accelerate the delivery of ML solutions to days or even hours.
Amazon SageMaker Canvas is a no-code ML tool that helps business analysts generate accurate ML predictions without writing code or needing any ML experience. Canvas provides an easy-to-use visual interface to load, cleanse, and transform datasets, and then build ML models and generate accurate predictions.
In this post, we walk through how to perform EDA to gain a better understanding of your data before building your ML model, thanks to Canvas’ built-in advanced visualizations. These visualizations help you analyze the relationships between features in your datasets and comprehend your data better. This is done intuitively, with the ability to interact with the data and discover insights that may go unnoticed with ad hoc querying. They can be created quickly through the ‘Data visualizer’ within Canvas prior to building and training ML models.
These visualizations add to the range of capabilities for data preparation and exploration already offered by Canvas, including the ability to correct missing values and replace outliers; filter, join, and modify datasets; and extract specific time values from timestamps. To learn more about how Canvas can help you cleanse, transform, and prepare your dataset, check out Prepare data with advanced transformations.
For our use case, we look at why customers churn and illustrate how EDA can help from an analyst’s viewpoint. The dataset we use in this post is a synthetic dataset from a telecommunications mobile phone carrier for customer churn prediction that you can download (churn.csv), or you can bring your own dataset to experiment with. For instructions on importing your own dataset, refer to Importing data in Amazon SageMaker Canvas.
Follow the instructions in Prerequisites for setting up Amazon SageMaker Canvas before you proceed further.
To import the sample dataset to Canvas, complete the following steps:
The following screenshot shows our preview.
The model preview indicates that the model predicts the correct target (churn?) 95.6% of the time. You can also see the initial column impact (influence each column has on the target column). Let’s do some data exploration, visualization, and transformation, and then proceed to build a model.
Canvas already provides some common basic visualizations, such as data distribution in a grid view on the Build tab. These are great for getting a high-level overview of the data, understanding how the data is distributed, and getting a summary overview of the dataset.
As a business analyst, you may need high-level insights into how the data is distributed and how that distribution relates to the target column (churn), so that you understand the relationships in the data before building the model. You can now choose Grid view to get an overview of the data distribution.
The following screenshot shows the overview of the distribution of the dataset.
We can make the following observations:
Let’s go deeper and check out the advanced visualizations available in Canvas.
As a business analyst, you want to see whether there are relationships between data elements, and how they’re related to churn. With Canvas, you can explore and visualize your data, which helps you gain advanced insights before building your ML models. You can visualize using scatter plots, bar charts, and box plots, which can help you understand your data and discover relationships between features that could affect model accuracy.
To start creating your visualizations, complete the following steps:
A key accelerator of visualization in Canvas is the Data visualizer. Let’s change the sample size to get a better perspective.
You may want to change the sample size based on your dataset. If you have a few hundred to a few thousand rows, you can select the entire dataset. If you have many thousands of rows, you may instead select a few hundred or a few thousand rows, depending on your use case.
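Canvas handles sampling for you in the Data visualizer. If you want to mirror the same idea outside Canvas, the rule of thumb above can be sketched in plain Python. This is only an illustration: the function name `sample_rows`, the cap of 1,000 rows, and the stand-in data are assumptions, not Canvas settings.

```python
import random

def sample_rows(rows, max_rows, seed=0):
    """Return every row for a small dataset, else a reproducible random sample."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(rows, max_rows)

# Stand-in data: integers play the role of dataset rows (hypothetical).
rows = list(range(5000))

small = sample_rows(rows[:300], max_rows=1000)  # small dataset: keep everything
large = sample_rows(rows, max_rows=1000)        # large dataset: sample down

print(len(small), len(large))
```

A fixed seed keeps the sample stable between runs, which makes it easier to compare charts drawn at different times.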
A scatter plot shows the relationship between two quantitative variables measured for the same individuals. In our case, it’s important to understand the relationship between values to check for correlation.
Because we have Calls, Mins, and Charge, we will plot the correlation between them for Day, Evening, and Night.
First, let’s create a scatter plot between Day Charge vs. Day Mins.
We can observe that as Day Mins increases, Day Charge also increases.
The same applies for evening calls.
Night calls also have the same pattern.
Because mins and charge seem to increase linearly, you can observe that they have a high correlation with one another. Including these feature pairs in some ML algorithms can take additional storage and reduce the speed of training, and having similar information in more than one column might lead to the model overemphasizing the impacts and lead to undesired bias in the model. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
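To put a number on what the scatter plots suggest, you could compute a Pearson correlation coefficient for each feature pair. The following is a minimal sketch in plain Python, not something Canvas runs or exposes. The values are made up rather than taken from churn.csv, the flat per-minute rate is an assumption used to generate them, and the 0.95 cutoff is a hypothetical threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, not rows from churn.csv.
day_mins = [110.0, 150.5, 180.2, 210.7, 265.1, 299.4]
day_charge = [round(m * 0.17, 2) for m in day_mins]  # assumed flat per-minute rate
day_calls = [95, 120, 88, 101, 110, 97]

r_charge = pearson(day_mins, day_charge)
r_calls = pearson(day_mins, day_calls)

THRESHOLD = 0.95  # hypothetical cutoff for "highly correlated"
redundant = r_charge > THRESHOLD

print(f"mins vs charge: {r_charge:.3f}, mins vs calls: {r_calls:.3f}")
print("charge is redundant with mins:", redundant)
```

A coefficient close to 1 for a Mins and Charge pair is the numeric counterpart of the straight line in the scatter plot, and it supports keeping only one column from each pair.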
A bar chart plots a categorical variable on the x-axis against a numerical variable on the y-axis to explore the relationship between the two. Let’s create a bar chart to see how the calls are distributed across our target column Churn for True and False. Choose Bar chart and drag and drop day calls and churn to the y-axis and x-axis, respectively.
Now, let’s create the same bar chart for evening calls vs. churn.
Next, let’s create a bar chart for night calls vs. churn.
It looks like there is a difference in behavior between customers who have churned and those who haven’t.
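If you want to check the comparison behind these bar charts by hand, one option is to group a numeric column by the churn label and aggregate each group. The sketch below assumes each bar represents a per-class average; the aggregation Canvas applies may differ. The rows, the field names `day_calls` and `churn`, and the helper `mean_by_class` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_by_class(rows, value_key, class_key):
    """Average the value_key field separately for each class_key label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row[value_key])
    return {label: mean(values) for label, values in groups.items()}

# Made-up rows with hypothetical field names, not taken from churn.csv.
rows = [
    {"day_calls": 100, "churn": False},
    {"day_calls": 90, "churn": False},
    {"day_calls": 110, "churn": False},
    {"day_calls": 120, "churn": True},
    {"day_calls": 130, "churn": True},
]

averages = mean_by_class(rows, "day_calls", "churn")
print(averages)
```

Running the same helper with an evening or night calls field gives the numbers behind the other two charts.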
Box plots are useful because they show differences in behavior of data by class (churn or not). Because we’re going to predict churn (target column), let’s create a box plot of some features against our target column to infer descriptive statistics on the dataset such as mean, max, min, median, and outliers.
Choose Box plot and drag and drop Day mins and Churn to the y-axis and x-axis, respectively.
You can also try the same approach to other columns against our target column (churn).
Let’s now create a box plot of day mins against customer service calls to understand how customer service calls are spread across day mins values. You can see that customer service calls show no dependency on, or correlation with, the day mins value.
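The statistics that a box plot summarizes can also be computed directly. The sketch below uses Python’s standard `statistics.quantiles` with the inclusive method and the common 1.5 × IQR rule to flag outliers. Whether Canvas uses the same quartile method and whisker rule is an assumption, and the minute values and the helper `box_stats` are made up for illustration.

```python
from statistics import quantiles

def box_stats(values):
    """Return min, quartiles, max, and 1.5 * IQR outliers for a numeric list."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }

# Made-up day-minute values for one class, not rows from churn.csv.
day_mins = [120.0, 135.0, 150.0, 160.0, 175.0, 180.0, 190.0, 205.0, 220.0, 350.0]

stats = box_stats(day_mins)
print(stats)
```

Calling `box_stats` once per churn class gives the side-by-side comparison that the box plot shows visually.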
From our observations, we can determine that the dataset is fairly balanced. We want the data to be evenly distributed across true and false values so that the model isn’t biased towards one value.
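Class balance is simple to check numerically: count each target value and divide by the total number of rows. The following sketch uses made-up labels and a hypothetical helper `class_shares`; the actual split in churn.csv is not reproduced here.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of rows that fall into each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Made-up target labels, not taken from churn.csv.
churn_labels = [True, False, False, True, False, True, False, False, True, False]

shares = class_shares(churn_labels)
print(shares)
```

The closer the shares are to each other, the less likely the model is to favor one class simply because it is more common.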
Based on our observations, we drop the Phone column because it is just an account number, and the Day Charge, Eve Charge, and Night Charge columns because they contain information that overlaps with the Mins columns. We can run a preview again to confirm.
After the data analysis and transformation, let’s preview the model again.
You can observe that the model’s estimated accuracy changed from 95.6% to 93.6% (this could vary). However, the column impact (feature importance) for specific columns has changed considerably. Dropping the redundant columns speeds up training and gives a clearer picture of each remaining column’s influence on the prediction as we move to the next steps of model building. Our dataset doesn’t require additional transformation, but if yours does, you can take advantage of ML data transforms to clean, transform, and prepare your data for model building.
You can now proceed to build a model and analyze results. For more information, refer to Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.
To avoid incurring future session charges, log out of Canvas.
In this post, we showed how you can use Canvas visualization capabilities for EDA to better understand your data before model building, create accurate ML models, and generate predictions using a no-code, visual, point-and-click interface.