In the modern, cloud-centric business landscape, data is often scattered across numerous clouds and on-site systems. This fragmentation can complicate efforts by organizations to consolidate and analyze data for their machine learning (ML) initiatives.
This post presents an architectural approach to extract data from different cloud environments, such as Google Cloud Platform (GCP) BigQuery, without the need for data movement. This minimizes the complexity and overhead associated with moving data between cloud environments, enabling organizations to access and utilize their disparate data assets for ML projects.
We highlight the process of using Amazon Athena Federated Query to extract data from GCP BigQuery, using Amazon SageMaker Data Wrangler to perform data preparation, and then using the prepared data to build ML models within Amazon SageMaker Canvas, a no-code ML interface.
SageMaker Canvas allows business analysts to access and import data from over 50 sources, prepare data using natural language and over 300 built-in transforms, build and train highly accurate models, generate predictions, and deploy models to production without requiring coding or extensive ML experience.
The solution outlines two main steps:
After the data is imported into SageMaker Canvas, you can use the no-code interface to build ML models and generate predictions based on the imported data.
You can use SageMaker Canvas to build the initial data preparation routine and generate accurate predictions without writing code. However, as your ML needs evolve or require more advanced customization, you may want to transition from a no-code environment to a code-first approach. The integration between SageMaker Canvas and Amazon SageMaker Studio allows you to operationalize the data preparation routine for production-scale deployments. For more details, refer to Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio
The overall architecture, as seen below, demonstrates how to use AWS services to seamlessly access and integrate data from a GCP BigQuery data warehouse into SageMaker Canvas for building and deploying ML models.
The workflow includes the following steps:
This solution offers the following benefits:
In the next sections, we dive deeper into the technical implementation details and walk through a step-by-step demonstration of this solution.
The steps outlined in this post provide an example of how to import data into SageMaker Canvas for no-code ML. In this example, we demonstrate how to import data through Athena from GCP BigQuery.
For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The Churn column in the dataset indicates whether the customer left service (true/false). This Churn attribute is the target variable that the ML model should aim to predict.
The following screenshot shows an example of the dataset on the BigQuery console.
Complete the following prerequisite steps:
glue:GetDatabase
and athena:GetDataCatalog
permission on the resource. See the following example: Complete the following steps to set up the Athena data source connector:
You’re returned to the Enter data source details page.
For more information about tags, see Tagging Athena resources.
The Data source details section of the page for your data source shows information about your new connector. You can now use the connector in your Athena queries. For information about using data connectors in queries, see Running federated queries.
To query from Athena, launch the Athena SQL editor and choose the data source you created. You should be able to run live queries against the BigQuery database.
To import data from Athena, complete the following steps:
SageMaker Data Wrangler in SageMaker Canvas allows you to prepare, featurize, and analyze your data. You can integrate a SageMaker Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.
In the preceding query, bigquery
is the data source name created in Athena, athenabigquery
is the database name, and customer_churn
is the table name.
When working with ML, it’s crucial to randomize or shuffle the dataset. This step is essential because you may have access to millions or billions of data points, but you don’t necessarily need to use the entire dataset for training the model. Instead, you can limit the data to a smaller subset specifically for training purposes. After you’ve shuffled and prepared the data, you can begin the iterative process of data preparation, feature evaluation, model training, and ultimately hosting the trained model.
The data is imported into SageMaker Canvas as a dataset from the specific table in Athena. You can now use this dataset to create a model.
After your data is imported, it shows up on the Datasets page in SageMaker Canvas. At this stage, you can build a model. To do so, complete the following steps:
my_first_model
).SageMaker Canvas enables you to create models for predictive analysis, image analysis, and text analysis.
On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mode of the data.
churn
).SageMaker Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 30 minutes–2 hours.
After the model is trained, you can analyze the model accuracy.
The Overview tab shows us the column impact, or the estimated importance of each column in predicting the target column. In this example, the Night_calls
column has the most significant impact in predicting if a customer will churn. This information can help the marketing team gain insights that lead to taking actions to reduce customer churn. For example, we can see that both low and high CustServ_Calls
increase the likelihood of churn. The marketing team can take actions to help prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.
On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps to generate a batch prediction:
SageMaker Canvas allows you to generate batch predictions either manually or automatically on a schedule. To learn how to automate batch predictions on a schedule, refer to Manage automations.
After a few seconds, the prediction is complete, and you can choose View to see the prediction.
Optionally, choose Download to download a CSV file containing the full output. SageMaker Canvas will return a prediction for each row of data and the probability of the prediction being correct.
Optionally, you can deploy your models to an endpoint to make predictions. For more information, refer to Deploy your models to an endpoint.
To avoid future charges, log out of SageMaker Canvas.
In this post, we showcased a solution to extract the data from BigQuery using Athena federated queries and a sample dataset. We then used the extracted data to build an ML model using SageMaker Canvas to predict customers at risk of churning—without writing code. SageMaker Canvas enables business analysts to build and deploy ML models effortlessly through its no-code interface, democratizing ML across the organization. This enables you to harness the power of advanced analytics and ML to drive business insights and innovation, without the need for specialized technical skills.
For more information, see Query any data source with Amazon Athena’s new federated query and Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas. If you’re new to SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.
This post is co-written with Steven Craig from Hearst. To maintain their competitive edge, organizations…
Conspiracy theories about missing votes—which are not, in fact, missing—and something being “not right” are…
Researchers have developed AI-driven mobile robots that can carry out chemical synthesis research with extraordinary…
In recent years, roboticists have introduced robotic systems that can complete missions in various environments,…
Overwhelmed by manual tasks and data overload? Streamline your business and boost revenue with the…
In real life, the machine learning model is not a standalone object that only produces…