Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 federated query data sources supported by Amazon Athena.
Starting today, when importing data from Athena data sources, you can configure the S3 query output location and data retention period to import data in Data Wrangler to control where and how long Athena stores the intermediary data. In this post, we walk you through this new feature.
Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog, and analyze data in Amazon S3 and 26 federated query data sources using standard SQL. When you use Athena to import data, you can use Data Wrangler’s default S3 location for the Athena query output, or specify an Athena workgroup to enforce a custom S3 location. Previously, you had to implement cleanup workflows to remove this intermediary data, or manually set up S3 lifecycle configuration to control storage cost and meet your organization’s data security requirements. This is a big operational overhead, and not scalable.
Data Wrangler now supports custom S3 locations and data retention periods for your Athena query output. With this new feature, you can change the Athena query output location to a custom S3 bucket. You now have a default data retention policy of 5 days for the Athena query output, and you can change this to meet your organization’s data security requirements. Based on the retention period, the Athena query output in the S3 bucket gets cleaned up automatically. After you import the data, you can perform exploratory data analysis on this dataset and store the clean data back to Amazon S3.
The following diagram illustrates this architecture.
For our use case, we use a sample bank dataset to walk through the solution. The workflow consists of the following steps:
For simplicity, we assume that you have already set up the Athena environment (steps 1–3). We detail the subsequent steps in this post.
To set up the Athena environment, refer to the User Guide for step-by-step instructions, and complete steps 1–3 as outlined in the previous section.
To import your data, complete the following steps:
s3://sagemaker-{region}-{account_id}/athena/
with a retention period of 5 days.You need s3:GetLifecycleConfiguration
and s3:PutLifecycleConfiguration
for your SageMaker execution role to correctly apply the lifecycle configuration policies. Without these permissions, you get error messages when you try to import the data.
The following error message is an example of missing the GetLifecycleConfiguration
permission.
The following error message is an example of missing the PutLifecycleConfiguration
permission.
After you load the data in to Data Wrangler, you can do exploratory data analysis (EDA) and prepare the data for machine learning.
bank-data
dataset in the data flow, and choose Add analysis.At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.
You can now start building your transforms and analyses based on your business requirements.
To avoid ongoing costs, delete the Data Wrangler resources using the steps below when you’re finished.
sagemaker-data-wrangler-1.0 app
.In this post, we provided an overview of customizing your S3 location and enabling S3 lifecycle configurations for importing data from Athena to Data Wrangler. With this feature, you can store intermediary data in a secured S3 location, and automatically remove the data copy after the retention period to reduce the risk for unauthorized access to data. We encourage you to try out this new feature. Happy building!
To learn more about Athena and SageMaker, visit the Athena User Guide and Amazon SageMaker Documentation.
Jasper Research Lab’s new shadow generation research and model enable brands to create more photorealistic…
We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini…
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response…
This post is co-written with Martin Holste from Trellix. Security teams are dealing with an…
As AI continues to unlock new opportunities for business growth and societal benefits, we’re working…
An internal email obtained by WIRED shows that NOAA workers received orders to pause “ALL…