Prepare data from Amazon EMR for machine learning using Amazon SageMaker Data Wrangler

Data preparation is a principal component of machine learning (ML) pipelines. In fact, it is estimated that data professionals spend about 80 percent of their time on data preparation. In this intensive competitive market, teams want to analyze data and extract more meaningful insights quickly. Customers are adopting more efficient and visual ways to build data processing systems.

Amazon SageMaker Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select, clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. You can now also use Amazon EMR as a data source in Data Wrangler to easily prepare data for ML.

Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Data professionals such as data scientists want to leverage the power of Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation, but the learning curve is steep. Our customers wanted the ability to connect to Amazon EMR to run ad hoc SQL queries on Hive or Presto to query data in the internal metastore or external metastore (e.g., AWS Glue Data Catalog), and prepare data within a few clicks.

This blog article will discuss how customers can now find and connect to existing Amazon EMR clusters using a visual experience in SageMaker Data Wrangler. They can visually inspect the database, tables, schema, and Presto queries to prepare for modeling or reporting. They can then quickly profile data using a visual interface to assess data quality, identify abnormalities or missing or erroneous data, and receive information and recommendations on how to address these issues. Additionally, they can analyze, clean, and engineer features with the aid of more than a dozen additional built-in analyses and 300+ extra built-in transformations backed by Spark without writing a single line of code.

Solution overview

Data professionals can quickly find and connect to existing EMR clusters using SageMaker Studio configurations. Additionally, data professionals can terminate EMR clusters with only a few clicks from SageMaker Studio using predefined templates and on-demand creation of EMR clusters. With the help of these tools, customers may jump right into the SageMaker Studio universal notebook and write code in Apache Spark, Hive, Presto, or PySpark to perform data preparation at scale. Due to a steep learning curve for creating Spark code to prepare data, not all data professionals are comfortable with this procedure. With Amazon EMR as a data source for Amazon SageMaker Data Wrangler, you can now quickly and easily connect to Amazon EMR without writing a single line of code.

The following diagram represents the different components used in this solution.

We demonstrate two authentication options that can be used to establish a connection to the EMR cluster. For each option, we deploy a unique stack of AWS CloudFormation templates.

The CloudFormation template performs the following actions when each option is selected:

Creates a Studio Domain in VPC-only mode, along with a user profile named studio-user.
Creates building blocks, including the VPC, endpoints, subnets, security groups, EMR cluster, and other required resources to successfully run the examples.
For the EMR cluster, connects the AWS Glue Data Catalog as metastore for EMR Hive and Presto, creates a Hive table in EMR, and fills it with data from a US airport dataset.
For the LDAP CloudFormation template, creates an Amazon Elastic Compute Cloud (Amazon EC2) instance to host the LDAP server to authenticate the Hive and Presto LDAP user.

Option 1: Lightweight Access Directory Protocol

For the LDAP authentication CloudFormation template, we provision an Amazon EC2 instance with an LDAP server and configure the EMR cluster to use this server for authentication. This is TLS Enabled.

Option 2: No-Auth

In the No-Auth authentication CloudFormation template, we use a standard EMR cluster with no authentication enabled.

Deploy the resources with AWS CloudFormation

Complete the following steps to deploy the environment:

Sign in to the AWS Management Console as an AWS Identity and Access Management (IAM) user, preferably an admin user.
Choose Launch Stack to launch the CloudFormation template for the appropriate authentication scenario. Make sure the Region used to deploy the CloudFormation stack has no existing Studio Domain. If you already have a Studio Domain in a Region, you may choose a different Region.
- LDAP Launch Stack
- No Auth Launch Stack
Choose Next.
For Stack name, enter a name for the stack (for example, dw-emr-blog).
Leave the other values as default.
To continue, choose Next from the stack details page and stack options. The LDAP stack uses the following credentials:
- username: david
- password: welcome123
On the review page, select the check box to confirm that AWS CloudFormation might create resources.
Choose Create stack. Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes 10–15 minutes.

Note: If you would like to try multiple stacks, please follow the steps in the Clean up section. Remember that you must delete the SageMaker Studio Domain before the next stack can be successfully launched.

Set up the Amazon EMR as a data source in Data Wrangler

In this section, we cover connecting to the existing Amazon EMR cluster created through the CloudFormation template as a data source in Data Wrangler.

Create a new data flow

To create your data flow, complete the following steps:

On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
Choose Open studio.
In the Launcher, choose New data flow. Alternatively, on the File drop-down, choose New, then choose Data Wrangler flow.
Creating a new flow can take a few minutes. After the flow has been created, you see the Import data page.

Add Amazon EMR as a data source in Data Wrangler

On the Add data source menu, choose Amazon EMR.

You can browse all the EMR clusters that your Studio execution role has permissions to see. You have two options to connect to a cluster; one is through interactive UI, and the other is to first create a secret using AWS Secrets Manager with JDBC URL, including EMR cluster information, and then provide the stored AWS secret ARN in the UI to connect to Presto. In this blog, we follow the first option. Select one of the following clusters that you want to use. Click on Next, and select endpoints.

Select Presto, connect to Amazon EMR, create a name to identify your connection, and click Next.

Select Authentication type, either LDAP or No Authentication, and click Connect.

For Lightweight Directory Access Protocol (LDAP), provide username and password to be authenticated.

For No Authentication, you will be connected to EMR Presto without providing user credentials within VPC. Enter Data Wrangler’s SQL explorer page for EMR.

Once connected, you can interactively view a database tree and table preview or schema. You can also query, explore, and visualize data from EMR. For preview, you would see a limit of 100 records by default. For customized query, you can provide SQL statements in the query editor box and once you click the Run button, the query will be executed on EMR’s Presto engine.

The Cancel query button allows ongoing queries to be canceled if they are taking an unusually long time.

The last step is to import. Once you are ready with the queried data, you have options to update the sampling settings for the data selection according to the sampling type (FirstK, Random, or Stratified) and sampling size for importing data into Data Wrangler.

Click Import. The prepare page will be loaded, allowing you to add various transformations and essential analysis to the dataset.

Navigate to DataFlow from the top screen and add more steps to the flow as needed for transformations and analysis. You can run a data insight report to identify data quality issues and get recommendations to fix those issues. Let’s look at some example transforms.

Go to your dataflow, and this is the screen that you should see. It shows us that we are using EMR as a data source using the Presto connector.

Let’s click on the + button to the right of Data types and select Add transform. When you do that, the following screen should pop up:

Let’s explore the data. We see that it has multiple features such as iata_code, airport, city, state, country, latitude, and longitude. We can see that the entire dataset is based in one country, which is the US, and there are missing values in Latitude and Longitude. Missing data can cause bias in the estimation of parameters, and it can reduce the representativeness of the samples, so we need to perform some imputation and handle missing values in our dataset.

Let’s click on the Add Step button on the navigation bar to the right. Select Handle missing. The configurations can be seen in the following screenshots. Under Transform, select Impute. Select the column type as Numeric and column names Latitude and Longitude. We will be imputing the missing values using an approximate median value. Preview and add the transform.

Let us now look at another example transform. When building a machine learning model, columns are removed if they are redundant or don’t help your model. The most common way to remove a column is to drop it. In our dataset, the feature country can be dropped since the dataset is specifically for US airport data. Let’s see how we can manage columns. Let’s click on the Add step button on the navigation bar to the right. Select Manage columns. The configurations can be seen in the following screenshots. Under Transform, select Drop column, and under Columns to drop, select Country.

You can continue adding steps based on the different transformations required for your dataset. Let us go back to our data flow. You will now see two more blocks showing the transforms that we performed. In our scenario, you can see Impute and Drop column.

ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project will lead to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work. To avoid the reprocessing of features, we will now export our transformed features to Amazon Feature Store. Let’s click on the + button to the right of Drop column. Select Export to and choose Sagemaker Feature Store (via Jupyter notebook).

You can easily export your generated features to SageMaker Feature Store by selecting it as the destination. You can save the features into an existing feature group or create a new one.

We have now created features with Data Wrangler and easily stored those features in Feature Store. We showed an example workflow for feature engineering in the Data Wrangler UI. Then we saved those features into Feature Store directly from Data Wrangler by creating a new feature group. Finally, we ran a processing job to ingest those features into Feature Store. Data Wrangler and Feature Store together helped us build automatic and repeatable processes to streamline our data preparation tasks with minimum coding required. Data Wrangler also provides us ﬂexibility to automate the same data preparation ﬂow using scheduled jobs. We can also automate training or feature engineering with SageMaker Pipelines (via Jupyter Notebook) and deploy to the Inference endpoint with SageMaker inference pipeline (via Jupyter Notebook).

Clean up

If your work with Data Wrangler is complete, select the stack created from the CloudFormation page and delete it to avoid incurring additional fees.

Conclusion

In this post, we went over how to set up Amazon EMR as a data source in Data Wrangler, how to transform and analyze a dataset, and how to export the results to a data flow for use in a Jupyter notebook. After visualizing our dataset using Data Wrangler’s built-in analytical features, we further enhanced our data flow. The fact that we created a data preparation pipeline without writing a single line of code is significant.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.

About the authors

Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.

Rui Jiang is a Software Development Engineer at AWS based in the New York City area. She is a member of the SageMaker Data Wrangler team helping develop engineering solutions for AWS enterprise customers to achieve their business needs. Outside of work, she enjoys exploring new foods, life fitness, outdoor activities, and traveling.