Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models. Studio comes with built-in integration with Amazon EMR so that data scientists can interactively prepare data at petabyte scale using open-source frameworks such as Apache Spark, Hive, and Presto right from within Studio notebooks. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism. We’re excited to announce that Studio now supports applying this fine-grained data access control with Lake Formation when accessing data through Amazon EMR.
Until now, when you ran multiple data processing jobs on an EMR cluster, all the jobs used the same AWS Identity and Access Management (IAM) role for accessing data—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, to run jobs that needed access to different data sources such as different Amazon Simple Storage Service (Amazon S3) buckets, you had to configure the EC2 instance profile with policies that allowed access to the union of all such data sources. Additionally, for enabling groups of users with differential access to data, you had to create multiple separate clusters, one for each group, resulting in operational overheads. Separately, jobs submitted to Amazon EMR from Studio notebooks were unable to apply fine-grained data access control with Lake Formation.
Starting with the release of Amazon EMR 6.9, when you connect to EMR clusters from Studio notebooks, you can visually browse and choose an IAM role on the fly called the runtime IAM role. Subsequently, all your Apache Spark, Apache Hive, or Presto jobs created from Studio notebooks will access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce table-level and column-level access using policies attached to the runtime role.
With this new capability, multiple Studio users can connect to the same EMR cluster, each using a runtime IAM role scoped with permissions matching their individual level of access to data. Their user sessions are also completely isolated from one another on the shared cluster. With this ability to control fine-grained access to data on the same shared cluster, you can simplify provisioning of EMR clusters, thereby reducing operational overhead and saving costs.
In this post, we demonstrate how to use a Studio notebook to connect to an EMR cluster using runtime roles. We provide a sample Studio Lifecycle Configuration that can help configure the EMR runtime roles that a Studio user profile has access to. Additionally, we manage data access in a data lake via Lake Formation by enforcing row-level and column-level permissions to the EMR runtime roles.
We demonstrate this solution with an end-to-end use case using a sample dataset, the TPC data model. This data represents transaction data for products and includes information such as customer demographics, inventory, web sales, and promotions. To demonstrate fine-grained data access permissions, we consider the following two users:
The architecture is implemented as follows:
The following diagram illustrates this architecture.
The following sections walk through the steps required to enable runtime IAM roles for Amazon EMR integration with an existing Studio domain. You can use the provided AWS CloudFormation stack in the Deploy the solution section below to set up the architectural components for this solution.
Before you get started, make sure you have the following prerequisites:
The EMR cluster should be created with IAM runtime roles enabled. For more details on using runtime roles with Amazon EMR, see Configure runtime roles for Amazon EMR steps. Associating runtime roles with EMR clusters is supported in Amazon EMR 6.9. Make sure the following configuration is in place:
You can optionally choose to pass the SourceIdentity
(the Studio user profile name) for monitoring the user resource access. Follow the steps outlined in Monitoring user resource access from Amazon SageMaker Studio to enable SourceIdentity
for your Studio domain.
Finally, refer to Prepare Data using Amazon EMR for detailed setup and networking instructions on integrating Studio with EMR clusters.
You need to run a bootstrap action on the cluster to ensure Studio notebook’s connectivity with EMR through runtime roles. Complete the following steps:
s3://emr-data-access-control-<region>/customer-bootstrap-actions/gcsc/replace-rpms.sh
, replacing region with your region
s3://emr-data-access-control-<region>/customer-bootstrap-actions/gcsc/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm
--bootstrap-actions "Path=<S3-URI-of-the-bootstrap-script>,Args=[<S3-URI-of-the-RPM-file>]"
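As a minimal sketch, the bootstrap argument above can be assembled for a given Region as follows. The function name `bootstrap_action_arg` is hypothetical; the two S3 paths are the ones listed in the steps, with `<region>` substituted.

```python
def bootstrap_action_arg(region: str) -> str:
    """Build the value passed to --bootstrap-actions for the given Region.

    Only the <region> placeholder is substituted; the bucket layout and
    file names are copied from the steps in this post.
    """
    base = f"s3://emr-data-access-control-{region}/customer-bootstrap-actions/gcsc"
    script_uri = f"{base}/replace-rpms.sh"
    rpm_uri = f"{base}/emr-secret-agent-1.18.0-SNAPSHOT20221121212949.noarch.rpm"
    # The RPM location is passed to the bootstrap script as its only argument.
    return f"Path={script_uri},Args=[{rpm_uri}]"
```

You would pass the returned string as the value of `--bootstrap-actions` when creating the cluster.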
Your Studio user’s execution role needs to be updated to allow the GetClusterSessionCredentials
API action. Add the following policy to the Studio execution role, replacing the resource with the cluster ARNs you wish to allow your users to connect to:
You can also use conditions to control which EMR execution roles can be used by the Studio execution role.
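The following sketch shows one way such a policy document might be assembled. The helper name `runtime_role_policy` and the `Sid` value are hypothetical; the action name comes from the text above, and the `elasticmapreduce:ExecutionRoleArn` condition key is assumed to be the key used to restrict which EMR execution roles may be passed.

```python
import json


def runtime_role_policy(cluster_arns, execution_role_arns=None):
    """Return an IAM policy document (as a dict) for the Studio execution role.

    cluster_arns: ARNs of the EMR clusters users may connect to.
    execution_role_arns: optional list of runtime roles users may pass;
        when given, it is enforced through a StringEquals condition.
    """
    statement = {
        "Sid": "AllowEmrRuntimeRole",  # illustrative Sid
        "Effect": "Allow",
        "Action": "elasticmapreduce:GetClusterSessionCredentials",
        "Resource": list(cluster_arns),
    }
    if execution_role_arns:
        # Assumed condition key for limiting the allowed runtime roles.
        statement["Condition"] = {
            "StringEquals": {
                "elasticmapreduce:ExecutionRoleArn": list(execution_role_arns)
            }
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


def to_json(policy: dict) -> str:
    """Serialize the policy for pasting into the IAM console or a template."""
    return json.dumps(policy, indent=2)
```

Calling `to_json(runtime_role_policy([...cluster ARNs...], [...role ARNs...]))` yields a JSON document you can review before attaching it to the Studio execution role.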
Alternatively, you can attach a policy such as the following, which restricts access to clusters based on resource tags. This allows for tag-based access control, and you can reuse the same policy statements across user roles instead of explicitly listing cluster ARNs.
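A tag-based variant might look like the sketch below. The helper name, the `Sid`, and the example tag key and value are hypothetical; the `elasticmapreduce:ResourceTag/<key>` condition key is assumed to be the mechanism for matching cluster tags.

```python
def tag_based_runtime_role_policy(tag_key: str, tag_value: str) -> dict:
    """Return a policy dict allowing connections to any cluster with a matching tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEmrRuntimeRoleByTag",  # illustrative Sid
                "Effect": "Allow",
                "Action": "elasticmapreduce:GetClusterSessionCredentials",
                # No cluster ARNs are listed; the condition does the scoping.
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        f"elasticmapreduce:ResourceTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }
```

For example, `tag_based_runtime_role_policy("team", "marketing")` would apply to every cluster tagged `team=marketing`, so new clusters need only the right tag rather than a policy update.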
By default, the Studio UI uses the Studio execution role to connect to the EMR cluster. If your user can access multiple roles, they can update the EMR cluster connection commands with the role ARN they want to pass as a runtime role. For a better user experience, you can set up a configuration file on the user’s home directory on Amazon Elastic File System (Amazon EFS), which automatically informs the Studio UI of the roles that are available to connect for the user. You can also automate this process through Studio Lifecycle Configurations. We provide the following sample Lifecycle Configuration script to configure the roles:
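As a rough illustration of what such a configuration step does, the sketch below writes a JSON file listing the available runtime roles into a user's home directory. It is written in Python rather than as a shell Lifecycle Configuration script, and the directory name, file name, and JSON key used as defaults are assumptions that should be checked against the SageMaker documentation before use.

```python
import json
from pathlib import Path


def write_emr_role_config(
    home_dir,
    account_id,
    role_arns,
    # The three defaults below are assumptions, not confirmed by this post.
    config_dir_name=".sagemaker-analytics-configuration-DO_NOT_DELETE",
    file_name="emr-configurations-DO_NOT_DELETE.json",
    key="emr-execution-role-arns",
):
    """Write the list of runtime role ARNs for one account and return the file path."""
    config_dir = Path(home_dir) / config_dir_name
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / file_name
    # Roles are grouped by the AWS account ID that owns them.
    payload = {key: {account_id: list(role_arns)}}
    config_path.write_text(json.dumps(payload, indent=2))
    return config_path
```

Because the file lives on the user's Amazon EFS home directory, it persists across app restarts; running the equivalent logic from a Lifecycle Configuration keeps it in sync with the roles you intend to expose.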
After the role and Lifecycle Configuration scripts are set up, you can launch the Studio UI and connect to the clusters when you create a new notebook using any of the following kernels:
Note: The Studio UI for connecting to EMR clusters using runtime roles works only on JupyterLab version 3. See Jupyter versioning for details on upgrading to JL3.
To test out the solution end to end, we provide a CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:
Marketing-data-access-role
Sales-data-access-role
Electronics-data-access-role
To deploy the solution, complete the following steps:
CN=*.ec2.internal
, as specified in the documentation here. Make sure to upload the zip file to an S3 bucket in the same Region as your CloudFormation stack deployment.
Once the stack is created, allow Amazon EMR to query Lake Formation by updating the External Data Filtering settings in Lake Formation. Follow the instructions provided in the Lake Formation guide here, choose ‘Amazon EMR’ for Session tag values, and enter your AWS account ID under AWS account IDs.
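If you prefer to script the External Data Filtering change instead of using the console, the sketch below shows how the settings might be merged. The helper name is hypothetical, and the field names (`AllowExternalDataFiltering`, `ExternalDataFilteringAllowList`, `AuthorizedSessionTagValueList`) are assumed to correspond to the console fields described above; the function only transforms a settings dictionary and makes no AWS calls.

```python
import copy


def enable_emr_external_filtering(settings: dict, account_id: str) -> dict:
    """Return a copy of Lake Formation data lake settings with EMR filtering enabled.

    Existing entries are preserved, so the result can be written back
    without dropping administrators or other settings.
    """
    updated = copy.deepcopy(settings)
    updated["AllowExternalDataFiltering"] = True

    # Corresponds to "AWS account IDs" in the console.
    allow_list = updated.setdefault("ExternalDataFilteringAllowList", [])
    principal = {"DataLakePrincipalIdentifier": account_id}
    if principal not in allow_list:
        allow_list.append(principal)

    # Corresponds to "Session tag values" in the console.
    tags = updated.setdefault("AuthorizedSessionTagValueList", [])
    if "Amazon EMR" not in tags:
        tags.append("Amazon EMR")
    return updated
```

With boto3, one plausible flow is to read the current settings with `get_data_lake_settings`, pass the `DataLakeSettings` value through this function, and write the result back with `put_data_lake_settings`. Merging first matters because the write replaces the settings as a whole.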
With the infrastructure in place, you’re ready to test out the fine-grained data access for the two Studio users. To recap, the user David should only be able to access non-sensitive customer data. Tina can access data in two tables: sales and product information. Let’s test each user profile.
To test your data access with David’s user profile, complete the following steps:
david-non-sensitive-customer
. The cluster is pre-created in the account.
<StackName>-emr-cluster
. In the role selector pop-up, choose the <StackName>-marketing-data-access-role
. Now let’s query the marketing table from the notebook.
After the cell runs successfully, you can view the first 10 records in the table. Note that you can’t view the customers’ name, as the user only has permissions to read non-sensitive data, through column-level filtering.
Let’s test to make sure David can’t read any sensitive customer data.
This cell should throw an Access Denied error.
Tina’s Studio execution role allows her to access the Lake Formation database using two EMR execution roles. This is achieved by listing the role ARNs in a configuration file in Tina’s file directory. These roles can be set using Studio Lifecycle Configurations to persist the roles across app restarts. To test Tina’s access, complete the following steps:
tina-sales-electronics
. It’s a good practice to close any previous Studio sessions on your browser when switching user profiles. There can only be one active Studio user session at a time.
<StackName>-emr-cluster
. Because Tina’s profile is set up with multiple EMR roles, you’re prompted with a UI drop-down that allows you to connect using multiple roles.
The Studio execution role is also available in the drop-down, because clusters connect using the user’s execution role by default.
You can directly provide Lake Formation access to the user’s execution role as well.
This will automatically create a notebook cell with magic commands to connect to the cluster, using the chosen role.
Now let’s query the sales table from the notebook.
After the cell runs successfully, you can view the first 10 records in the table.
Now let’s try accessing the product table.
<StackName>-electronics-data-access-role
and connect to the cluster.
This cell should complete successfully, and you can view the first 10 records in the products table.
With a single Studio user profile, you have now successfully assumed multiple roles, and queried data in Lake Formation using multiple roles, without the need for restarting the notebooks or creating additional clusters. Now that you’re able to access the data using appropriate roles, you can interactively explore the data, visualize the data, and prepare data for training. You also used different user profiles to provide your users in different teams access to a specific table or columns and rows, without the need for additional clusters.
When you’re finished experimenting with this solution, clean up your resources:
The EMR cluster will be automatically deleted after the idle timeout value.
This post showed you how you can use runtime roles to connect Studio with Amazon EMR to apply fine-grained data access control with Lake Formation. We also demonstrated how multiple Studio users can connect to the same EMR cluster, each using a runtime IAM role scoped with permissions matching their individual level of access to data. We detailed the steps required to manually set up the integration, and provided a CloudFormation template to set up the infrastructure end to end. This feature is available in the following AWS Regions: Europe (Paris), US East (N. Virginia and Ohio), and US West (Oregon). The CloudFormation template deploys in US East (N. Virginia and Ohio) and US West (Oregon).
To learn more about using EMR with SageMaker Studio, visit Prepare Data using Amazon EMR. We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!