Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.
You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.
Alternatively, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option lets you choose the instance type (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, which gives you greater flexibility for data processing and model training.
Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).
All these options allow you to generate and store Spark event logs and analyze them through the web-based user interface commonly called the Spark UI. The Spark UI is served by a Spark History Server and lets you monitor the progress of Spark applications, track resource usage, and debug errors.
In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:
A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli enables managing Spark History Server without leaving SageMaker Studio.
The solution consists of shell scripts that perform the following actions:
sm-spark-cli for a user profile or shared space
To host Spark UI on SageMaker Studio, complete the following steps:
The commands will take a few seconds to complete.
Start the Spark UI with sm-spark-cli and access it from a web browser by running the following code:
sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>
The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.
For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.
The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic commands (%spark, %sql) to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.
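As a rough sketch of what such a configuration can look like, the following snippet renders a %%configure cell that turns on event logging and points it at an S3 prefix. The helper name configure_cell and the bucket path are hypothetical. spark.eventLog.enabled and spark.eventLog.dir are standard Spark properties, but the keys a given kernel accepts in %%configure can differ (AWS Glue Interactive Sessions has its own options), so treat this as an assumption to verify against the kernel's documentation.

```python
import json


def configure_cell(event_logs_s3_uri: str) -> str:
    """Render a sparkmagic %%configure cell that enables Spark event logging.

    Hypothetical helper: the output is text to paste into a notebook cell.
    """
    if not event_logs_s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI for the event log location")
    conf = {
        # Standard Spark properties for writing event logs
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_logs_s3_uri,
    }
    # -f forces the session to be recreated with the new settings
    return "%%configure -f\n" + json.dumps({"conf": conf}, indent=2)


if __name__ == "__main__":
    print(configure_cell("s3://DOC-EXAMPLE-BUCKET/spark-event-logs/"))
```

Paste the printed text into the first cell of the notebook, before any Spark code runs, so the session starts with event logging already enabled.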
For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.
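A minimal sketch of that configuration, assuming version 2 of the SageMaker Python SDK, is shown below. The spark_event_logs_s3_uri argument of PySparkProcessor.run tells the job where to publish its event logs. The role ARN, script location, instance settings, framework version, and the helper names spark_run_kwargs and run_spark_job are illustrative assumptions, not values from this post.

```python
def spark_run_kwargs(submit_app: str, event_logs_s3_uri: str) -> dict:
    """Build the keyword arguments for PySparkProcessor.run (hypothetical helper)."""
    if not event_logs_s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI for the event log location")
    return {
        "submit_app": submit_app,
        # Where the Spark container publishes event logs for the history server
        "spark_event_logs_s3_uri": event_logs_s3_uri,
        "logs": False,
    }


def run_spark_job(role_arn: str, submit_app: str, event_logs_s3_uri: str) -> None:
    """Launch a SageMaker Processing job with Spark event logging enabled."""
    # Imported lazily so the helper above works without the SDK installed
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="sm-spark",      # assumed job name prefix
        framework_version="3.3",       # assumed Spark container version
        role=role_arn,
        instance_count=2,              # assumed cluster size
        instance_type="ml.m5.xlarge",  # assumed instance type
        max_runtime_in_seconds=1200,
    )
    processor.run(**spark_run_kwargs(submit_app, event_logs_s3_uri))
```

Calling run_spark_job with your execution role, the path to a PySpark script, and the S3 prefix you later pass to sm-spark-cli start keeps the log location consistent between the job and the Spark UI.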
Refer to the AWS documentation for additional information:
You can choose the generated URL to access the Spark UI.
The following screenshot shows an example of the Spark UI.
You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio system terminal.
You can also stop the Spark History Server when needed.
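If you prefer to drive the CLI from a script or notebook instead of typing commands, a thin wrapper such as the following may help. It is a sketch under stated assumptions: the start and status subcommands appear in this post, whereas the stop subcommand name is assumed, and the helper names cli_args and run_cli are hypothetical.

```python
import subprocess
from typing import List, Optional

# "start" and "status" appear in this post; "stop" is an assumed subcommand name
ACTIONS = ("start", "status", "stop")


def cli_args(action: str, event_logs_s3_uri: Optional[str] = None) -> List[str]:
    """Build the argument list for an sm-spark-cli invocation."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    if action == "start":
        # start needs the S3 location that holds the Spark event logs
        if not event_logs_s3_uri or not event_logs_s3_uri.startswith("s3://"):
            raise ValueError("start requires an s3:// event log location")
        return ["sm-spark-cli", "start", event_logs_s3_uri]
    return ["sm-spark-cli", action]


def run_cli(action: str, event_logs_s3_uri: Optional[str] = None) -> str:
    """Run sm-spark-cli and return its standard output."""
    result = subprocess.run(
        cli_args(action, event_logs_s3_uri),
        check=True,           # raise if the CLI exits with a non-zero status
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the S3 prefix contains unusual characters.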
As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.
You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.
From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:
After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
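For teams that script their infrastructure in Python, the same two steps (create the lifecycle configuration, then attach it to the domain) can be sketched with boto3 as follows. This is not the post's own command sequence: the configuration name, domain ID, and helper names are hypothetical, and the script text is assumed to be the contents of install-history-server.sh. The API expects the script content to be base64 encoded.

```python
import base64


def lifecycle_config_request(name: str, script_text: str) -> dict:
    """Build the create_studio_lifecycle_config request (hypothetical helper)."""
    content = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return {
        "StudioLifecycleConfigName": name,
        "StudioLifecycleConfigContent": content,  # base64-encoded shell script
        "StudioLifecycleConfigAppType": "JupyterServer",
    }


def domain_update_request(domain_id: str, lcc_arn: str) -> dict:
    """Build the update_domain request that runs the script for all user profiles."""
    return {
        "DomainId": domain_id,
        "DefaultUserSettings": {
            "JupyterServerAppSettings": {
                "DefaultResourceSpec": {
                    "InstanceType": "system",
                    "LifecycleConfigArn": lcc_arn,
                },
                "LifecycleConfigArns": [lcc_arn],
            }
        },
    }


def attach_to_domain(domain_id: str, name: str, script_path: str) -> str:
    """Create the lifecycle configuration and set it as the domain default."""
    import boto3  # imported lazily so the request builders have no dependencies

    with open(script_path, encoding="utf-8") as handle:
        script_text = handle.read()
    client = boto3.client("sagemaker")
    response = client.create_studio_lifecycle_config(
        **lifecycle_config_request(name, script_text)
    )
    arn = response["StudioLifecycleConfigArn"]
    client.update_domain(**domain_update_request(domain_id, arn))
    return arn
```

Note that update_domain replaces the listed Jupyter Server settings, so if the domain already has other lifecycle configurations attached, include their ARNs in LifecycleConfigArns as well.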
In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.
To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:
To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:
In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.
All the code shown as part of this post is available in the GitHub repository.