When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren’t included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:
In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that depend on modules not inherently supported by SageMaker Canvas.
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:
The following diagram is the architecture for the solution.
In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don’t already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.
To create the data flow, follow these steps:
The initial data flow will open with one source and one data type.
The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that will find the total distance using the X and Y coordinates and then drop the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.
Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.
This error occurs because mpmath isn’t a module that is inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
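One way to confirm whether a module can be imported in a given runtime, before relying on it in a custom function, is to probe for it with the standard library. The following is a minimal sketch; the helper name is_module_available is hypothetical and not part of SageMaker Canvas.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# "json" ships with Python; "mpmath" is reported as unavailable in any
# runtime where it has not been installed or added to the import path.
for module in ("json", "mpmath"):
    print(module, "available:", is_module_available(module))
```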
To use a function that relies on a module that isn’t natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: calculate_total_distance, which is compatible with the Python (Pandas) runtime, and udf_total_distance, which is compatible with the Python (PySpark) runtime.
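The original script.py listing isn't reproduced here, so the following is only a sketch of what such a file might look like. It assumes that "total distance" means the Euclidean distance of the X and Y values, that each function takes a DataFrame and returns a DataFrame, and that the new column is named TotalDistance. The total_distance helper and the column name are hypothetical.

```python
"""script.py: hypothetical sketch of a module that depends on mpmath."""
import mpmath  # third-party; bundled next to this file inside the .zip


def total_distance(x, y):
    """Return the Euclidean distance for one (x, y) pair as a float (assumption)."""
    x, y = mpmath.mpf(float(x)), mpmath.mpf(float(y))
    return float(mpmath.sqrt(x * x + y * y))


def calculate_total_distance(df):
    """Python (Pandas) runtime: add TotalDistance, drop the coordinate columns."""
    df = df.copy()
    df["TotalDistance"] = [
        total_distance(x, y)
        for x, y in zip(df["XShippingDistance"], df["YShippingDistance"])
    ]
    return df.drop(columns=["XShippingDistance", "YShippingDistance"])


def udf_total_distance(df):
    """Python (PySpark) runtime: same logic, applied through a Spark UDF."""
    # Imported lazily so the Pandas path does not require PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    distance_udf = F.udf(total_distance, DoubleType())
    return df.withColumn(
        "TotalDistance",
        distance_udf(F.col("XShippingDistance"), F.col("YShippingDistance")),
    ).drop("XShippingDistance", "YShippingDistance")
```

Keeping the per-row calculation in one helper means both runtimes share the same mpmath-dependent logic.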
To make sure the script can run, install mpmath into the same directory as script.py, for example by running pip install mpmath --target . from that directory.
Run zip -r my_project.zip . from that directory to create a .zip file containing the function and the mpmath installation. The current directory now contains a .zip file, our Python script, and the installation our script depends on, as shown in the following screenshot.
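If the zip command-line tool isn't available, the same archive can be built with Python's standard library. This is a minimal sketch rather than part of the original walkthrough; the function name build_project_zip is hypothetical.

```python
import zipfile
from pathlib import Path


def build_project_zip(source_dir: str, zip_path: str) -> list[str]:
    """Zip every file under source_dir, storing paths relative to source_dir.

    Returns the archive member names so the contents can be inspected.
    """
    source = Path(source_dir).resolve()
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives in source_dir.
            if path.is_file() and path != target:
                archive.write(path, arcname=path.relative_to(source).as_posix())
        return archive.namelist()
```

Calling build_project_zip(".", "my_project.zip") from the directory that holds script.py and the mpmath folder should produce an archive with script.py and the mpmath package at its top level, which is the layout the custom transform needs once the archive is extracted.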
After creating the .zip file, upload it to an Amazon S3 bucket.
After the .zip file has been uploaded to Amazon S3, it's accessible in SageMaker Canvas.
Return to the data flow in SageMaker Canvas and replace the prior custom function code with the following code and choose Update.
This example code unzips the .zip file and adds the required dependencies to the local path so they’re available to the function at run time. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and the calculate_total_distance function. To use the Python (PySpark) runtime, update the function_name variable to call the udf_total_distance function instead.
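The unzip-and-import logic described above can be sketched with the standard library alone. The snippet below is an illustration, not the post's exact code: it assumes the .zip file has already been copied from Amazon S3 to local storage (the download step is not shown), and the helper name load_function_from_zip and the paths in the usage comments are hypothetical.

```python
import importlib
import sys
import zipfile
from pathlib import Path


def load_function_from_zip(zip_path, extract_dir, module_name, function_name):
    """Extract a .zip, put its contents on sys.path, and return one function."""
    extract_path = Path(extract_dir).resolve()
    extract_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_path)

    # Make the script and its bundled packages importable.
    if str(extract_path) not in sys.path:
        sys.path.insert(0, str(extract_path))
    importlib.invalidate_caches()

    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Hypothetical usage inside the custom transform:
# function_name = "calculate_total_distance"  # or "udf_total_distance" for PySpark
# transform = load_function_from_zip(
#     "my_project.zip", "/tmp/my_project", "script", function_name
# )
# df = transform(df)
```

Looking the function up by name is what lets a single function_name variable switch between the Pandas and PySpark implementations.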
As a last step, remove irrelevant columns before training the model. Follow these steps:
The final dataset should contain 13 columns. The complete data flow is pictured in the following image.
To train the model, follow these steps:
When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over speed, so the model takes longer to train.
After the model build is complete, you can view the model's accuracy, along with metrics such as F1, precision, and recall. In this example, the Standard build achieved 94.5% accuracy.
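For readers less familiar with these metrics, the following sketch shows how accuracy, precision, recall, and F1 relate to the counts in a binary confusion matrix. The counts in the usage note are made up for illustration and are not taken from the model in this post; the function assumes every denominator is nonzero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion counts.

    Assumes tp + fp, tp + fn, and the total are all greater than zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, classification_metrics(tp=45, fp=5, fn=15, tn=35) describes a model that is right on 80 of 100 shipments, with a precision of 0.9 and a recall of 0.75.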
After the model training is complete, there are four ways you can use your model:
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you’re done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.
In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:
This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.
To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts: