Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.

Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.

To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.

In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.

Solution overview

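The solution uses the following AWS services and resources: SageMaker HyperPod, to create the cluster; AWS Managed Microsoft AD, to create the directory; Elastic Load Balancing, to create a Network Load Balancer in front of the directory; AWS Certificate Manager, to import a self-signed certificate; and Amazon EC2, to create a Windows machine for administering users and groups.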

We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.

The following diagram illustrates the high-level solution architecture.

Architecture diagram for HyperPod and Active Directory integration

In this solution, HyperPod cluster instances connect to AWS Managed Microsoft AD over the LDAPS protocol through a Network Load Balancer (NLB). The NLB terminates TLS using a certificate installed on it, and forwards the traffic to the directory over LDAP (TCP port 389). To configure LDAPS on the HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), an open source client for LDAP/LDAPS.

Prerequisites

This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.

Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.

Create a VPC, subnets, and a security group

Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets with different Availability Zones.

In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and the directory service. If you need to use different networks for the cluster and the directory service, make sure the security groups and route tables are configured so that they can communicate with each other.

Create AWS Managed Microsoft AD on Directory Service

Complete the following steps to set up your directory:

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory type, select AWS Managed Microsoft AD.
  4. Choose Next.
    Directory type selection screen
  5. For Edition, select Standard Edition.
  6. For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
  7. For Admin password, set a password and save it for later use.
  8. Choose Next.
    Directory creation configuration screen
  9. In the Networking section, specify the VPC and two private subnets you created.
  10. Choose Next.
    Directory network configuration screen
  11. Review the configuration and pricing, then choose Create directory.
    Directory creation confirmation screen
    The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
  12. When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
    Directory details screen
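If you prefer to script the console steps above, the same settings can be expressed as a request for the Directory Service `CreateMicrosoftAD` API. The following is a minimal sketch, not part of the original walkthrough: the helper names are hypothetical, and the VPC and subnet IDs are placeholders. It only builds and inspects plain dictionaries, so it runs without AWS credentials.

```python
from typing import Any, Dict, List


def build_directory_request(dns_name: str, admin_password: str,
                            vpc_id: str, subnet_ids: List[str]) -> Dict[str, Any]:
    """Build keyword arguments mirroring the console choices (Standard Edition)."""
    if len(subnet_ids) != 2:
        # The walkthrough specifies two private subnets in different Availability Zones.
        raise ValueError("expected exactly two subnet IDs")
    return {
        "Name": dns_name,
        "Password": admin_password,
        "Edition": "Standard",
        "VpcSettings": {"VpcId": vpc_id, "SubnetIds": list(subnet_ids)},
    }


def active_dns_addresses(description: Dict[str, Any]) -> List[str]:
    """Return the directory's DNS addresses once it is Active, else an empty list."""
    if description.get("Stage") != "Active":
        return []
    return list(description.get("DnsIpAddrs", []))


if __name__ == "__main__":
    # Placeholder identifiers; replace them with values from your CloudFormation stack.
    request = build_directory_request(
        "hyperpod.abc123.com", "example-password", "vpc-0123", ["subnet-aaa", "subnet-bbb"]
    )
    print(request["VpcSettings"])
    print(active_dns_addresses({"Stage": "Creating"}))
```

With boto3 installed and credentials configured, the request could be passed as `boto3.client("ds").create_microsoft_ad(**request)`, and each entry of `describe_directories(...)["DirectoryDescriptions"]` could be passed to `active_dns_addresses` while you wait for the Active status.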

Create an NLB in front of Directory Service

To create the NLB, complete the following steps:

  1. On the Amazon EC2 console, choose Target groups in the navigation pane.
  2. Choose Create target group.
  3. Create a target group with the following parameters:
    1. For Choose a target type, select IP addresses.
    2. For Target group name, enter LDAP.
    3. For Protocol: Port, choose TCP and enter 389.
    4. For IP address type, select IPv4.
    5. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    6. For Health check protocol, choose TCP.
  4. Choose Next.
    Load balancing target creation configuration screen
  5. In the Register targets section, register the directory service’s DNS addresses as the targets.
  6. For Ports, choose Include as pending below.
    Load balancing target registration screen
    The addresses are added in the Review targets section with Pending status.
  7. Choose Create target group.
    Load balancing target review screen
  8. On the Load Balancers console, choose Create load balancer.
  9. Under Network Load Balancer, choose Create.
    Load balancer type selection screen
  10. Configure an NLB with the following parameters:
    1. For Load balancer name, enter a name (for example, nlb-ds).
    2. For Scheme, select Internal.
    3. For IP address type, select IPv4.
      NLB creation basic configuration section
    4. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    5. Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
    6. For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
      NLB creation network mapping and security groups configurations
  11. In the Listeners and routing section, specify the following parameters:
    1. For Protocol, choose TCP.
    2. For Port, enter 389.
    3. For Default action, choose the target group named LDAP.

    Here, we are adding a listener for LDAP. We will add LDAPS later.

  12. Choose Create load balancer.
    NLB listeners routing configuration screen
    Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
  13. When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
    NLB details screen
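The target group and NLB settings chosen above map onto the Elastic Load Balancing v2 API parameters. The sketch below is an illustration under assumptions rather than the procedure used in this post: the function names are hypothetical and the IDs are placeholders. It builds the request dictionaries only, so it can be run and inspected without calling AWS.

```python
import ipaddress
from typing import Any, Dict, List

LDAP_PORT = 389  # plain LDAP port used by the target group in this post


def build_target_group_request(vpc_id: str) -> Dict[str, Any]:
    """Keyword arguments for a TCP:389 target group with IP address targets."""
    return {
        "Name": "LDAP",
        "Protocol": "TCP",
        "Port": LDAP_PORT,
        "VpcId": vpc_id,
        "TargetType": "ip",
        "IpAddressType": "ipv4",
        "HealthCheckProtocol": "TCP",
    }


def build_targets(dns_addresses: List[str]) -> List[Dict[str, Any]]:
    """Turn the directory's DNS addresses into target entries on port 389."""
    targets = []
    for address in dns_addresses:
        ipaddress.IPv4Address(address)  # raises ValueError if the address is malformed
        targets.append({"Id": address, "Port": LDAP_PORT})
    return targets


def build_load_balancer_request(name: str, subnet_ids: List[str],
                                security_group_id: str) -> Dict[str, Any]:
    """Keyword arguments for an internal, IPv4 Network Load Balancer."""
    return {
        "Name": name,
        "Type": "network",
        "Scheme": "internal",
        "IpAddressType": "ipv4",
        "Subnets": list(subnet_ids),
        "SecurityGroups": [security_group_id],
    }


if __name__ == "__main__":
    # Placeholder values; use the DNS addresses you noted from the directory.
    print(build_target_group_request("vpc-0123"))
    print(build_targets(["10.1.0.10", "10.1.1.10"]))
    print(build_load_balancer_request("nlb-ds", ["subnet-aaa", "subnet-bbb"], "sg-0123"))
```

With boto3, these dictionaries would correspond to `create_target_group(**...)`, `register_targets(TargetGroupArn=..., Targets=...)`, and `create_load_balancer(**...)` on the `elbv2` client.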

Create a self-signed certificate and import it to Certificate Manager

To create a self-signed certificate, complete the following steps:

  1. On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
    $ openssl genrsa 2048 > ldaps.key
    
    $ openssl req -new -key ldaps.key -out ldaps_server.csr
    
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:US
    State or Province Name (full name) [Some-State]:Washington
    Locality Name (eg, city) []:Bellevue
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
    Organizational Unit Name (eg, section) []:OrgName
    Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
    Email Address []:your@email.address.com
    
    Please enter the following 'extra' attributes
    to be sent with your certificate request
    A challenge password []:
    An optional company name []:
    
    $ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
    
    Certificate request self-signature ok
    subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com
    
    $ chmod 600 ldaps.key

  2. On the Certificate Manager console, choose Import.
  3. Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
  4. Choose Next.
    Certificate importing screen
  5. Add any optional tags, then choose Next.
    Certificate tag editing screen
  6. Review the configuration and choose Import.
    Certificate import review screen
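The interactive OpenSSL prompts in step 1 can also be answered on the command line with the `-subj` option, which is convenient when you regenerate the certificate. The following sketch only assembles the command lines; it does not run them. It assumes a subject containing just the Common Name, which must be the DNS name of your NLB, and the function name is hypothetical.

```python
import shlex
from typing import List


def build_openssl_commands(nlb_dns_name: str, days: int = 365,
                           key_file: str = "ldaps.key",
                           csr_file: str = "ldaps_server.csr",
                           cert_file: str = "ldaps.crt") -> List[List[str]]:
    """Return the three OpenSSL invocations as argument lists (nothing is executed)."""
    # Assumption: only the Common Name is set; add other DN fields if you need them.
    subject = f"/CN={nlb_dns_name}"
    return [
        # 1. Generate a 2048-bit RSA private key.
        ["openssl", "genrsa", "-out", key_file, "2048"],
        # 2. Create a certificate signing request without interactive prompts.
        ["openssl", "req", "-new", "-key", key_file, "-out", csr_file, "-subj", subject],
        # 3. Self-sign the request to produce the certificate.
        ["openssl", "x509", "-req", "-sha256", "-days", str(days),
         "-in", csr_file, "-signkey", key_file, "-out", cert_file],
    ]


if __name__ == "__main__":
    for command in build_openssl_commands("nlb-ds-abcd1234.elb.region.amazonaws.com"):
        print(shlex.join(command))
```

Printing the commands first lets you review them before running them in your shell or passing each list to `subprocess.run`.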

Add an LDAPS listener

We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:

  1. On the Load Balancers console, navigate to the NLB details page.
  2. On the Listeners tab, choose Add listener.
    NLB listeners screen with add listener button
  3. Configure the listener with the following parameters:
    1. For Protocol, choose TLS.
    2. For Port, enter 636.
    3. For Default action, choose LDAP.
    4. For Certificate source, select From ACM.
    5. For Certificate, choose the certificate you imported into ACM.
  4. Choose Add.
    NLB listener configuration screen
    Now the NLB listens for both LDAP and LDAPS. We recommend deleting the LDAP listener because, unlike LDAPS, it transmits data without encryption.
    NLB listeners list with LDAP and LDAPS
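The listener settings above show where TLS termination happens: the listener accepts TLS on port 636 and forwards to the plain LDAP target group. As a hedged sketch, the same configuration could be expressed as a request for the Elastic Load Balancing v2 `CreateListener` API. The function name and the ARNs are placeholders, not values from this post.

```python
from typing import Any, Dict

LDAPS_PORT = 636  # port the NLB listens on for LDAP over TLS


def build_ldaps_listener_request(load_balancer_arn: str, target_group_arn: str,
                                 certificate_arn: str) -> Dict[str, Any]:
    """Keyword arguments for a TLS:636 listener that forwards to the LDAP target group."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TLS",
        "Port": LDAPS_PORT,
        # The certificate imported into ACM is presented to LDAPS clients.
        "Certificates": [{"CertificateArn": certificate_arn}],
        # Decrypted traffic is forwarded to the target group created earlier.
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


if __name__ == "__main__":
    request = build_ldaps_listener_request(
        "arn:aws:elasticloadbalancing:placeholder-nlb",
        "arn:aws:elasticloadbalancing:placeholder-target-group",
        "arn:aws:acm:placeholder-certificate",
    )
    print(request)
```

With boto3, the dictionary would be passed as `boto3.client("elbv2").create_listener(**request)`.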

Create an EC2 Windows instance to administer users and groups in the AD

To create and maintain users and groups in the AD, complete the following steps:

  1. On the Amazon EC2 console, choose Instances in the navigation pane.
  2. Choose Launch instances.
  3. For Name, enter a name for your instance.
  4. For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
  5. For Instance type, choose t2.micro.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose either of the two subnets you created with the CloudFormation template.
    3. For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
  7. For Configure storage, set storage to 30 GB gp2.
  8. In the Advanced details section, for Domain join directory, choose the AD you created.
  9. For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
  10. Review the summary and choose Launch instance.

Create users and groups in AD using the EC2 Windows instance

With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.

If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.

  1. When the Windows desktop screen opens, choose Server Manager from the Start menu.
    Dashboard screen on Server Manager
  2. Choose Local Server in the navigation pane, and confirm that the domain is the one you specified for the directory service.
    Local Server screen on Server Manager
  3. On the Manage menu, choose Add Roles and Features.
    Drop down menu opened from Manage button
  4. Choose Next until you are at the Features page.
    Add Roles and Features Wizard
  5. Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
  6. Choose Next and Install.
    Features selection screen
    Feature installation starts.
  7. When the installation is complete, choose Close.
    Feature installation progress screen
  8. Open Active Directory Users and Computers from the Start menu.
    Active Directory Users and Computers window
  9. Under hyperpod.abc123.com, expand hyperpod.
  10. Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
    Context menu opened to create an Organizational Unit
  11. Create an organizational unit called Groups.
    Organizational Unit creation dialog
  12. Choose (right-click) Groups, choose New, and choose Group.
    Context menu opened to create groups
  13. Create a group called ClusterAdmin.
    Group creation dialog for ClusterAdmin
  14. Create a second group called ClusterDev.
    Group creation dialog for ClusterDev
  15. Choose (right-click) Users, choose New, and choose User.
  16. Create a new user.
    User creation dialog
  17. Choose (right-click) the user and choose Add to a group.
    Context menu opened to add a user to a group
  18. Add your users to the groups ClusterAdmin or ClusterDev.
    Group selection screen to add a user to a group
    Users added to the ClusterAdmin group will have sudo privilege on the cluster.

Create a ReadOnly user in AD

Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.

User creation dialog to create ReadOnly user

Take note of the password for later use.

Password entering screen for ReadOnly user

(For SSH public key authentication) Add SSH public keys to users

By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.

  1. In Active Directory Users and Computers, on the View menu, enable Advanced Features.
    View menu opened to enable Advanced Features
  2. Open the Properties dialog of the user.
  3. On the Attribute Editor tab, choose altSecurityIdentities, then choose Edit.
    Attribute Editor tab on User Properties dialog
  4. For Value to add, choose Add.
  5. For Values, add an SSH public key.
  6. Choose OK.
    Attribute editing dialog for altSecurityIdentities
    Confirm that the SSH public key appears as an attribute.
    Attribute Editor tab with altSecurityIdentities configured
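Because the key is pasted by hand into the attribute editor, it can help to check the line on your local machine first. The sketch below is an optional sanity check, not part of the original procedure. It relies on the OpenSSH public key format, in which the base64 blob starts with a length-prefixed copy of the key type, and the function name is hypothetical. It does not verify that the key is cryptographically valid.

```python
import base64
import binascii
import struct


def is_plausible_ssh_public_key(line: str) -> bool:
    """Check that a line looks like '<key-type> <base64-blob> [comment]'."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        return False
    key_type, encoded_blob = parts[0], parts[1]
    try:
        blob = base64.b64decode(encoded_blob, validate=True)
    except (binascii.Error, ValueError):
        return False
    if len(blob) < 4:
        return False
    # The blob begins with a 4-byte big-endian length followed by the key type name.
    (name_length,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + name_length]
    return embedded_type == key_type.encode("utf-8")


if __name__ == "__main__":
    # Build a dummy blob with the right framing (the key bytes themselves are zeros).
    name = b"ssh-ed25519"
    dummy_blob = struct.pack(">I", len(name)) + name + struct.pack(">I", 32) + bytes(32)
    dummy_line = "ssh-ed25519 " + base64.b64encode(dummy_blob).decode("ascii") + " user1@laptop"
    print(is_plausible_ssh_public_key(dummy_line))
    print(is_plausible_ssh_public_key("-----BEGIN OPENSSH PRIVATE KEY-----"))
```

A check like this mainly catches copy-paste mistakes, such as a truncated key or a private key file pasted in place of the `.pub` file.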

Get an obfuscated password for the ReadOnly user

To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).

Install the sssd-tools package on the Linux machine. It provides the Python module pysss, which is used for the obfuscation:

# Ubuntu
$ sudo apt install sssd-tools

# Amazon Linux
$ sudo yum install sssd-tools

Run the following one-line Python script. Input the password of the ReadOnly user. You will get the obfuscated password.

$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password) 
AAAQACK2....

Create a HyperPod cluster with an SSSD-enabled lifecycle script

Next, you create a HyperPod cluster with LDAPS/Active Directory integration.

  1. Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
    1. Set enable_sssd to True to enable SSSD setup.
    2. The SssdConfig class contains configuration parameters for SSSD.
    3. Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
    # Basic configuration parameters
    class Config:
             :
        # Set to True if you want to install SSSD for ActiveDirectory/LDAP integration.
        # You need to configure parameters in SssdConfig as well.
        enable_sssd = True

    # Configuration parameters for ActiveDirectory/LDAP/SSSD
    class SssdConfig:
    
        # Name of domain. Can be default if you are not sure.
        domain = "default"
    
        # Comma separated list of LDAP server URIs
        ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"
    
        # The default base DN to use for performing LDAP user operations
        ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"
    
        # The default bind DN to use for performing LDAP operations
        ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    
        # "password" or "obfuscated_password". Obfuscated password is recommended.
        ldap_default_authtok_type = "obfuscated_password"
    
        # You need to modify this parameter with the obfuscated password, not plain text password
        ldap_default_authtok = "placeholder"
    
        # SSH authentication method - "password" or "publickey"
        ssh_auth_method = "publickey"
    
        # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
        override_homedir = "/fsx/%u"
    
        # Group names to accept SSH login
        ssh_allow_groups = {
            "controller" : ["ClusterAdmin", "ubuntu"],
            "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
            "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        }
    
        # Group names for sudoers
        sudoers_groups = {
            "controller" : ["ClusterAdmin", "ClusterDev"],
            "compute" : ["ClusterAdmin", "ClusterDev"],
            "login" : ["ClusterAdmin", "ClusterDev"],
        }
    

  2. Copy the certificate file ldaps.crt to the same directory (where config.py exists).
  3. Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
  4. Wait until the status changes to InService.
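The `ldap_search_base` and `ldap_default_bind_dn` values in `SssdConfig` follow directly from the directory DNS name, so they are easy to mistype. The following sketch derives both strings. It assumes, as in this post, that the top-level organizational unit is named after the first label of the DNS name and that the ReadOnly user lives in the Users OU beneath it. The function names are hypothetical and are not part of the lifecycle scripts.

```python
from typing import List


def _dns_labels(dns_name: str) -> List[str]:
    """Split a directory DNS name such as 'hyperpod.abc123.com' into its labels."""
    labels = [label for label in dns_name.strip(".").split(".") if label]
    if not labels:
        raise ValueError("directory DNS name must not be empty")
    return labels


def search_base_from_dns(dns_name: str) -> str:
    """Build the base DN, for example 'dc=hyperpod,dc=abc123,dc=com'."""
    return ",".join(f"dc={label}" for label in _dns_labels(dns_name))


def bind_dn_for_user(user_name: str, dns_name: str) -> str:
    """Build the bind DN of a user stored in the Users OU of the directory."""
    labels = _dns_labels(dns_name)
    # Assumption: the top-level OU matches the first label of the DNS name.
    top_level_ou = labels[0]
    domain_components = ",".join(f"DC={label}" for label in labels)
    return f"CN={user_name},OU=Users,OU={top_level_ou},{domain_components}"


if __name__ == "__main__":
    print(search_base_from_dns("hyperpod.abc123.com"))
    print(bind_dn_for_user("ReadOnly", "hyperpod.abc123.com"))
```

For the example directory name used in this post, the two printed values match the `ldap_search_base` and `ldap_default_bind_dn` settings shown in the configuration above.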

Verification

Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.

Option 1: SSH login through AWS Systems Manager

You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.

Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p

Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.

$ ssh MyCluster-LoginNode
   :
   :
   ____              __  ___     __             __ __                  ___          __
  / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
 _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
/___/_,_/_, /__/_/  /_/_,_/_/_\__/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
         /___/                                    /___/_/
You're on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$
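When several users or nodes need host entries, the Systems Manager target name and the SSH configuration entry can be generated rather than typed. The sketch below reproduces the format described above. The function names are hypothetical, and the AWS CLI profile and Region arguments are placeholders you would adapt.

```python
def ssm_target(cluster_id: str, instance_group_name: str, instance_id: str) -> str:
    """Build 'sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]'."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


def ssh_config_entry(alias: str, target: str, user: str, identity_file: str,
                     region: str, profile: str = "default") -> str:
    """Render a ~/.ssh/config host entry that proxies SSH through Systems Manager."""
    proxy_command = (
        f"aws --profile {profile} --region {region} ssm start-session "
        "--target %h --document-name AWS-StartSSHSession --parameters portNumber=%p"
    )
    lines = [
        f"Host {alias}",
        f"    HostName {target}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
        f"    ProxyCommand {proxy_command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    target = ssm_target("abcd1234", "LoginGroup", "i-01234567890abcdef")
    print(ssh_config_entry("MyCluster-LoginNode", target, "user1",
                           "~/keys/my-cluster-ssh-key.pem", "us-west-2"))
```

The printed entry can be appended to `~/.ssh/config`. For password authentication, you would drop the IdentityFile line.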

At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}

For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
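If you manage several clusters, the example policy can be generated from its three variable parts: Region, account ID, and cluster ID. The following sketch mirrors the structure of the example policy above. The function name is hypothetical, and you should review the generated resource ARNs against your own environment before attaching the policy.

```python
import json
from typing import Any, Dict


def build_ssh_only_policy(region: str, account_id: str, cluster_id: str) -> Dict[str, Any]:
    """Build a policy that allows sessions only through the AWS-StartSSHSession document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession", "ssm:TerminateSession"],
                "Resource": [
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    f"arn:aws:ssm:{region}:{account_id}:document/AWS-StartSSHSession",
                ],
                # Require that the session document is checked against this policy.
                "Condition": {
                    "BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_ssh_only_policy("us-west-2", "123456789012", "abcd1234efgh")
    print(json.dumps(policy, indent=4))
```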

Option 2: SSH login through bastion host

Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.

  1. Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
  2. Update the security group for the cluster to allow inbound SSH access from the bastion security group.
  3. Create an EC2 Linux instance.
  4. For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
  5. For Instance type, choose t3.small.
  6. In the Network settings section, provide the following parameters:
    1. For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
    2. For Subnet, choose the public subnet you created with the CloudFormation template.
    3. For Common security groups, choose the bastion security group you created.
  7. For Configure storage, set storage to 8 GB.
  8. Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config, by referring to the following example:
    Host Bastion
        HostName 11.22.33.44
        User ubuntu
        IdentityFile ~/keys/my-bastion-ssh-key.pem
    
    Host MyCluster-LoginNode-with-Proxy
        HostName 10.1.111.222
        User user1
        IdentityFile ~/keys/my-cluster-ssh-key.pem
        ProxyCommand ssh -q -W %h:%p Bastion

  9. Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
    $ ssh MyCluster-LoginNode-with-Proxy
       :
       :
       ____              __  ___     __             __ __                  ___          __
      / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ ___  ___/ /
     _ / _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ / -_) __/ ___/ _ / _  /
    /___/_,_/_, /__/_/  /_/_,_/_/_\__/_/   /_//_/_, / .__/__/_/ /_/   ___/_,_/
             /___/                                    /___/_/
    You're on the controller
    Instance Type: ml.m5.xlarge
    user1@ip-10-1-111-222:~$
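Steps 1 and 2 above come down to two inbound rules: SSH from your local address into the bastion security group, and SSH from the bastion security group into the cluster security group. As a rough sketch, these rules can be written in the shape that the Amazon EC2 `AuthorizeSecurityGroupIngress` API expects for its `IpPermissions` parameter. The function names, CIDR range, and group ID are placeholders.

```python
import ipaddress
from typing import Any, Dict

SSH_PORT = 22


def ssh_ingress_from_cidr(cidr: str) -> Dict[str, Any]:
    """Rule for the bastion security group: allow SSH from a CIDR range."""
    ipaddress.ip_network(cidr)  # raises ValueError if the CIDR is malformed
    return {
        "IpProtocol": "tcp",
        "FromPort": SSH_PORT,
        "ToPort": SSH_PORT,
        "IpRanges": [{"CidrIp": cidr}],
    }


def ssh_ingress_from_group(source_group_id: str) -> Dict[str, Any]:
    """Rule for the cluster security group: allow SSH from the bastion security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": SSH_PORT,
        "ToPort": SSH_PORT,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


if __name__ == "__main__":
    # Placeholder values: your public address as a /32 and the bastion security group ID.
    print(ssh_ingress_from_cidr("203.0.113.10/32"))
    print(ssh_ingress_from_group("sg-0bastion"))
```

With boto3, each rule would be passed as `authorize_security_group_ingress(GroupId=..., IpPermissions=[rule])` on the `ec2` client. Restricting the first rule to a single /32 address keeps the bastion from accepting SSH from the whole internet.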

Clean up

Clean up the resources in the following order:

  1. Delete the HyperPod cluster.
  2. Delete the Network Load Balancer.
  3. Delete the load balancing target group.
  4. Delete the certificate imported to Certificate Manager.
  5. Delete the EC2 Windows instance.
  6. Delete the EC2 Linux instance for the bastion host.
  7. Delete the AWS Managed Microsoft AD.
  8. Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.

Conclusion

This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.

For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.


About the Authors

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he applies his in-depth skills to cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.

Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering experience and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.