ML 9589 image001

Distributed training with Amazon EKS and Torch Distributed Elastic

Distributed deep learning model training is becoming increasingly important as data sizes are growing in many industries. Many applications in computer vision and natural language processing now require training of deep learning models, which are growing exponentially in complexity and are often trained with hundreds of terabytes of data. It then becomes important to use …

12A4WmaVC20kLepBrUPtWlJDA 1

The Benefits of Running Kubernetes on Ephemeral Compute

This is the third post in our blog series on Rubix (#1, #2), our effort to rebuild our cloud architecture around Kubernetes. Introduction The advent of containers and their orchestration platforms, most popularly Kubernetes (K8s), has familiarized engineers with the concept of ephemeral compute: once a workload has completed, the resources used to run it …