Categories: AI/ML Research

Training a Model on Multiple GPUs with Data Parallelism

This article is divided into two parts:

• Data Parallelism
• Distributed Data Parallelism

If you have multiple GPUs, you can combine them to operate as a single GPU with greater memory capacity.

