Training a ResNet-50 ImageNet Model using PyTorch on multiple AWS g4 or p3 Instances

Introduction

This document follows on from the blog post titled “Tutorial: Getting started with a ML training model using AWS & PyTorch,” a tutorial that helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners just starting out to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.

This tutorial is for readers who have determined that a multi-node AWS g4 or p3 instance is right for their machine learning workload.

Prepping the Model

As explained previously in this series, increasing your GPU node count helps speed up training, which is where the multi-node g4 and p3 instances come in.

The setup process for a multi-node g4 or p3 instance is exactly the same as for a single node: simply choose your instance type and move forward. (In this case you will be selecting one of the following multi-node instances: g4dn.12xlarge, g4dn.metal, p3.8xlarge, p3.16xlarge, or p3dn.24xlarge.)

As with the single-node setup, your technology stack will include the following (a sample launch command follows the list):

  • Model to train: ResNet-50
  • Machine learning framework: TensorFlow
  • Distributed training framework: Horovod
  • Multi node – Single/Multi GPU
  • Instance: p3.2xlarge or greater
  • AMI: Deep Learning AMI (v33 – Amazon Linux)
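
For context, Horovod starts one training worker per GPU on each node listed in a hosts file and averages gradients between the workers. The train.sh script used later in this tutorial wraps a launch command roughly along these lines; this is only an illustrative sketch, and the script name train.py is a placeholder rather than the AMI's actual entry point:

# Illustrative only: 2 nodes x 1 GPU each, TensorFlow + Horovod
# -np is the total number of workers; -H lists each host with its GPU (slot) count
horovodrun -np 2 -H localhost:1,<Private IP from Node 2>:1 python train.py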

Step 1 - Create an AMI image from the Single GPU use case

(If you chose to use our pre-staged AMI from the previous training (Tutorial 1), ami-0e22bababb010e6c5 (us-east-1), skip ahead to Step 2 and launch a second instance with the same AMI ID.)

a) Go to EC2 Dashboard in AWS Console

b) Right-click on the instance, go to Instance State, and click Stop

ResNet-50 ImageNet Model screen grab 1

c) Once the instance has stopped, right-click it again, choose Image, then Create Image; enter an image name and click “Create Image”

ResNet-50 ImageNet Model screen grab 2
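
If you prefer the command line, this step can also be scripted with the AWS CLI; the instance ID and image name below are placeholders for your own values:

# Stop the single-GPU instance, then create an AMI from it
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "resnet50-single-gpu" --region us-east-1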

Step 2 - Create a new instance with the AMI Image created

a) Click on “Launch Instance”

ResNet-50 ImageNet Model screen grab 3

b) Go to My AMIs and select the image you created (or the pre-staged image)

ResNet-50 ImageNet Model screen grab 4

c) Choose Instance Type: p3.2xlarge instance

d) Configure Instance: select the default subnet in us-east-1

ResNet-50 ImageNet Model screen grab 5

e) Add storage: Default

f) Add new Tag with:

Key: Name

Value: p3 – Node 2

ResNet-50 ImageNet Model screen grab 6

g) Security Group: select the security group created previously (in Tutorial 1)

ResNet-50 ImageNet Model screen grab 7

h) Review and Launch the instance

i) Select the existing key pair

ResNet-50 ImageNet Model screen grab 8
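
Steps 2a through 2i can be approximated with a single AWS CLI call as well; the AMI ID, key pair, security group, and subnet below are placeholders for your own values:

# Launch the second node from the AMI created in Step 1
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type p3.2xlarge \
  --key-name <your key pair name> \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx \
  --subnet-id subnet-xxxxxxxxxxxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=p3 - Node 2}]'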

Step 3 - Create a New Security Group

a) In EC2 Dashboard, go to “Security Groups” and click “Create security group”

ResNet-50 ImageNet Model screen grab 9

b) Add a security group name and description, and set the inbound and outbound rules to “All traffic”

ResNet-50 ImageNet Model screen grab 10

c) Press “Create Security Group”

d) Attach this security group to both nodes, p3 – Node 1 and p3 – Node 2

ResNet-50 ImageNet Model screen grab 11

e) Add the new security group without removing the previous one

ResNet-50 ImageNet Model screen grab 12
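
A rough AWS CLI equivalent of this step is shown below; the group name, VPC ID, security group IDs, and instance IDs are placeholders, and opening all traffic simply mirrors the console steps above (you may prefer to restrict the rules to traffic from the group itself):

# Create the group and open inbound traffic (outbound is open by default)
aws ec2 create-security-group --group-name horovod-cluster --description "All traffic between training nodes" --vpc-id vpc-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-yyyyyyyy --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
# --groups replaces the full list, so include the existing group alongside the new one
aws ec2 modify-instance-attribute --instance-id i-<node 1 id> --groups sg-<existing group> sg-yyyyyyyy
aws ec2 modify-instance-attribute --instance-id i-<node 2 id> --groups sg-<existing group> sg-yyyyyyyy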

Step 4 - Start both instances: p3 - Node 1 and p3 - Node 2

ResNet-50 ImageNet Model screen grab 13

a) Copy the IPv4 Public IP from Node 1

ResNet-50 ImageNet Model screen grab 14

b) Copy the Private IP from Node 2

ResNet-50 ImageNet Model screen grab 15
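
If you would rather not copy the addresses from the console, a describe-instances query can fetch them; adjust the Name tag values to match exactly how you tagged your instances:

# Public IP of Node 1 and private IP of Node 2, looked up by Name tag
aws ec2 describe-instances --filters "Name=tag:Name,Values=p3 - Node 1" --query "Reservations[].Instances[].PublicIpAddress" --output text
aws ec2 describe-instances --filters "Name=tag:Name,Values=p3 - Node 2" --query "Reservations[].Instances[].PrivateIpAddress" --output text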

Step 5 - From your local device, use SCP to copy the .pem key pair created in the previous tutorial to p3 - Node 1

a) Move to the directory where you downloaded the key pair (*.pem). Throughout these steps, replace the placeholder text in angle brackets with your own information.

cd <key_pair_directory>

b) Copy the key pair to your instance using SCP

scp -i <your .pem filename> <your .pem filename> ec2-user@<your instance IPv4 Public IP>:/home/ec2-user/examples/horovod/tensorflow/
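
If SCP or SSH rejects the key with a permissions warning, tighten the key file's permissions first:

# Private keys must not be readable by other users
chmod 400 <your .pem filename>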

Step 6 - Connect to your first instance (p3 - Node 1)

ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP from Node 1>

Step 7 - Train the model

a) Move to the following folder:

cd ~/examples/horovod/tensorflow

b) Use vim to edit the hosts file

vim hosts

The hosts file should contain:

localhost slots=1

<Private IP from Node 2> slots=1

c) Add the SSH key used by the member instances to the ssh-agent

eval `ssh-agent -s`
ssh-add <your .pem filename>
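
Before launching the training run, it is worth confirming from Node 1 that Node 2 is reachable over SSH without a password prompt, since Horovod relies on this to start the remote workers:

# Should print Node 2's hostname without asking for a password or key file
ssh -o StrictHostKeyChecking=no ec2-user@<Private IP from Node 2> hostname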

d) Now, run the script to start training the model

./train.sh 2

e) After a few seconds you will see the results

ResNet-50 ImageNet Model screen grab 16

Avg Speed: 200

Step 8 - When finished or canceled, stop or terminate the instances

If you want to try with more nodes and/or GPUs, modify the hosts file with the number of slots (GPUs) in each node, and then pass the total number of GPUs when you run the script, as in the example below:

(./train.sh <num of GPUs>)
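
For example, with two p3.16xlarge nodes (8 GPUs each), the hosts file would contain:

localhost slots=8

<Private IP from Node 2> slots=8

and the script would be launched with the total GPU count across both nodes:

./train.sh 16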

Results | Training #1 - ResNet-50 ImageNet Model on Multiple GPUs

Before starting, it is important to note in the NCCL debug output that for instance types other than the p3dn.24xlarge, the EFA provider is not supported.

ResNet-50 ImageNet Model screen grab 17
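
NCCL prints this information when debug logging is enabled. If your run does not show it, you can turn the logging on before launching the training script (NCCL_DEBUG is a standard NCCL environment variable; the AMI's train.sh may already set it):

# Ask NCCL to log which transport it selects (e.g. EFA vs. plain TCP sockets)
export NCCL_DEBUG=INFO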

Results Test 1

Type: Multi Node – Multi GPU

Number of instances: 2

Instance: p3.8xlarge

  • GPU: 4 GPU NVIDIA Tesla V100
  • GPU Memory: 64 GiB
  • Network Bandwidth: 10 Gbps

Result: Speed/ 50 Steps: ~770

ResNet-50 ImageNet Model screen grab 18

Results Test 2

Type: Single Node – Multi GPU

Number of Instances: 1

Instance: p3.16xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed/ 50 Steps: ~910

ResNet-50 ImageNet Model screen grab 19

Results Test 3

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3.16xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed/ 50 Steps: ~11500

ResNet-50 ImageNet Model screen grab 20

Results Test 4

Type: Multi Node – Multi GPU

Number of Instances: 4

Instance: p3.16xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 128 GiB
  • Network Bandwidth: 25 Gbps

Result: Speed/ 50 Steps: ~22500

ResNet-50 ImageNet Model screen grab 21

Results Test 5

Type: Single Node – Multi GPU

Number of Instances: 1

Instance: p3dn.24xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed/ 50 Steps: ~700

ResNet-50 ImageNet Model screen grab 22

Results Test 6

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3dn.24xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed/ 50 Steps: ~11180

ResNet-50 ImageNet Model screen grab 23

Results Test 7

Type: Multi Node – Multi GPU

Number of Instances: 2

Instance: p3dn.24xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Since these runs use p3dn instances, let us look at the NCCL debug output to confirm that the EFA provider is enabled:

ResNet-50 ImageNet Model screen grab 24

Result: Speed / 50 Steps: ~11450

ResNet-50 ImageNet Model screen grab 25

Results Test 8

Type: Multi Node – Multi GPU

Number of Instances: 4

Instance: p3dn.24xlarge

  • GPU: 8 GPU NVIDIA Tesla V100
  • GPU Memory: 256 GiB
  • Network Bandwidth: 100 Gbps

Result: Speed / 50 Steps: ~22200

ResNet-50 ImageNet Model screen grab 26

Getting Help

Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.
