Training a ResNet-50 ImageNet Model using PyTorch on multiple AWS g4 or p3 Instances
This document follows on from the blog post “Tutorial: Getting started with a ML training model using AWS & PyTorch”, which helps researchers prepare a training model to run on the AWS cloud using NVIDIA GPU-capable instances (including g4, p3, and p3dn instances). That guide, intended for everyone from beginners to skilled practitioners, focuses on choosing the right platform for the machine learning model you want to deploy.
The tutorial below is for people who have determined that a multi-node AWS g4 or p3 instance setup is right for their machine learning workload.
As explained previously in this tutorial, increasing your GPU node count helps speed up results, which is where the multi-node g4 and p3 instances come in.
The setup process is the same as for a single node: simply choose your instance type and move forward. (In this case you will be selecting one of the following multi-node instances: g4dn.12xlarge, g4dn.metal, p3.8xlarge, p3.16xlarge, or p3dn.24xlarge.)
As with the single-node setup, you will need the following:
Your technology stack will include the following:
(If you chose to use our pre-staged AMI from the previous training (Tutorial 1), ami-0e22bababb010e6c5 (us-east-1), please skip ahead to step 2 and launch a second instance with the same AMI ID.)
a) Go to the EC2 Dashboard in the AWS Console
b) Right-click on the instance, go to Instance State, and click Stop
c) Enter an image name and click “Create Image”
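If you prefer the command line, this step can also be done with the AWS CLI. A minimal sketch, assuming the AWS CLI is configured for us-east-1 and using a placeholder instance ID:
# Stop the running instance (replace the placeholder instance ID with your own)
aws ec2 stop-instances --instance-ids <your instance id>
aws ec2 wait instance-stopped --instance-ids <your instance id>
# Create an AMI from the stopped instance
aws ec2 create-image --instance-id <your instance id> --name "p3-node-image"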
a) Click on “Launch Instance”
b) Go to My AMIs and select the image you created (or the prebaked image)
c) Choose Instance Type: p3.2xlarge instance
d) Configure Instance: Select Subnet default us-east-1
e) Add storage: Default
f) Add new Tag with:
Key: Name
Value: p3 – Node 2
g) Security Group: Select the one previously created (in Tutorial 1)
h) Review and Launch the instance
i) Select the existing key pair
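The same launch can also be scripted with the AWS CLI instead of the console. A minimal sketch, using the AMI ID from Tutorial 1 (or your own image ID) and placeholders for the values selected above; the hyphenated Name tag simply mirrors the “p3 – Node 2” tag used in the console:
# Launch the second node from the chosen AMI (all angle-bracket values are placeholders)
aws ec2 run-instances \
  --image-id ami-0e22bababb010e6c5 \
  --instance-type p3.2xlarge \
  --key-name <your key pair name> \
  --subnet-id <your default us-east-1 subnet id> \
  --security-group-ids <your security group id> \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=p3-Node-2}]'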
a) In EC2 Dashboard, go to “Security Groups” and click “Create security group”
b) Add a security group name and description, and set the inbound and outbound rules to “All traffic”
c) Press “Create Security Group”
d) Attach this security group to both nodes, p3 – Node 1 and p3 – Node 2
e) Add the new security group without removing the previous one
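As a command-line alternative, the security group can be created and attached with the AWS CLI. A minimal sketch with placeholder names and IDs; note that modify-instance-attribute replaces an instance's whole security group list, so both the old and the new group must be listed:
# Create the security group (placeholder name and VPC ID)
aws ec2 create-security-group --group-name <your security group name> --description "All traffic between training nodes" --vpc-id <your vpc id>
# Allow all inbound traffic (outbound is open by default)
aws ec2 authorize-security-group-ingress --group-id <new security group id> --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
# Attach both the previous and the new security group to each node
aws ec2 modify-instance-attribute --instance-id <Node 1 instance id> --groups <previous security group id> <new security group id>
aws ec2 modify-instance-attribute --instance-id <Node 2 instance id> --groups <previous security group id> <new security group id>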
a) Copy the IPv4 Public IP from Node 1
b) Copy the Private IP from Node 2
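If you prefer the CLI, both addresses can be read with describe-instances; the Name-tag values below assume the hyphenated tags from the CLI sketch above (use whatever Name tags you actually set):
# Public IPv4 address of Node 1
aws ec2 describe-instances --filters Name=tag:Name,Values=p3-Node-1 --query "Reservations[].Instances[].PublicIpAddress" --output text
# Private IP address of Node 2
aws ec2 describe-instances --filters Name=tag:Name,Values=p3-Node-2 --query "Reservations[].Instances[].PrivateIpAddress" --output text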
a) Move to the directory where you downloaded the key pair (*.pem). Throughout, replace the placeholder text in angle brackets with your own information
cd <key_pair_directory>
b) Copy the key pair to your instance using SCP
scp -i <your .pem filename> <your .pem filename> ec2-user@<your instance IPv4 Public IP>:/home/ec2-user/examples/horovod/tensorflow/
c) Connect to Node 1 using SSH
ssh -i <your .pem filename> ec2-user@<your instance IPv4 Public IP from Node 1>
a) Move to the following folder:
cd ~/examples/horovod/tensorflow
b) Use vim to edit the hosts file
vim hosts
The file should contain:
localhost slots=1
<Private IP from Node 2> slots=1
c) Add the SSH key used by the member instances to the ssh-agent
eval `ssh-agent -s`
ssh-add <your .pem filename>
d) Now, run the script to start training the model
./train.sh 2
e) After a few seconds you will see the results
Avg Speed: 200
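Under the hood, launcher scripts like train.sh typically just start one training process per GPU across the machines listed in the hosts file, using an MPI-style launcher. A rough, hypothetical sketch of an equivalent launch (the entry-point name train.py is a placeholder, not the actual script shipped in the AMI):
# Two processes total (one per GPU), split across the two hosts from the hosts file
horovodrun -np 2 -H localhost:1,<Private IP from Node 2>:1 python train.py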
If you want to try more nodes and/or GPUs, modify the hosts file with the number of slots in each node, and then pass the total number of GPUs when you run the script (./train.sh <num of GPUs>), as shown in the example below.
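For example, assuming two p3.8xlarge nodes (4 GPUs each), the hosts file would become:
localhost slots=4
<Private IP from Node 2> slots=4
and the script would be launched with the total of 8 GPUs:
./train.sh 8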
Before starting, it is important to note from the NCCL debug output that for instance types other than the p3dn.24xlarge, the EFA provider is not supported.
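You can check this yourself by asking NCCL for verbose output; NCCL_DEBUG=INFO is a standard NCCL environment variable, and the grep below is only an illustration of what to look for in the log:
# Enable verbose NCCL logging and keep a copy of the output
# (NCCL_DEBUG may need to be set on, or forwarded to, both nodes, depending on the launcher)
export NCCL_DEBUG=INFO
./train.sh 2 2>&1 | tee train.log
grep -i efa train.log   # on p3dn.24xlarge the NCCL/OFI lines should mention the EFA provider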
Type: Multi Node – Multi GPU
Number of Instances: 2
Instance: p3.8xlarge
Result: Speed / 50 Steps: ~770

Type: Single Node – Multi GPU
Number of Instances: 1
Instance: p3.16xlarge
Result: Speed / 50 Steps: ~910

Type: Multi Node – Multi GPU
Number of Instances: 2
Instance: p3.16xlarge
Result: Speed / 50 Steps: ~11500

Type: Multi Node – Multi GPU
Number of Instances: 4
Instance: p3.16xlarge
Result: Speed / 50 Steps: ~22500

Type: Single Node – Multi GPU
Number of Instances: 1
Instance: p3dn.24xlarge
Result: Speed / 50 Steps: ~700

Type: Multi Node – Multi GPU
Number of Instances: 2
Instance: p3dn.24xlarge
Result: Speed / 50 Steps: ~11180

Type: Multi Node – Multi GPU
Number of Instances: 2
Instance: p3dn.24xlarge
Now that we are using p3dn instances, let us look at the NCCL debug output to confirm that the EFA provider is enabled:
Result: Speed / 50 Steps: ~11450

Type: Multi Node – Multi GPU
Number of Instances: 4
Instance: p3dn.24xlarge
Result: Speed / 50 Steps: ~22200
Clearly, there are a lot of considerations and factors to manage when deploying a machine learning model – or a fleet of machine learning models. Six Nines can help! If you would like to engage professionals to help your organization get its machine learning ambitions on the rails, contact us at sales@sixninesit.com.