Working Remotely with GPU resources
The naplab has several GPU resources connected to it. This page gives a short introduction to working with these GPU resources remotely.
Resources:
Nap01 Server
HPC IDUN Cluster (hpc.ntnu.no)
Rules
For the nap01 server, please follow these rules:
Use nvidia-docker to run jobs
When starting a docker container, name the container with {ntnu_username}_...
When creating a docker image, name the image {ntnu_username}/image_name
ALWAYS check nvidia-smi to be certain that nobody is already using the GPU you want to use
Connecting to nap01.idi.ntnu.no
You can connect to the server by using ssh:
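For example, replacing ntnu_username with your own NTNU username:

```shell
ssh ntnu_username@nap01.idi.ntnu.no
```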
Nap01 has two NVIDIA V100-32GB GPUs and 2x Intel Xeon Gold 6132 CPUs.
Storage
There are two places to store data on the server:
/lhome/ntnu_username: This is a 1TB disk from which you should launch your programs. However, you should NOT store large amounts of data on this disk! This disk is also backed up.
/work/ntnu_username: This disk is for storing larger datasets. If you don't have a directory there, contact Frank or Håkon.
Useful Commands
df -h: View disk space on the server
htop: View cpu/RAM usage on the server
nvidia-smi: View VRAM/GPU usage on the V100 cards.
Using docker
Docker tips & tricks
Docker commands can become very long, with several static settings. To make your life easier, you can create a simple python script to start a docker container. For example:
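A minimal sketch of such a script is shown below. The container name, image name, and user ID are placeholders you must replace with your own values, and the `--gpus` flag assumes docker >= 19.03; on older setups you may need `nvidia-docker run` instead.

```python
#!/usr/bin/env python3
"""Sketch of a run_docker helper script (all names/IDs below are placeholders).

Usage: ./run_docker <gpu_id|none> <command ...>
"""
import os
import subprocess
import sys

DOCKER_NAME = "haakohu_job"    # must start with {ntnu_username}_
IMAGE = "haakohu/pytorch"      # {ntnu_username}/image_name
USER_ID = "1550"               # output of: id -u ntnu_username


def build_cmd(gpu_id, command):
    """Build the full docker command line."""
    cmd = [
        "docker", "run", "--rm", "-it",
        "--name", DOCKER_NAME,
        "-u", USER_ID,                        # avoid writing files as root
        "-v", f"{os.getcwd()}:/workspace",    # mount current dir into the container
    ]
    if gpu_id is not None:
        cmd += ["--gpus", f"device={gpu_id}"]  # assumes docker >= 19.03
    return cmd + [IMAGE] + list(command)


if __name__ == "__main__" and len(sys.argv) > 2:
    gpu = None if sys.argv[1] == "none" else sys.argv[1]
    subprocess.run(build_cmd(gpu, sys.argv[2:]), check=False)
```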
There are a couple of important settings to change here:
Change docker_name to your NTNU username: {NTNU-USERNAME}_...
Change the -u argument in the cmd list. You can find your user ID by logging onto the server and running id -u ntnu_username, for example id -u haakohu. This prevents the docker container from saving files as administrator (root), which can easily mess up your project files.
Change the -v argument to mount folders. In the script, we only mount your current directory to /workspace in the docker container. If you need to mount something else, you can add several -v arguments.
Change the docker image if you want a different base image.
Save this with the filename run_docker and make it executable by running:
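For example:

```shell
chmod +x run_docker
```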
Then you can start the training script on GPU id 0:
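With the hypothetical helper script above and a placeholder training script train.py, this could look like:

```shell
./run_docker 0 python train.py
```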
If you want to start a job without GPU, you can run:
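With the hypothetical helper sketched above (where "none" is the sketch's convention for no GPU), a CPU-only job could be started as:

```shell
./run_docker none python train.py
```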
This will execute the following docker cmd:
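The resulting command would look something like the following (container name, user ID, and image are placeholders):

```shell
docker run --rm -it --name haakohu_job -u 1550 -v "$(pwd)":/workspace haakohu/pytorch python train.py
```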
Pre-built docker images
Nvidia GPU Cloud has several pre-built docker images for Nvidia systems:
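These images are hosted on the nvcr.io registry. For example, to pull a PyTorch image (pick a concrete tag from the NGC catalog; the tag below is only an example):

```shell
docker pull nvcr.io/nvidia/pytorch:23.10-py3
```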
Mounting a server disk to your local filesystem
Working remotely can be a hassle without mounting the remote filesystem. If you mount a remote folder on your local computer, you can use your favorite text editor to work on it.
We recommend using sshfs:
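For example, to mount your home directory on the server into a local folder (replace ntnu_username and the mount point with your own):

```shell
mkdir -p ~/nap01
sshfs ntnu_username@nap01.idi.ntnu.no:/lhome/ntnu_username ~/nap01
# When you are done, unmount with:
#   fusermount -u ~/nap01   (Linux)
#   umount ~/nap01          (macOS)
```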
Storage folders on nap01.idi.ntnu.no
For larger datasets, we recommend storing your data on a different disk than the main SSD. This can be found under /work/ntnu_username. To get a directory for your username, contact Håkon.
Utilizing the full potential of V100 cards
The V100 cards are extremely powerful and require optimized code to realize their full computing potential.
1. nvidia-smi
You can see the utilization of the GPUs by running watch -n 0.5 nvidia-smi. Your code should keep the GPUs at 90%+ utilization most of the time.
2. Utilizing tensor cores
The V100 has 640 tensor cores, which come with some strict requirements. Most DL libraries run your operations on tensor cores automatically if you satisfy the following requirements:
The number of filters in your CNN is divisible by 8
Your batch size is divisible by 8
Your parameters/input data are in 16-bit floating point (FP16)
The first two requirements are rather easy to satisfy; however, training a CNN in 16-bit floating point is hard. To train your network properly with 16-bit floats, you have to use mixed precision training. We recommend the following two resources to get started with this:
With my code, I got a 220% speed-up without losing any performance.
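As an illustration, a minimal mixed precision training step in PyTorch (>= 1.6) can be sketched with torch.cuda.amp; the model, data, and loss below are placeholders, and the block requires a CUDA-capable GPU:

```python
# Sketch of mixed precision training with torch.cuda.amp.
# Model, data, and loss are dummy placeholders.
import torch

model = torch.nn.Conv2d(16, 32, 3).cuda()   # 16/32 filters: divisible by 8
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(8, 16, 64, 64, device="cuda")  # batch size divisible by 8
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # runs eligible ops in float16
        loss = model(inputs).square().mean()
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```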
3. Profiling your code
If your code is running slow and you can't find the bottleneck, profiling is your best friend.
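For example, Python's built-in cProfile can show you where the time goes (the function here is a dummy workload standing in for your training step):

```python
# Minimal profiling sketch using Python's built-in cProfile.
import cProfile
import io
import pstats


def slow_function():
    # Dummy workload standing in for your training step.
    return sum(i * i for i in range(200_000))


profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Print the 10 most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```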