GPU Compute Resources

GHPC provides GPU resources through the Slurm partition:

ghpc_gpu

The GPU partition is intended for GPU-accelerated workloads such as machine learning and deep learning (for example with PyTorch or TensorFlow) and other CUDA-based applications.


Available GPU Nodes

The ghpc_gpu partition currently contains the following GPU nodes:

Node    GPU configuration    GPU memory       System memory
gpu01   1 × NVIDIA L40S      48 GB            185 GB
gpu02   2 × NVIDIA L40S      48 GB per GPU    383 GB

Users should not normally request a specific GPU node. Slurm automatically selects a suitable available node based on the requested GPU and memory resources.

For example, use:

#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1

Avoid forcing a node unless there is a specific reason:

#SBATCH --nodelist=gpu01

or:

#SBATCH --nodelist=gpu02

Requesting a GPU Job

To request one GPU in a batch job:

#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1

To request two GPUs (currently only gpu02 has two):

#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:2

A job requesting two GPUs can only run on a node with at least two available GPUs, which currently means gpu02.


Important: GPU Visibility in Slurm Jobs

When a job requests a GPU, Slurm controls which GPU devices are visible inside the job.

For example:

#SBATCH --gres=gpu:1

will expose only one GPU to the job, even if the physical node has more GPUs.

Inside the job, you can check this with:

echo $CUDA_VISIBLE_DEVICES

For a one-GPU job, this may show:

0

For a two-GPU job, this may show:

0,1

This is expected behaviour.
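
The same check can be made from Python without importing a GPU library. The following is a minimal sketch, not an official tool: the helper name visible_gpu_ids is hypothetical, and it only splits the value of CUDA_VISIBLE_DEVICES into its comma-separated entries.

```python
import os


def visible_gpu_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings.

    Returns None when the variable is not set at all.
    (Hypothetical helper for illustration; not part of Slurm or CUDA.)
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are comma-separated; ignore empty pieces and stray spaces.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is not set")
    else:
        print(f"{len(ids)} GPU(s) visible to this job: {ids}")
```

Run inside a job, the number of entries should match the number of GPUs requested with --gres.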


Interactive GPU Session

For testing or debugging, users can start an interactive GPU session:

srun -p ghpc_gpu --gres=gpu:1 --mem=8G --pty bash

Then check which node was assigned:

hostname

Check GPU visibility:

echo $CUDA_VISIBLE_DEVICES
nvidia-smi

Exit the session when finished:

exit

To request two GPUs interactively:

srun -p ghpc_gpu --gres=gpu:2 --mem=16G --pty bash


Recommended GPU Batch Script

Below is a recommended Slurm batch script for GPU jobs.

#!/bin/bash
#--------------------------------------------------------------------------#
# Job Specifications for GPU Usage
#--------------------------------------------------------------------------#
#SBATCH -p ghpc_gpu                    # GPU partition
#SBATCH -N 1                           # Number of nodes
#SBATCH -n 1                           # Number of tasks
#SBATCH --mem=8192                     # Memory in MiB, e.g. 8192 = 8 GiB
#SBATCH --gres=gpu:1                   # Request 1 GPU
#SBATCH -J gpu_job                     # Job name
#SBATCH --output=slurm_%x_%A.out       # STDOUT
#SBATCH --error=slurm_%x_%A.err        # STDERR
#SBATCH -t 1:00:00                     # Max runtime: HH:MM:SS

set -euo pipefail

#--------------------------------------------------------------------------#
# Temporary directory
#--------------------------------------------------------------------------#
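TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"
# With "set -e" the script stops at the first failing command and would
# never reach the cleanup at the end, so also remove the directory on exit.
trap 'rm -rf "$TMPDIR"' EXIT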

#--------------------------------------------------------------------------#
# Activate Python environment
#--------------------------------------------------------------------------#
# Example shared Python environment:
source /usr/lib/python3.11/venv/bin/activate

# If you use your own Conda environment, use something like:
# source /path/to/miniconda3/etc/profile.d/conda.sh
# conda activate my_environment

#--------------------------------------------------------------------------#
# GPU and environment debugging information
#--------------------------------------------------------------------------#
echo "Running on node: $(hostname)"
echo "SLURM job ID: $SLURM_JOBID"
echo "SLURM node list: $SLURM_JOB_NODELIST"
echo "TMPDIR: $TMPDIR"
echo "Python: $(which python)"
echo "Python version: $(python --version)"
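# Use defaults here: with "set -u", printing an unset variable aborts the script.
# SLURM_STEP_GPUS in particular is typically set only inside srun job steps.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-<unset>}"
echo "SLURM_STEP_GPUS=${SLURM_STEP_GPUS:-<unset>}"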

python - <<'PY'
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Torch version:", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
PY

#--------------------------------------------------------------------------#
# Run GPU workload
#--------------------------------------------------------------------------#
# Recommended: start the Python workload through srun
srun python my_gpu_script.py

#--------------------------------------------------------------------------#
# Cleanup
#--------------------------------------------------------------------------#
cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"

Why Use srun Inside a GPU Batch Job?

In GPU batch jobs, it is recommended to start the actual workload using srun, for example:

srun python my_gpu_script.py

instead of:

python my_gpu_script.py

srun launches the workload as a Slurm job step. This helps ensure that Slurm applies the allocated job resources to the workload, including GPU-related environment variables such as SLURM_STEP_GPUS.
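
One way to see the difference is to run a small script once with srun and once without, and compare what it prints. The sketch below is illustrative only: the helper name slurm_gpu_env is hypothetical, the variable names are the ones printed by the batch script above, and which of them are set depends on the Slurm configuration.

```python
import os

# Variable names taken from the debugging section of the batch script.
GPU_VARS = ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_STEP_GPUS")


def slurm_gpu_env(env=None):
    """Map each GPU-related variable to its value, or None if it is unset.

    (Hypothetical helper for illustration; not part of Slurm.)
    """
    env = os.environ if env is None else env
    return {name: env.get(name) for name in GPU_VARS}


if __name__ == "__main__":
    for name, value in slurm_gpu_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Saved under a name of your choice, the script can be started both ways from the same batch job to compare the two outputs.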


PyTorch GPU Test

To check whether PyTorch can see the GPU inside a Slurm job:

python - <<'PY'
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Torch version:", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
PY

For a one-GPU job, the output should include:

CUDA available: True
GPU count: 1
GPU name: NVIDIA L40S

For a two-GPU job, GPU count should show:

GPU count: 2

Common Reasons PyTorch Does Not Use the GPU

If nvidia-smi shows no GPU process, or if PyTorch reports:

torch.cuda.is_available()
False

check the following:

  1. The job was submitted with a GPU request:

    #SBATCH --gres=gpu:1
    
  2. The Python workload is started with srun:

    srun python my_gpu_script.py
    
  3. The correct Python or Conda environment is activated.

  4. The installed PyTorch version supports CUDA.

  5. The Python code explicitly moves the model and data to CUDA, for example:

    import torch
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    

    Input tensors also need to be moved to the GPU, for example:

    x = x.to(device)
    y = y.to(device)
    

Temporary Directory on /scratch

GPU jobs should use local scratch space for temporary files:

TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"

At the end of the job, remove the temporary directory:

cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"

Do not store important long-term data only in /scratch. Copy final results back to your project or home directory before the job finishes.
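
On the Python side, the standard tempfile module uses the TMPDIR environment variable when it is set, so temporary files created through it land in the scratch directory exported above. The sketch below shows one possible pattern, assuming a hypothetical helper save_result and a hypothetical results directory: write the file in scratch, copy it to a permanent location, then remove the scratch copy.

```python
import shutil
import tempfile
from pathlib import Path


def save_result(text, final_dir, name="result.txt"):
    """Write text to a scratch file, then copy it to final_dir.

    (Hypothetical helper for illustration; names and paths are assumptions.)
    """
    # mkdtemp() creates the directory under $TMPDIR when that variable is set.
    work_dir = Path(tempfile.mkdtemp(prefix="work_"))
    try:
        tmp_file = work_dir / name
        tmp_file.write_text(text)

        final_dir = Path(final_dir)
        final_dir.mkdir(parents=True, exist_ok=True)
        # Copy the finished file out of scratch before it is cleaned up.
        return Path(shutil.copy2(tmp_file, final_dir / name))
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)


# Example use inside a job (the target directory is an assumption):
#   save_result("accuracy=...\n", "results")
```

Copying at the end of each major step, rather than only once at the very end, limits what is lost if the job hits its time limit.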


Example: One-GPU PyTorch Job

#!/bin/bash
#SBATCH -p ghpc_gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH -J pytorch_test
#SBATCH --output=slurm_%x_%A.out
#SBATCH --error=slurm_%x_%A.err
#SBATCH -t 1:00:00

set -euo pipefail

TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"

source /usr/lib/python3.11/venv/bin/activate

echo "Running on node: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

srun python - <<'PY'
import torch
import os

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
PY

cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"

Example: Two-GPU Job

To request both GPUs on a suitable node:

#!/bin/bash
#SBATCH -p ghpc_gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=32G
#SBATCH --gres=gpu:2
#SBATCH -J two_gpu_test
#SBATCH --output=slurm_%x_%A.out
#SBATCH --error=slurm_%x_%A.err
#SBATCH -t 1:00:00

set -euo pipefail

TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"

source /usr/lib/python3.11/venv/bin/activate

echo "Running on node: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

srun python - <<'PY'
import torch
import os

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
PY

cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"

Checking Job Status

Show GPU jobs:

squeue -p ghpc_gpu

Show GPU nodes:

sinfo -N -p ghpc_gpu -o "%N %P %T %c %m %e %G"

Show detailed information about a job:

scontrol show job <JOBID>

Show detailed information about a GPU node:

scontrol show node gpu01
scontrol show node gpu02
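
For scripted monitoring, the sinfo output can be split into named fields. The following is a rough sketch under stated assumptions: the helper parse_sinfo and the field names are hypothetical, the sample text is made up for illustration and is not real cluster output, and the input is assumed to come from the sinfo command above with -h added to suppress the header line.

```python
# Field names chosen for readability; they follow the order of the
# format string "%N %P %T %c %m %e %G" used in the sinfo command.
FIELDS = ("node", "partition", "state", "cpus",
          "memory_mb", "free_mem_mb", "gres")


def parse_sinfo(text):
    """Turn sinfo output lines into a list of dicts, one per node.

    (Hypothetical helper for illustration; not part of Slurm.)
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, parts)))
    return rows


# Made-up sample for illustration only; real values will differ.
SAMPLE = """\
gpu01 ghpc_gpu idle 32 185000 170000 gpu:1
gpu02 ghpc_gpu mixed 64 383000 250000 gpu:2
"""

if __name__ == "__main__":
    for row in parse_sinfo(SAMPLE):
        print(row["node"], row["state"], row["gres"])
```

In practice the text would come from running sinfo, for example through subprocess.run, rather than from a hard-coded sample.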

Summary

Use the GPU partition with:

#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1

For most jobs, do not force a specific GPU node. Let Slurm choose the appropriate node automatically.

For GPU batch jobs, start the Python workload with:

srun python my_gpu_script.py

Use the CUDA/PyTorch test commands above to verify that the GPU is visible inside the job.