GPU Compute Resources
GHPC provides GPU resources through the Slurm partition:
ghpc_gpu
The GPU partition can be used for workloads such as machine learning, deep learning, PyTorch, TensorFlow, CUDA-based applications, and other GPU-accelerated jobs.
Available GPU Nodes
The ghpc_gpu partition currently contains the following GPU nodes:
| Node | GPU configuration | GPU memory | System memory |
|---|---|---|---|
| gpu01 | 1 × NVIDIA L40S | 48 GB | 185 GB |
| gpu02 | 2 × NVIDIA L40S | 48 GB per GPU | 383 GB |
Users normally should not request a specific GPU node. Slurm will automatically select a suitable available GPU node based on the requested GPU and memory resources.
For example, use:
#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1
Avoid forcing a node unless there is a specific reason:
#SBATCH --nodelist=gpu01
or:
#SBATCH --nodelist=gpu02
Requesting a GPU Job
To request one GPU in a batch job:
#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1
To request two GPUs, for example on gpu02:
#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:2
A job requesting two GPUs can only run on a node with at least two free GPUs. In the current partition, that is gpu02.
Important: GPU Visibility in Slurm Jobs
When a job requests a GPU, Slurm controls which GPU devices are visible inside the job.
For example:
#SBATCH --gres=gpu:1
will expose only one GPU to the job, even if the physical node has more GPUs.
Inside the job, you can check this with:
echo $CUDA_VISIBLE_DEVICES
For a one-GPU job, this may show:
0
For a two-GPU job, this may show:
0,1
This is expected behaviour.
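As a rough illustration of the behaviour above, the number of GPUs exposed to a job can be derived from the comma-separated list in CUDA_VISIBLE_DEVICES. The following is a minimal sketch: count_visible_gpus is a hypothetical helper name, not a Slurm or CUDA command, and it assumes the variable holds a plain comma-separated list as in the examples above.

```shell
#!/bin/bash
# count_visible_gpus: hypothetical helper (not part of Slurm or CUDA).
# Prints how many entries a comma-separated device list contains.
count_visible_gpus() {
    devices="${1:-}"
    if [ -z "$devices" ]; then
        # Empty or unset list: no GPU is visible to the job.
        echo 0
        return 0
    fi
    # Turn commas into newlines and count the resulting lines.
    printf '%s\n' "$devices" | tr ',' '\n' | wc -l | tr -d ' '
}

echo "Visible GPUs: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Inside a job submitted with --gres=gpu:2, this should report two visible GPUs. Outside a GPU job, where the variable is usually unset, it reports zero.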
Interactive GPU Session
For testing or debugging, users can start an interactive GPU session:
srun -p ghpc_gpu --gres=gpu:1 --mem=8G --pty bash
Then check which node was assigned:
hostname
Check GPU visibility:
echo $CUDA_VISIBLE_DEVICES
nvidia-smi
Exit the session when finished:
exit
To request two GPUs interactively:
srun -p ghpc_gpu --gres=gpu:2 --mem=16G --pty bash
Recommended GPU Batch Job Template
Below is a recommended Slurm batch script for GPU jobs.
#!/bin/bash
#--------------------------------------------------------------------------#
# Job Specifications for GPU Usage
#--------------------------------------------------------------------------#
#SBATCH -p ghpc_gpu # GPU partition
#SBATCH -N 1 # Number of nodes
#SBATCH -n 1 # Number of tasks
#SBATCH --mem=8192 # Memory in MiB, e.g. 8192 = 8 GiB
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH -J gpu_job # Job name
#SBATCH --output=slurm_%x_%A.out # STDOUT
#SBATCH --error=slurm_%x_%A.err # STDERR
#SBATCH -t 1:00:00 # Max runtime: HH:MM:SS
set -euo pipefail
#--------------------------------------------------------------------------#
# Temporary directory
#--------------------------------------------------------------------------#
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"
#--------------------------------------------------------------------------#
# Activate Python environment
#--------------------------------------------------------------------------#
# Example shared Python environment:
source /usr/lib/python3.11/venv/bin/activate
# If you use your own Conda environment, use something like:
# source /path/to/miniconda3/etc/profile.d/conda.sh
# conda activate my_environment
#--------------------------------------------------------------------------#
# GPU and environment debugging information
#--------------------------------------------------------------------------#
echo "Running on node: $(hostname)"
echo "SLURM job ID: $SLURM_JOBID"
echo "SLURM node list: $SLURM_JOB_NODELIST"
echo "TMPDIR: $TMPDIR"
echo "Python: $(which python)"
echo "Python version: $(python --version)"
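# Use ${VAR:-unset} so that "set -u" does not abort the job when a variable
# is not defined (SLURM_STEP_GPUS, for example, is normally set only inside srun steps).
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"
echo "SLURM_STEP_GPUS=${SLURM_STEP_GPUS:-unset}"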
python - <<'PY'
import os
import torch
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Torch version:", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
PY
#--------------------------------------------------------------------------#
# Run GPU workload
#--------------------------------------------------------------------------#
# Recommended: start the Python workload through srun
srun python my_gpu_script.py
#--------------------------------------------------------------------------#
# Cleanup
#--------------------------------------------------------------------------#
cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"
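The --mem value in the template is a plain number, which the template comment describes as MiB. The conversion from GiB can be sketched as below, assuming 1 GiB = 1024 MiB. gib_to_mib is a hypothetical helper used only for illustration.

```shell
#!/bin/bash
# gib_to_mib: hypothetical helper, converts whole GiB to MiB (1 GiB = 1024 MiB).
gib_to_mib() {
    echo $(( $1 * 1024 ))
}

# Values matching the memory requests used in this document.
for gib in 8 16 32; do
    echo "${gib} GiB -> --mem=$(gib_to_mib "$gib")"
done
```

A unit suffix, as used in the other examples in this document (for example --mem=16G), avoids the arithmetic altogether.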
Why Use srun Inside a GPU Batch Job?
In GPU batch jobs, it is recommended to start the actual workload using srun, for example:
srun python my_gpu_script.py
instead of:
python my_gpu_script.py
Using srun helps ensure that Slurm correctly applies the allocated job resources, including GPU-related environment variables.
PyTorch GPU Test
To check whether PyTorch can see the GPU inside a Slurm job:
python - <<'PY'
import os
import torch
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Torch version:", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
PY
Expected output should include:
CUDA available: True
GPU count: 1
GPU name: NVIDIA L40S
For a two-GPU job, GPU count should show:
GPU count: 2
Common Reasons PyTorch Does Not Use the GPU
If nvidia-smi shows no GPU process, or if PyTorch reports:
torch.cuda.is_available()
False
check the following:
- The job was submitted with a GPU request:
  #SBATCH --gres=gpu:1
- The Python workload is started with srun:
  srun python my_gpu_script.py
- The correct Python or Conda environment is activated.
- The installed PyTorch version supports CUDA.
- The Python code explicitly moves the model and data to CUDA, for example:
  import torch
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = model.to(device)
  Input tensors also need to be moved to the GPU, for example:
  x = x.to(device)
  y = y.to(device)
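The environment-related items in the checklist above can be checked quickly from the shell before starting Python. This is a minimal sketch, assuming it is run inside the job. gpu_preflight is a hypothetical helper name, and it does not verify that PyTorch itself was built with CUDA support.

```shell
#!/bin/bash
# gpu_preflight: hypothetical helper that inspects the job environment.
# It returns non-zero if any of the basic checks fails.
gpu_preflight() {
    status=0
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "WARN: CUDA_VISIBLE_DEVICES is empty; was --gres=gpu:N requested?"
        status=1
    else
        echo "OK: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "OK: nvidia-smi found"
    else
        echo "WARN: nvidia-smi not found on PATH"
        status=1
    fi
    if command -v python >/dev/null 2>&1; then
        echo "OK: python is $(command -v python)"
    else
        echo "WARN: python not found; is the environment activated?"
        status=1
    fi
    return "$status"
}

gpu_preflight || echo "Preflight reported problems; see warnings above."
```

If all three checks pass and PyTorch still reports no GPU, the remaining suspects are the installed PyTorch build and the code that moves the model and tensors to the device.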
Temporary Directory on /scratch
GPU jobs should use local scratch space for temporary files:
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"
At the end of the job, remove the temporary directory:
cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"
Do not store important long-term data only in /scratch. Copy final results back to your project or home directory before the job finishes.
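One limitation of removing the directory on the last line of the script is that the cleanup is skipped if the script exits early, for example under set -e. A possible variation, sketched below, registers the cleanup with trap so that it runs whenever the shell exits. cleanup_tmpdir is a hypothetical name, and the /tmp fallback is an assumption added only so the sketch can run outside Slurm.

```shell
#!/bin/bash
# cleanup_tmpdir: hypothetical helper that removes the job's temporary directory.
cleanup_tmpdir() {
    # Guard against an empty TMPDIR so rm never runs on an unintended path.
    if [ -n "${TMPDIR:-}" ] && [ -d "$TMPDIR" ]; then
        rm -rf "$TMPDIR"
    fi
}

if [ -n "${SLURM_JOBID:-}" ]; then
    # Inside a Slurm job: same layout as described in this section.
    TMPDIR="/scratch/$USER/$SLURM_JOBID"
else
    # Assumption for illustration: outside Slurm, fall back to a path under /tmp.
    TMPDIR="/tmp/${USER:-nobody}/demo_$$"
fi
export TMPDIR
mkdir -p "$TMPDIR"

# Run the cleanup when the script exits, whether it succeeded or failed.
trap cleanup_tmpdir EXIT

echo "Using TMPDIR=$TMPDIR"
```

The trap does not help if the job is killed outright, so results should still be copied out of /scratch before the end of the job.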
Example: One-GPU PyTorch Job
#!/bin/bash
#SBATCH -p ghpc_gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH -J pytorch_test
#SBATCH --output=slurm_%x_%A.out
#SBATCH --error=slurm_%x_%A.err
#SBATCH -t 1:00:00
set -euo pipefail
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"
source /usr/lib/python3.11/venv/bin/activate
echo "Running on node: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
srun python - <<'PY'
import torch
import os
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
PY
cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"
Example: Two-GPU Job
To request both GPUs on a suitable node:
#!/bin/bash
#SBATCH -p ghpc_gpu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=32G
#SBATCH --gres=gpu:2
#SBATCH -J two_gpu_test
#SBATCH --output=slurm_%x_%A.out
#SBATCH --error=slurm_%x_%A.err
#SBATCH -t 1:00:00
set -euo pipefail
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p "$TMPDIR"
source /usr/lib/python3.11/venv/bin/activate
echo "Running on node: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
srun python - <<'PY'
import torch
import os
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
PY
cd "$SLURM_SUBMIT_DIR"
rm -rf "$TMPDIR"
Checking Job Status
Show GPU jobs:
squeue -p ghpc_gpu
Show GPU nodes:
sinfo -N -p ghpc_gpu -o "%N %P %T %c %m %e %G"
Show detailed information about a job:
scontrol show job <JOBID>
Show detailed information about a GPU node:
scontrol show node gpu01
scontrol show node gpu02
Summary
Use the GPU partition with:
#SBATCH -p ghpc_gpu
#SBATCH --gres=gpu:1
For most jobs, do not force a specific GPU node. Let Slurm choose the appropriate node automatically.
For GPU batch jobs, start the Python workload with:
srun python my_gpu_script.py
Use the CUDA/PyTorch test commands above to verify that the GPU is visible inside the job.