Running batch jobs

The most common use case of Slurm for its users is managing batch jobs.

What are batch jobs?

A batch job, usually in the form of a shell script, requests computing resources and specifies the application(s) to launch on those resources, along with any input data/options and output directives. As a workload manager, Slurm is expected to fulfill the job's request at the soonest available time, constrained only by resource availability and user limits.

Anatomy of a Slurm batch job

A batch job is a shell script that consists of two parts: resource requests and job steps.

The resource requests section specifies the number or amount of resources to be allocated for the execution of the job. A typical set of resources includes the number of CPU cores, the amount of memory/RAM, the maximum time the job is expected to run, where to write the results of the job, and so on.

The job steps section is essentially a bash script that describes the sequence of tasks needed to get the user's work done.

An example of a batch job would be as follows:

#!/bin/bash
#--------------------------------------------------------------------------#
#              Edit Job specifications                                     #    
#--------------------------------------------------------------------------#
#SBATCH -p ghpc                    # Name of the queue
#SBATCH -N 1                       # Number of nodes (DO NOT CHANGE)
#SBATCH -n 1                       # Number of CPU cores
#SBATCH --mem=1024                 # Memory in MiB (10 GiB = 10 * 1024 MiB)
#SBATCH -J template_job            # Name of the job
#SBATCH --output=slurm_%x_%A.out   # STDOUT
#SBATCH --error=slurm_%x_%A.err    # STDERR
#SBATCH -t 1:00:00                 # Job max time - Format: MM, MM:SS, HH:MM:SS, DD-HH, or DD-HH:MM
# Create a temporary directory for the job in local storage - DO NOT CHANGE #
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p $TMPDIR
#=========================================================================#
#         Your job script                                                 #
#=========================================================================#
# Replace the following with your work to be executed. 
echo "Job started at $(date '+%d_%m_%y_%H_%M_%S')"
echo " Step 1: Generating and sorting random numbers"
for i in {1..500000}; do
echo $RANDOM >> SomeRandomNumbers.txt
done
echo "Job completed at $(date '+%d_%m_%y_%H_%M_%S')"

#=========================================================================#
#         Cleanup  DO NOT REMOVE OR CHANGE                                #
#=========================================================================#
cd $SLURM_SUBMIT_DIR
rm -rf /scratch/$USER/$SLURM_JOBID

Where,

ghpc is the name of the queue that this job is sent to. Refer to the Job queues section of this guide to learn more about the different queues and the associated hardware.

-N denotes the number of nodes (dedicated physical servers) requested by this job. GHPC is not configured to run multi-node parallel jobs at the moment; submitting jobs with N > 1 will make them fail.

-n denotes the number of CPU cores, or hyperthreads, requested by this job. The idea of hyperthreads is discussed later on this page.

--mem denotes the amount of memory to be reserved for this job. If the job tries to use more than this amount, Slurm will kill it and report an "out of memory" error. Knowing how much memory a job needs can be tricky; this is discussed later on this page.
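
For example, to reserve 10 GiB for a job, you could write the following (a sketch; adjust the value to your own job's needs):

#SBATCH --mem=10240                # 10 GiB = 10 * 1024 MiB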

-J denotes the name of the job, used for identification purposes in the output of several commands. It does not need to be unique, but ideally it should be human-friendly and convey the purpose of the job.

--output specifies where to write the standard output recorded while executing this script; %x expands to the job name and %A to the job (allocation) ID. The echo statements in the script write to stdout and will be redirected to the specified file.

--error specifies where to write the standard error stream, similar to --output.

-t specifies the maximum time that this job is expected to hold on to the requested resources. If the job completes before the time limit is reached, the resources are freed automatically. If the job is not complete when the time limit is reached, Slurm will kill it wherever it happens to be in its execution.
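
For instance, the following are all valid ways of expressing a time limit (the values are only illustrations; pick ones that reflect your own job):

#SBATCH -t 120                     # 120 minutes
#SBATCH -t 2:00:00                 # 2 hours
#SBATCH -t 1-12:00                 # 1 day and 12 hours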

The lines with TMPDIR create a temporary directory on local storage for your job. As a rule of thumb, leave those lines (and the two lines at the end with the rm -rf command) as they are in all scripts you run.

Job queues

GHPC is busy running users' jobs and may not be able to execute incoming jobs right away due to resource availability or resource limits. In that case, the job waits in a queue (called a partition in Slurm terminology). A queue is associated with a specific set of dedicated or shared compute node resources. Queues are defined by your sysadmin and are configured with a specific set of limits, such as job size, maximum run time, which users are allowed to run jobs in that queue, and so on.

Currently, the following queues are defined in GHPC:

  • ghpc
  • zen4
  • nav
  • nav_zen4

Refer to the hardware section of this guide to understand which queue is associated with which resources and what their limits are.
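
For example, to send a job to the zen4 queue instead of ghpc, change the -p directive in your script (shown here purely as an illustration; check the hardware section for that queue's limits first):

#SBATCH -p zen4                    # Name of the queue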

Understanding CPUs at GHPC

In Slurm terminology, the processing units on nodes are cores. However, with the advent of Simultaneous Multithreading (SMT) architectures, a single core can have multiple hardware threads (sometimes known as hyper-threads). The operating system running on the servers sees the hardware threads as logical CPUs (while they are shared cores at the hardware level). For example, a server with an Intel Xeon CPU containing 12 physical cores is seen as having 24 CPUs by Linux if the HyperThreading feature is turned on. If HyperThreading is turned off, Linux only sees 12 CPUs.

This abstraction of logical CPUs affects the memory bandwidth available to each core in purely memory-bound scientific applications. However, in IO-bound workloads, and in some CPU-bound workloads, the CPU waits are high enough anyway that hardware threads could theoretically improve performance by up to 100%.

In order to improve the resource utilisation efficiency of the GHPC cluster, the servers are configured with HyperThreading ON by default. This means that if you have an application such as DMU that is known to be memory-bandwidth bound, you have to consciously request double the number of cores you expect to be made available to DMU in order to get the best performance.
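
As a sketch, if a memory-bandwidth-bound application runs best on 8 physical cores, you would request 16 logical CPUs so that the sibling hyperthreads stay idle and the application keeps the full bandwidth of those 8 cores to itself (the numbers here are purely illustrative):

#SBATCH -n 16                      # 16 logical CPUs = 8 physical cores with HyperThreading ON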

Understanding memory as a resource

Memory as a resource needs to be thought of in two dimensions. The first is memory bandwidth: the number of memory operations that can be performed per unit of time. Processors access memory through memory channels, and the bandwidth is typically shared by all of a processor's cores or hyperthreads. Running several jobs on the same node that all perform intensive memory read/write operations at the same time can be detrimental to overall performance. This can be alleviated by requesting double the number of "cores" you need for the job, thereby making sure the cores actually in use have plenty of memory bandwidth.

The second is the sheer amount of addressable memory. Memory is an expensive resource and needs to be used wisely to achieve the best results. On GHPC, the default allocation is 10.4 GiB of memory per core. However, not all jobs need that much memory, while some use more than that amount per core. Jobs are therefore required to specify approximately how much memory they are expected to use. If a job tries to use more, it will be killed by Slurm with an appropriate error message. On the other hand, if a job requests much more memory than it needs, the excess still counts towards the user's memory limits, and their other jobs will wait in the queue until those resources are relinquished. A good balance is to request a reasonable estimate of the memory needed plus about 10% contingency.
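
One practical way to estimate memory needs is to look at the peak memory of a similar job that has already finished, assuming Slurm accounting is enabled (it normally is). For example, using the job ID of a previously completed job:

sacct -j 326 --format=JobID,JobName,MaxRSS,Elapsed,State

The MaxRSS column reports the peak resident memory the job actually used; add roughly 10% to that figure when setting --mem for the next run.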

SLURM environment variables

You can use Slurm environment variables in your job scripts to make them re-usable.

$SLURM_JOB_ID ID of job allocation

$SLURM_SUBMIT_DIR Directory the job was submitted from

$SLURM_JOB_NODELIST List of nodes allocated to the job

$SLURM_NTASKS Number of tasks, i.e. the number of CPU cores requested with -n
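
A minimal sketch of how these variables make a job script re-usable (my_program and its --threads option are placeholders for your own application):

# Run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST with $SLURM_NTASKS cores"

# Pass the allocated core count to the application instead of hard-coding it
./my_program --threads $SLURM_NTASKS > result_$SLURM_JOB_ID.log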

Local disk storage for jobs

Even though networked storage is available across the entire cluster, it is important that users do NOT overload the central storage server with large amounts of real-time IO, including transient files and temporary results. Local storage is always faster for multiple accesses within the same job. If your job reads an input file several times and writes several transient files, copy the input files over to local storage at $TMPDIR, perform the compute work there, and copy only the results back to your home directory at the end of the job.

In some cases, where the input files are too big for the local node, the user may decide to run the job in such a way that it reads directly from networked storage. But this should be an informed decision rather than a blind guess.

The compute nodes have a minimum of 2 TiB of disk storage per node (16 or 32 cores). So, as long as your job does not go beyond roughly 1 TiB of on-disk data, plan to use local storage even if the data is read and written several times. Note that disk storage is not a "consumable resource" in Slurm, which means that a rogue job could potentially use up all of the disk storage, starving the rest of the jobs scheduled on that server and causing serious inconvenience. So practice caution, and ask your sysadmin if you are in doubt.

The local storage is accessible at $TMPDIR.
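
If you are unsure how much local disk your job is consuming while it runs, a quick check from within the job script (or from a shell on the node) is:

du -sh $TMPDIR                     # Total size of this job's scratch directory
df -h /scratch                     # Free space left on the node's local storage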

You can use the following pattern in your job scripts to make use of local storage.

# Create a temporary directory for the job in local storage
TMPDIR=/scratch/$USER/$SLURM_JOBID
export TMPDIR
mkdir -p $TMPDIR
cd $TMPDIR

# Copy the application binary if necessary to local storage
cp /path/to/software/executable .

# Copy input file(s) to local storage if necessary
cp /path/to/your/inputfiles .

# Execute the application with appropriate command

# Copy the result files, logs, and any relevant data back to the directory the job was submitted from.
mkdir -p $SLURM_SUBMIT_DIR/$SLURM_JOBID
cp output/files/needed $SLURM_SUBMIT_DIR/$SLURM_JOBID/

# Delete local storage on the way out as a clean up
cd $SLURM_SUBMIT_DIR
rm -rf /scratch/$USER/$SLURM_JOBID

Note: Use a modified version of the above example as necessary. For example, if your job involves several steps, you can delete the transient files from a completed step before proceeding to the next step.
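
As a sketch of that idea, with step1, step2 and the file names standing in for your own programs and data:

# Step 1 produces a large intermediate file in local storage
./step1 < input.dat > intermediate.dat

# Step 2 consumes the intermediate file and produces the final result
./step2 < intermediate.dat > result.dat

# The intermediate file is no longer needed; free the local disk before any further steps
rm -f intermediate.dat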

Remember: node-local storage is not backed up in any way. If your job leaves files on local storage after its completion, consider them lost and unrecoverable, and be aware that you are also causing inconvenience to other users.

Networked storage for jobs

GHPC uses a combination of an Isilon server and a NetApp filer as central storage, accessible via NFS on all compute nodes and console servers. This means that your home directories are available on all compute nodes just as they are on the console[1,2] servers. However, please read the local storage section carefully and make use of it; only copy the result files back to central networked storage.

Your home directories come with storage quotas, which are often limiting if you are working with huge datasets and result files. Contact your sysadmin by email if you need space to store large amounts of research data.
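
To get an idea of how much space your home directory currently uses, a simple check is:

du -sh $HOME

Whether a dedicated quota-reporting command is available depends on how the Isilon/NetApp quotas are configured; ask your sysadmin if the numbers do not add up.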

So how do I submit my job for execution on SLURM?

Hopefully you have patiently read the prior sections on this page. With a good understanding of how to request resources, you are probably wondering how to actually submit your job to the cluster for execution.

If you have a job script like the one described in the example earlier, all the resource requests are already made within the job script. Hence it is very easy and straightforward to queue it up for execution with Slurm:

asampath@console1:[~] > sbatch testslurm.sh
Submitted batch job 326

Where testslurm.sh is the name of the job script file. You get back the ID of the newly submitted job, in this case 326.
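
Any #SBATCH option can also be supplied directly on the sbatch command line, where it overrides the value in the script. For instance, to re-submit the same script with a longer time limit without editing it (the value here is only an example):

sbatch -t 4:00:00 testslurm.sh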

Follow along with the guide to learn how to monitor running and completed jobs.