Other SLURM tasks

Displaying resources available in the cluster

When running jobs, it can be useful to check which resources are available in the cluster and to request resources accordingly. Your sysadmin has created an alias, ghpcinfo, that summarizes the resources available in GHPC. It wraps the Slurm command sinfo and presents its output in an easy-to-read format.

# ghpcinfo
PARTITION    AVAIL TIMELIMIT    CPUS(A/I/O/T)  S:C:T    FREE_MEM       NODELIST
zen4         up    45-12:00:00  460/692/0/1152 2:32:2   49704-1118296  epyc[01-09]
nav_zen4     up    45-12:00:00  14/242/0/256   2:32:2   1350332-152040 epyc[10-11]
ghpc         up    45-12:00:00  592/1008/0/160 2:8+:2   49704-1118296  cas[1-8],epyc[01-09],sky[006-008,011-013]
ghpc_short   up    1-01:00:00   8/24/0/32      2:8:2    357323         sky008
nav          up    45-12:00:00  14/434/0/448   2:8+:2   302291-1520405 epyc[10-11],sky[001-005,014]

Where,

There are five partitions (queues): ghpc (the default), zen4, ghpc_short, nav and nav_zen4.

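CPUS(A/I/O/T) is the number of CPUs in each state: Allocated/Idle/Other/Total. Other typically covers CPUs on nodes that are down or drained. Values wider than their column are cut off: for ghpc, the total is 592 + 1008 + 0 = 1600 CPUs, displayed as 160.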

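S:C:T stands for Sockets:Cores:Threads per node. For example, 2:32:2 means each server has 2 sockets, each socket has 32 cores, and each core runs 2 hardware threads (hyperthreads), giving 2 × 32 × 2 = 128 logical CPUs per server. A + (as in 2:8+:2) means the nodes in that partition differ and have at least that many.

FREE_MEM is the range of free memory, in MB, across the nodes of the partition, and NODELIST lists the nodes that belong to it.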
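The two columns explained above can be unpacked with plain Bash to check the arithmetic. This is only a sketch: the function names cpu_states and logical_cpus are hypothetical (not GHPC aliases), and a trailing + in a field such as 2:8+:2 is treated as "at least", so the result is the minimum per node.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split a CPUS(A/I/O/T) value such as "460/692/0/1152"
# into allocated, idle, other and total CPUs.
cpu_states() {
  local alloc idle other total
  IFS=/ read -r alloc idle other total <<< "$1"
  echo "allocated=$alloc idle=$idle other=$other total=$total"
}

# Hypothetical helper: logical CPUs per node from an S:C:T value
# (sockets x cores per socket x threads per core).
# A trailing "+" (e.g. "8+") means "at least", so it is stripped and the
# result is the minimum for that partition.
logical_cpus() {
  local s c t
  IFS=: read -r s c t <<< "$1"
  echo $(( ${s%+} * ${c%+} * ${t%+} ))
}

logical_cpus 2:32:2                       # 128 logical CPUs per epyc node
echo $(( 9 * $(logical_cpus 2:32:2) ))    # 9 nodes, epyc[01-09] -> 1152, the zen4 total
cpu_states 460/692/0/1152                 # the zen4 row
```

Multiplying the per-node count by the number of nodes in NODELIST is a quick way to sanity-check the Total field for a partition with identical nodes.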

Canceling a job

Sometimes you need to cancel a job that was submitted by mistake or with the wrong specifications.

Check your jobs with the myst alias and find the ID of the job you want to cancel. Then cancel it with the scancel command:

asampath@console1:[~] > myst
             JOBID PARTITION     NAME     USER ST       TIME  NODES  CPUS MIN_MEMORY NODELIST(REASON)
               556       nav template asampath  R       0:07      1     2        10G c09b03.ghpc.au.dk
               557       nav template asampath  R       0:04      1     2        10G c09b03.ghpc.au.dk
asampath@console1:[~] > scancel 556
asampath@console1:[~] > myst
             JOBID PARTITION     NAME     USER ST       TIME  NODES  CPUS MIN_MEMORY NODELIST(REASON)
               557       nav template asampath  R       0:11      1     2        10G c09b03.ghpc.au.dk
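scancel can also select jobs by filter instead of by ID. The helpers below are a sketch for your ~/.bashrc: the names cancel_pending and cancel_by_name are hypothetical, and they rely only on scancel's standard --user, --state and --name options. The job name template is simply the one from the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for ~/.bashrc, built on standard scancel filters.

# Cancel only your pending (queued, not yet started) jobs;
# running jobs are left alone.
cancel_pending() {
  scancel --user="$USER" --state=PENDING
}

# Cancel all of your jobs with a given job name, e.g. `cancel_by_name template`.
cancel_by_name() {
  if [ -z "$1" ]; then
    echo "usage: cancel_by_name <job name>" >&2
    return 1
  fi
  scancel --user="$USER" --name="$1"
}
```

For a handful of specific jobs, scancel also accepts several IDs at once, for example scancel 556 557.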

Modifying the time limit of a running/pending job

Slurm does not allow users to increase the time limit of a job; only administrators can. If you submitted a job with a time limit of x hours and realize it may need x+2 hours to finish, email your sysadmin at any time while the job is pending or running and ask for the limit to be increased. Do not expect an immediate response, but if your sysadmin reads the email in time, the change is quick to make.

When you send the email, state the job ID and the new time limit as plain text. Do NOT send screenshots.
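A minimal sketch of two hypothetical Bash helpers (the names are illustrative, not GHPC aliases). lower_timelimit wraps scontrol update JobId=... TimeLimit=..., which Slurm lets you use to lower, but not raise, the limit of your own job. timelimit_request prints the job ID and the requested limit as plain text, ready to paste into the email. Times use Slurm's days-hours:minutes:seconds format, as in the TIMELIMIT column of ghpcinfo.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the function names are illustrative only.

# Lower (never raise) the time limit of one of your own jobs.
# Time format follows Slurm, e.g. 2-00:00:00 for two days.
lower_timelimit() {
  if [ "$#" -ne 2 ]; then
    echo "usage: lower_timelimit <jobid> <new time limit>" >&2
    return 1
  fi
  scontrol update JobId="$1" TimeLimit="$2"
}

# Print a plain-text request to paste into an email to the sysadmin.
timelimit_request() {
  if [ "$#" -ne 2 ]; then
    echo "usage: timelimit_request <jobid> <new time limit>" >&2
    return 1
  fi
  printf 'Please change the time limit of job %s to %s.\n' "$1" "$2"
}

timelimit_request 557 2-00:00:00
```

To see a job's current limit and remaining time before writing, squeue -j <jobid> -o "%i %l %L" prints its job ID, time limit and time left.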

Moving a job to the top of your queue

At times you may need a job to run with higher priority than the rest of your jobs already waiting in the queue. You can move any of your pending jobs to the top of your queue with:

scontrol top <jobid>
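scontrol top also accepts a comma-separated, ordered list of job IDs. According to the Slurm documentation, it only reorders jobs that share the same user, partition, account and QOS, and it does so by adjusting the jobs' nice values. The sketch below uses hypothetical helper names (join_ids, top_jobs) to build that list from several IDs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: move several of your jobs to the top of your queue,
# in the order given, e.g. `top_jobs 557 558`.

# Join job IDs with commas, the list format scontrol top expects.
join_ids() {
  local IFS=,
  echo "$*"
}

top_jobs() {
  if [ "$#" -eq 0 ]; then
    echo "usage: top_jobs <jobid> [<jobid> ...]" >&2
    return 1
  fi
  scontrol top "$(join_ids "$@")"
}
```

Running top_jobs 557 558 is equivalent to scontrol top 557,558.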

Resubmit a job

To resubmit a batch job with the same parameters, requeue it. This puts a running, suspended or finished batch job back into the pending state:

scontrol requeue <jobid>
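A hypothetical helper (requeue_job is an illustrative name) that requeues a batch job and then lists it with squeue, so you can confirm it is queued again. A requeued job keeps its job ID; right after the call it may still show a transitional state before settling on pending (PD).

```shell
#!/usr/bin/env bash
# Hypothetical helper: requeue one batch job and show its current state.
requeue_job() {
  if [ "$#" -ne 1 ]; then
    echo "usage: requeue_job <jobid>" >&2
    return 1
  fi
  # Only list the job if the requeue itself succeeded.
  scontrol requeue "$1" && squeue --jobs="$1"
}
```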