Checking on your job(s)

How do I monitor my jobs?

Now that you have submitted your job(s), you can check the status of all your jobs in the queue using the alias myst:

asampath@console1:[~] > myst
             JOBID PARTITION     NAME     USER ST       TIME  NODES  CPUS MIN_MEMORY NODELIST(REASON)
               326       nav template asampath  R       2:22      1     2        10G sky001.ghpc.au.dk

As you can see, the job with job ID 326, which I recently submitted to the partition/queue nav with the name template as user asampath, has already started running, as indicated by the state 'R'. It has been running for 2 minutes and 22 seconds so far, on 1 node, using 2 CPUs, with a memory limit of 10 GiB, and is being executed on the node sky001.ghpc.au.dk.
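If you would rather have this view refresh on its own than re-run the command by hand, you can wrap the underlying squeue call in watch (assuming watch is installed on the login node); press Ctrl-C to stop:

watch -n 30 squeue -u $(whoami)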

On the other hand, if the cluster is busy serving a lot of jobs and your job is waiting for the resources it requested to become available, you'd see a status like the one below.

asampath@console1:[~] > myst
             JOBID PARTITION     NAME     USER ST       TIME  NODES  CPUS MIN_MEMORY NODELIST(REASON)
               387       nav template asampath PD       0:00      1     1        10G (Resources)
               388       nav template asampath PD       0:00      1     1        10G (Priority)

Here, the reason Resources indicates that the job is ready to run and is waiting for resources to be allocated, while Priority indicates that jobs submitted before this one need to be scheduled first; in other words, the job is waiting in the queue.
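For jobs that are still pending, you can also ask SLURM for its current estimate of when they will start. The estimate changes as other jobs finish, so treat it as a rough guide only:

squeue -u $(whoami) --start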

Note: The myst command is just a system-wide alias written by your sysadmin for the command squeue -u $(whoami) -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.5C %.10m %R", so that it is easier to use.
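If you are working somewhere the alias is not defined, or you want to tweak what is shown, you can run squeue yourself and adjust the options, for example restricting the listing to the nav partition:

squeue -u $(whoami) -p nav -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.5C %.10m %R"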

Where is the output?

After a job completes (or sometimes during the execution of a job with several steps), you will want to see its output. With SLURM, you control where the output goes. Refer to the "--output" and "--error" options in your job script to see where the result files are written.

In the example, we instructed SLURM to write a file "slurm_%A.out" in the home directory for stdout and "slurm_%A.err" for stderr.

asampath@console1:[~] > ls -alh slurm_326*
-rw-r--r--. 1 asampath qgg   0 Oct 28 15:10 slurm_326.err
-rw-r--r--. 1 asampath qgg 150 Oct 28 15:14 slurm_326.out 
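For reference, the directives that produce such files typically look like the sketch below; the actual paths in your own job script may differ, and %A is replaced by the job ID (326 in this case):

#!/bin/bash
# write the job's stdout and stderr to separate files; %A becomes the job ID
#SBATCH --output=slurm_%A.out
#SBATCH --error=slurm_%A.err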

If your job script copied the results of its commands to some other specific location, then check that location for the results.
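While a job is still running, you can also follow its stdout file as it grows; press Ctrl-C to stop following:

tail -f slurm_326.out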

How can I see my completed jobs?

At times, you need to look at your completed jobs and check their resource usage, so that you can size the job scripts of similar jobs accordingly. You can get the details of all your jobs using the sacct command. However, your sysadmin has made system-wide aliases called sj and saj to help you find the details of a particular job or of all the jobs you have ever submitted.

Details of a particular job:

If you want to know the details of a particular job, identified by its job ID, you can use the sj command as shown below.

asampath@console1:[~] > sj 430
       JobID    JobName     Group  ReqCPUS   TotalCPU     ReqMem     MaxRSS     AveRSS  MaxDiskRead MaxDiskWrite    Elapsed 
------------ ---------- --------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ---------- 
430          template_+       qgg        1  00:05.742       10Gn                                                   00:10:00 
430.batch         batch                  2  00:05.742       10Gn     20216K     20216K        0.04M        0.15M   00:10:00 

Here, the job requested 2 CPUs and 10 GiB of memory, used 20216 KiB (about 19.7 MiB) of memory at its peak, wrote about 0.15 MiB to disk, and ran for 10 minutes. This indicates that if the job were resubmitted, it would be enough to request just 1 CPU and about 30 MiB of memory.
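If the sj alias is not available where you are working, an sacct call along the following lines gives a similar view (this is an assumption about what the alias wraps, based on the columns shown above):

sacct -j 430 --format=JobID,JobName,Group,ReqCPUS,TotalCPU,ReqMem,MaxRSS,AveRSS,MaxDiskRead,MaxDiskWrite,Elapsed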

Details of a job while it is running:

If you need to see the status of a job that is currently running (for example, to see how much memory it is using), you can use the alias srj (status of a running job).

navtp@console1:~> srj 860
       JobID MaxRSSNode     AveRSS     MaxRSS  MaxVMSize     MinCPU 
------------ ---------- ---------- ---------- ---------- ---------- 
860.0              sky1      2508K     69036K    534412K  00:02.000
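Information like this for running job steps is what SLURM's sstat command reports; if srj is not defined, an sstat call along these lines (an assumption about what the alias wraps) shows similar columns:

sstat --allsteps -j 860 --format=JobID,MaxRSSNode,AveRSS,MaxRSS,MaxVMSize,MinCPU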

Details of all the jobs you ever ran:

If you don't know the job ID and would like to query all your jobs, you can use the output of saj as a source. Note: It can be very long if you have run a lot of jobs.

asampath@console1:[~] > saj
       JobID    JobName     Group  ReqCPUS   TotalCPU     ReqMem     MaxRSS     AveRSS  MaxDiskRead MaxDiskWrite    Elapsed 
------------ ---------- --------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ---------- 
326          template_+       qgg        1  00:05.159       10Gn                                                   00:04:14 
326.batch         batch                  2  00:05.159       10Gn     20576K     20576K        0.87M        0.60M   00:04:14 
327          template_+       qgg        1  00:06.370       10Gn                                                   00:10:26 
327.batch         batch                  2  00:06.370       10Gn     20216K     20216K        0.04M        0.20M   00:10:26 
328          template_+       qgg        1  00:05.504       10Gn                                                   00:10:23 
328.batch         batch                  2  00:05.504       10Gn     20216K     20216K        0.04M        0.18M   00:10:23 
329          template_+       qgg        1  00:07.223       10Gn                                                   00:10:23 
329.batch         batch                  2  00:07.223       10Gn     20216K     20216K        0.04M        0.17M   00:10:23 
330          template_+       qgg        1  00:07.238       10Gn                                                   00:10:23 
330.batch         batch                  2  00:07.238       10Gn     20216K     20216K        0.04M        0.17M   00:10:23 

(Output trimmed for brevity)

You can pipe its output to grep to filter it, or to less to page through it.
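For example, to keep only the lines for jobs whose name starts with template, or to scroll page by page:

saj | grep template
saj | less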

Details of all the jobs you ran in the last 24 hours:

Another handy alias, sajt, lists all jobs that were initiated in the last 24 hours (running, queued, completed, failed, cancelled). You can use the MaxRSS column to see the peak memory usage of a job and size similar jobs accordingly in the future.

asampath@console1:[~] > sajt
       JobID    JobName      User  Partition      State  ReqCPUS   TotalCPU     ReqMem     MaxRSS     AveRSS  MaxDiskRead MaxDiskWrite    Elapsed 
------------ ---------- --------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ---------- 
553          template_+  asampath        nav    RUNNING        1   00:00:00       10Gn                                                   00:00:32 
554            hostname  asampath    ghpc_v1  COMPLETED        1  00:00.002       20Gc      1312K      1312K            0            0   00:00:00 
555          template_+  asampath        nav    RUNNING        1   00:00:00       10Gn                                                   00:00:03
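If you need the same kind of listing without the alias, an sacct call with a start time of 24 hours ago is one way to get it (a sketch of what sajt presumably wraps, not its exact definition):

sacct -u $(whoami) -S now-1days --format=JobID,JobName,User,Partition,State,ReqCPUS,TotalCPU,ReqMem,MaxRSS,AveRSS,MaxDiskRead,MaxDiskWrite,Elapsed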