Checking on your job(s)
How do I monitor my jobs?
Now that you have submitted your job(s), you can check the status of all your jobs in the queue using the alias myst:
asampath@console1:[~] > myst
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
326 nav template asampath R 2:22 1 2 10G sky001.ghpc.au.dk
As you can see, the job with job ID 326, which I recently submitted to the partition/queue nav with the name template under the username asampath, has already started running, as indicated by the state 'R'. It has been running for 2 minutes and 22 seconds so far, on 1 node, using 2 CPUs with a memory limit of 10 GiB, and is executing on the node sky001.ghpc.au.dk.
On the other hand, if the cluster is busy serving a lot of jobs and your job is waiting for the resources it requested to become available, you'd see a status like the one below.
asampath@console1:[~] > myst
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
387 nav template asampath PD 0:00 1 1 10G (Resources)
388 nav template asampath PD 0:00 1 1 10G (Priority)
Here, the Resources reason indicates that the job is ready to run and is waiting for resources to be allocated, while Priority indicates that jobs submitted before this one need to be scheduled first; in other words, this job is waiting in the queue.
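If you are wondering when a pending job might actually start, you can ask SLURM for its current estimate; squeue's --start flag reports the expected start time of pending jobs (it is only an estimate and changes as the queue evolves):
squeue -u $(whoami) --start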
Note: myst is just a system-wide alias written by your sysadmin, for ease of use, for the command squeue -u $(whoami) -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.5C %.10m %R".
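If you want a slightly different view, you can define a similar alias of your own, for example in your ~/.bashrc. The name myq and the extra %l column (the job's time limit) below are just illustrative choices, not something provided by the sysadmin:
alias myq='squeue -u $(whoami) -o "%.18i %.9P %.8j %.2t %.10M %.10l %.6D %R"'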
Where is the output?
After a job completes (or sometimes during the execution of a job with several steps), you will want to see its output. With SLURM, you control where the output goes. Refer to the --output and --error options in your job script to see where the result files are written.
In the example, we instructed SLURM to write a file "slurm_%A.out" in the home directory for stdout and "slurm_%A.err" for stderr.
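For reference, the corresponding lines in a job script look something like this:
#SBATCH --output=slurm_%A.out
#SBATCH --error=slurm_%A.err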
asampath@console1:[~] > ls -alh slurm_326*
-rw-r--r--. 1 asampath qgg 0 Oct 28 15:10 slurm_326.err
-rw-r--r--. 1 asampath qgg 150 Oct 28 15:14 slurm_326.out
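While a job is still running, you can watch its stdout file grow with tail, for example:
tail -f slurm_326.out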
If your job script copied the results of its commands to some specific location, check that location for the results instead.
How can I see my completed jobs?
At times, you need to look at your completed jobs and check their resource usage so that you can model the job scripts of similar jobs. You can get the details of all your jobs using the sacct command. However, your sysadmin has created the system-wide aliases sj and saj to help you find the details of a particular job, or of all the jobs you have ever submitted.
Details of a particular job:
If you want to know the details of a particular job, identified by its job ID, you can use the sj alias as shown below.
asampath@console1:[~] > sj 430
JobID JobName Group ReqCPUS TotalCPU ReqMem MaxRSS AveRSS MaxDiskRead MaxDiskWrite Elapsed
------------ ---------- --------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ----------
430 template_+ qgg 1 00:05.742 10Gn 00:10:00
430.batch batch 2 00:05.742 10Gn 20216K 20216K 0.04M 0.15M 00:10:00
Here, the job requested 2 CPUs and 10 GiB of memory, used 20216 KiB (about 19.74 MiB) of memory at peak, wrote about 0.15 MB to disk in total, and ran for 10 minutes. This result indicates that if the job were to be resubmitted, it would be sufficient to request just 1 CPU and about 30 MiB of memory.
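The sj alias itself is presumably another sacct wrapper; its exact definition isn't shown here, but judging from the columns above it would be roughly equivalent to the following direct sacct call (the format list is an assumption inferred from the output):
sacct -j 430 --format=JobID,JobName,Group,ReqCPUS,TotalCPU,ReqMem,MaxRSS,AveRSS,MaxDiskRead,MaxDiskWrite,Elapsed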
Details of a job while it is running:
If you need to see the status of a job that is currently running (for instance, to see how much memory it is using), you can use the alias srj (status of a running job).
navtp@console1:~> srj 860
JobID MaxRSSNode AveRSS MaxRSS MaxVMSize MinCPU
------------ ---------- ---------- ---------- ---------- ----------
860.0 sky1 2508K 69036K 534412K 00:02.000
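Live statistics like these come from sstat, which only works on running jobs; srj is presumably a wrapper roughly equivalent to the following (the format list is an assumption inferred from the columns above):
sstat -j 860 --format=JobID,MaxRSSNode,AveRSS,MaxRSS,MaxVMSize,MinCPU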
Details of all the jobs you ever ran:
If you don't know the job ID and would like to query all your jobs, you can use the output of saj as a source. Note: the output can be exhaustively long if you have run a lot of jobs.
asampath@console1:[~] > saj
JobID JobName Group ReqCPUS TotalCPU ReqMem MaxRSS AveRSS MaxDiskRead MaxDiskWrite Elapsed
------------ ---------- --------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ----------
326 template_+ qgg 1 00:05.159 10Gn 00:04:14
326.batch batch 2 00:05.159 10Gn 20576K 20576K 0.87M 0.60M 00:04:14
327 template_+ qgg 1 00:06.370 10Gn 00:10:26
327.batch batch 2 00:06.370 10Gn 20216K 20216K 0.04M 0.20M 00:10:26
328 template_+ qgg 1 00:05.504 10Gn 00:10:23
328.batch batch 2 00:05.504 10Gn 20216K 20216K 0.04M 0.18M 00:10:23
329 template_+ qgg 1 00:07.223 10Gn 00:10:23
329.batch batch 2 00:07.223 10Gn 20216K 20216K 0.04M 0.17M 00:10:23
330 template_+ qgg 1 00:07.238 10Gn 00:10:23
330.batch batch 2 00:07.238 10Gn 20216K 20216K 0.04M 0.17M 00:10:23
(Output trimmed for brevity)
You can pipe its output to grep to filter for specific jobs, or to less to page through it.
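For example, to pull out the lines for one job, or to page through the wide table without line wrapping:
saj | grep 328
saj | less -S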
Details of all the jobs you ran in the last 24 hours:
sajt is another handy alias; it lists all jobs that were initiated in the last 24 hours (running, queued, completed, failed, or cancelled). You can use the MaxRSS column to understand the peak memory usage of your jobs and model similar jobs in the future.
asampath@console1:[~] > sajt
JobID JobName User Partition State ReqCPUS TotalCPU ReqMem MaxRSS AveRSS MaxDiskRead MaxDiskWrite Elapsed
------------ ---------- --------- ---------- ---------- -------- ---------- ---------- ---------- ---------- ------------ ------------ ----------
553 template_+ asampath nav RUNNING 1 00:00:00 10Gn 00:00:32
554 hostname asampath ghpc_v1 COMPLETED 1 00:00.002 20Gc 1312K 1312K 0 0 00:00:00
555 template_+ asampath nav RUNNING 1 00:00:00 10Gn 00:00:03
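As with the other aliases, sajt is presumably a sacct wrapper; a roughly equivalent direct query for the last 24 hours would be (the --starttime expression and format list are assumptions):
sacct -u $(whoami) --starttime=now-1days --format=JobID,JobName,User,Partition,State,ReqCPUS,TotalCPU,ReqMem,MaxRSS,AveRSS,MaxDiskRead,MaxDiskWrite,Elapsed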