Resource limits
Limit / Queue | GHPC | ZEN4 | nav_zen4 | nav |
---|---|---|---|---|
Max number of CPU cores that can be requested by a job | 32 | 128 | 128 | 32 |
Default number of CPU cores assigned to a job if not specified by user | 2 | 2 | 2 | 2 |
Max amount of memory that can be requested by a job | 740 GiB | 1.5 TiB | 1.5 TiB | 385 GiB |
Default amount of memory assigned to a job if not specified by user | 11.7 GiB/core | 11.75 GiB/core | 11.7 GiB/core | 11.75 GiB/core |
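As a rough illustration of the defaults in the table, the sketch below estimates the memory a job would receive when the user specifies no memory request. It assumes the default scales linearly with the number of cores, as the "GiB/core" unit suggests; the dictionary and function names are hypothetical and not part of any cluster tool.

```python
# Per-core default memory in GiB, copied from the table above.
DEFAULT_GIB_PER_CORE = {
    "GHPC": 11.7,
    "ZEN4": 11.75,
    "nav_zen4": 11.7,
    "nav": 11.75,
}

# Default number of cores when the user does not specify any.
DEFAULT_CORES = 2


def default_memory_gib(queue, cores=DEFAULT_CORES):
    """Estimate the default memory (GiB) for a job on the given queue.

    Assumption: default memory = cores * per-core default.
    """
    return cores * DEFAULT_GIB_PER_CORE[queue]


if __name__ == "__main__":
    for name in DEFAULT_GIB_PER_CORE:
        print(name, default_memory_gib(name), "GiB with default cores")
```

Under this assumption, a job on ZEN4 that requests 8 cores and no memory would get about 8 × 11.75 = 94 GiB.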
Fair usage limits:
However well resourced the cluster is, it is unfair for a single user to exhaust the resource pool at the cost of other users' requests. Hence fair usage limits are in place. The following limits apply to all users by default. If you reach one of these limits, your further jobs will wait in the queue until your earlier jobs complete and release their resources back to the pool.
Maximum number of CPU cores a user can utilise as part of their running jobs = 72
Maximum amount of memory a user can reserve at any point in time = 768 GiB
Maximum number of jobs a user can have (running + pending) in the system at a time = 144
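The three limits above amount to simple arithmetic on a user's current usage. The following is a minimal sketch of that arithmetic only, not of how the scheduler actually accounts for resources; the `Usage` class and `blocking_limits` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Per-user fair usage limits as stated above.
MAX_CORES = 72
MAX_MEMORY_GIB = 768
MAX_JOBS = 144


@dataclass
class Usage:
    running_cores: int          # cores held by the user's running jobs
    reserved_memory_gib: float  # memory reserved by the user's running jobs
    jobs_in_system: int         # running + pending jobs


def blocking_limits(usage, new_cores, new_memory_gib):
    """Return the names of the limits a new job would run into."""
    blocked = []
    if usage.jobs_in_system + 1 > MAX_JOBS:
        blocked.append("jobs")
    if usage.running_cores + new_cores > MAX_CORES:
        blocked.append("cores")
    if usage.reserved_memory_gib + new_memory_gib > MAX_MEMORY_GIB:
        blocked.append("memory")
    return blocked


if __name__ == "__main__":
    current = Usage(running_cores=60, reserved_memory_gib=600, jobs_in_system=10)
    # 600 GiB already reserved + 220 GiB requested exceeds 768 GiB.
    print(blocking_limits(current, new_cores=1, new_memory_gib=220))
```

An empty list means the job fits within all three limits; otherwise the job would have to wait until running jobs release enough of the named resource.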
What if a user hits one of the limits above?
Their jobs will be queued and will run only after their currently running jobs release enough resources for the new job to fit within the limits.
For example, if a job is made to wait because of a user's memory limit, it shows up as below.
asampath@c07b12:[~] > myst
JOBID PARTITION NAME USER ST TIME_LIMIT TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
3945 ghpc bash asampath PD 12:00:00 0:00 1 1 220G (QOSMaxMemoryPerUser)
How do I know if I hit any of the limits?
The myst and squeue commands will clearly state why your jobs are pending and which limits they are waiting to satisfy.
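The pending reason appears in parentheses in the NODELIST(REASON) column, as in the example output earlier. If you want to collect these reasons in a script, a minimal sketch could look like the following. It assumes the column layout shown in that example (job state in the fifth column, reason last), job names without spaces, and single-word reasons; `pending_reasons` is a hypothetical helper, not a cluster command.

```python
import re


def pending_reasons(squeue_text):
    """Map job ID -> pending reason from squeue-style text output."""
    reasons = {}
    for line in squeue_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a pending (PD) job.
        if len(fields) < 5 or not fields[0].isdigit() or fields[4] != "PD":
            continue
        # The reason is the last field, wrapped in parentheses.
        match = re.fullmatch(r"\((.+)\)", fields[-1])
        if match:
            reasons[fields[0]] = match.group(1)
    return reasons


# Sample text copied from the example output in this section.
sample = """JOBID PARTITION NAME USER ST TIME_LIMIT TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
3945 ghpc bash asampath PD 12:00:00 0:00 1 1 220G (QOSMaxMemoryPerUser)"""

if __name__ == "__main__":
    print(pending_reasons(sample))
```

A reason such as QOSMaxMemoryPerUser indicates that the job is waiting on the per-user memory limit rather than on free nodes.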
What if I need an exception?
Write an email to your sysadmin and give a convincing reason why you need extra resources.