Job Scheduling Policy 

The job scheduler gives all groups an equal opportunity to run computations. A job's position in the queue is determined by four factors (see the example after this list for how to inspect them for a specific job):

  • Age: Jobs that sit in the queue longer get increased priority.
  • Job Size: Jobs with increased CPU or RAM requirements get increased priority.
  • Wall time: Shorter wall times get increased priority.
  • Fair Share: Groups with high overall cluster usage get decreased priority.
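
To see how these factors combine for one of your jobs, SLURM's sprio utility reports the per-factor contributions. A minimal sketch is below; the job ID is a placeholder.

  # Show the priority factor breakdown (age, fair share, job size, ...)
  # for a pending job; replace 123456 with your own job ID.
  sprio -j 123456

  # Show the weight the scheduler assigns to each factor on this cluster.
  sprio -w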


Why is a job not running?

After submission, a job may appear with status PD (pending). The most common causes are:
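
To see the reason SLURM reports for your pending jobs, you can ask squeue to print the reason column. The format string below is just one possible choice.

  # List your pending jobs together with the reason each one is waiting.
  squeue -u $USER -t PENDING -o "%.10i %.12P %.20j %r"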

Resources

The cluster is busy and no resources are currently available for your job. It will run as soon as the requested resources become available.
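
When the reason is Resources, the scheduler can usually estimate a start time; the job ID below is a placeholder.

  # Ask SLURM for the expected start time of a pending job.
  squeue --start -j 123456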

Priority

Your job cannot currently run because of the fair share policy. Note that this policy is based on the overall cluster usage of your whole group.
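
You can check your group's fair-share standing with sshare; the account name below is a placeholder.

  # Show fair-share usage for your account; a low FairShare value
  # reflects heavy recent usage and therefore lower priority.
  # Add -a to also list every user in the account.
  sshare -A my_account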

AssocGrpCPURunMinutesLimit

Your group's allocation has expired or has been fully spent. If you are planning to submit a proposal for additional resources, you can request a temporary extension of your current SU allocation by submitting a ticket.
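
One way to inspect the limits and usage recorded for your association is sshare in long mode; the exact columns shown depend on the SLURM version, and the account name is a placeholder.

  # The long listing includes group TRES (e.g. CPU-minute) limits and current usage.
  sshare -l -A my_account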

Dependency

The job cannot start until another job finishes. This only happens if you included a --dependency directive in your SLURM script.
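
A minimal sketch of a job script that waits for another job is shown below; the job ID, job name, and script name are placeholders.

  #!/bin/bash
  #SBATCH --job-name=postprocess
  #SBATCH --dependency=afterok:123456   # start only after job 123456 completes successfully

  ./analyze_results.sh

The same directive can also be given on the command line, e.g. sbatch --dependency=afterok:123456 postprocess.sh.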

DependencyNeverSatisfied

A job cannot start because another job on which it depends failed. Please cancel this job, as it will never be able to run.
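
Such a job can be cancelled and, if appropriate, resubmitted against a new dependency; the job ID below is a placeholder.

  # Cancel the job whose dependency can no longer be satisfied.
  scancel 123456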

MaxCpuPerAccount

Your account has reached the maximum number of CPUs allowed. You will have to wait until other jobs from your group are complete.
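
To see how many CPUs your group's running jobs are currently occupying, you can list jobs by account; the account name is a placeholder.

  # List running jobs charged to your account, including the CPUs each uses (%C).
  squeue -A my_account -t RUNNING -o "%.10i %.8u %.20j %.5C"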

MaxMemoryPerAccount

The job exceeds the current memory quota. The maximum quota available depends on the cluster and partition. The table below gives the maximum memory available (per node) in each partition.

Cluster   Partition      Max memory in GB
smp       smp            251
smp       legacy         62
smp       high-mem       995
mpi       opa            62
mpi       opa-high-mem   187
mpi       ib             125
gpu       gtx1080        125
gpu       titanx         125
gpu       k40            125
gpu       v100           187
gpu       power9         511
htc       htc            754
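
If you hit this limit, make sure the memory you request fits within the per-node maximum of the partition you target. A minimal sketch of the relevant SBATCH directives, with placeholder values, is below.

  #!/bin/bash
  #SBATCH --clusters=smp
  #SBATCH --partition=high-mem
  #SBATCH --mem=512G              # total memory per node; must stay below the 995 GB limit of high-mem
  # Alternatively, request memory per core instead:
  ##SBATCH --mem-per-cpu=8G

  ./my_program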