Slurm Workload Manager
The CRC clusters use Slurm for batch job queuing. The sinfo -M command provides an overview of the state of the nodes within a cluster.
The -M flag for sinfo, scontrol, sbatch, and scancel specifies which cluster (or comma-separated list of clusters) the command applies to. Without the -M flag, all commands refer to the smp cluster.
[shs159@login0 ~]$ sinfo -M smp,gpu,mpi
CLUSTER: gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gtx1080*  up    infinite  5     mix   gpu-stage[08-12]
gtx1080*  up    infinite  13    idle  gpu-n[16-25],gpu-stage[13-15]
titanx    up    infinite  1     mix   gpu-stage01
titanx    up    infinite  6     idle  gpu-stage[02-07]
k40       up    infinite  1     idle  smpgpu-n0
titan     up    infinite  5     idle  legacy-n[126,128-131]

CLUSTER: mpi
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
opa*      up    infinite  3     mix   opa-n[80,89-90]
opa*      up    infinite  47    alloc opa-n[0-9,12,15-23,32-37,39-45,60-64,72-75,81-84,86]
opa*      up    infinite  46    idle  opa-n[10-11,13-14,24-31,38,46-59,65-71,76-79,85,87-88,91-95]
legacy    up    infinite  20    idle  legacy-n[0-19]

CLUSTER: smp
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
smp*      up    infinite  6     mix   smp-n[42,56-58,63,65]
smp*      up    infinite  3     alloc smp-n[44,46,62]
smp*      up    infinite  91    idle  smp-n[24-41,43,45,47-55,59-61,64,66-123]
high-mem  up    infinite  29    idle  smp-256-n[1-2],smp-512-n[1-2],smp-n[0-23],smp-nvme-n1
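A node in the alloc state is fully allocated to running jobs, a node in the mix state has some of its CPUs allocated and others free, and a node in the idle state is running no jobs. The asterisk next to a partition name marks it as the default partition for jobs submitted to that cluster.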
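To get a quick summary of how many nodes are in each state, the sinfo listing can be totalled per STATE column. The following is a minimal sketch, not a CRC-provided tool: the function name count_nodes_by_state is hypothetical, and it assumes the default six-column sinfo layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST).

```shell
# Sum the NODES column per STATE from sinfo-style output read on stdin.
# Assumes the default six-column layout; "CLUSTER:" lines, the
# "PARTITION ..." header, and blank lines are skipped.
count_nodes_by_state() {
    awk '
        /^CLUSTER:/ || $1 == "PARTITION" || NF < 6 { next }
        { total[$5] += $4 }                      # $4 = NODES, $5 = STATE
        END { for (s in total) print s, total[s] }
    ' | sort
}

# Usage (on a login node): sinfo -M smp | count_nodes_by_state
```

Note that nodes from every partition in the input are added together, so piping in the output for several clusters at once gives a combined total rather than one per cluster.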
squeue -M <cluster> shows the list of running and queued jobs on that cluster.
The most common job states reported by squeue are described below. See man squeue for the full list.
ABBREVIATION | STATE | DESCRIPTION |
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes. |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
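When post-processing squeue output in a script, it can help to expand the abbreviations back into state names. The sketch below is illustrative only: job_state_name is a hypothetical helper, and it covers just the codes listed in the table above (squeue defines more; see man squeue).

```shell
# Translate a squeue state abbreviation into its full state name.
# Only the codes from the table above are handled; any other
# code prints UNKNOWN.
job_state_name() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN" ;;
    esac
}
```

For example, job_state_name PD prints PENDING. The abbreviations are what squeue prints in its ST column, or with the %t format specifier.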
To see when jobs are expected to start, run squeue --start. See man squeue for a complete description of the possible REASONS for pending jobs.
The scontrol command shows detailed information about a job:
$ scontrol -M <cluster> show job <jobid>
- Note: not all jobs have a definite start time.
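The scontrol show job output is a series of Key=Value pairs, so a single field such as the job state or start time can be pulled out with standard text tools. This is a rough sketch under that assumption: job_field is a hypothetical helper name, and it only works for values that contain no spaces.

```shell
# Print the value of one Key=Value field from scontrol-style output on stdin.
# Limitation: the output is split on spaces, so values that themselves
# contain spaces (for example some job names or commands) are truncated.
job_field() {
    key="$1"
    tr ' ' '\n' | awk -F= -v k="$key" '
        $1 == k && !found {
            print substr($0, length(k) + 2)   # everything after "Key="
            found = 1
        }
    '
}

# Usage: scontrol -M <cluster> show job <jobid> | job_field StartTime
```

A job without a definite start time is expected to show a placeholder rather than a timestamp in the StartTime field, so scripts should not assume the value always parses as a date.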