Slurm Workload Manager

The H2P cluster uses Slurm for batch job queuing. The sinfo -M <cluster> command provides an overview of the state of the nodes within the named cluster or clusters.

The -M flag for sinfo, scontrol, sbatch and scancel specifies which cluster the command applies to. Without the -M flag, all commands refer to the smp cluster.

[shs159@login0 ~]$ sinfo -M smp,gpu,mpi
CLUSTER: gpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gtx1080*     up   infinite      5    mix gpu-stage[08-12]
gtx1080*     up   infinite     13   idle gpu-n[16-25],gpu-stage[13-15]
titanx       up   infinite      1    mix gpu-stage01
titanx       up   infinite      6   idle gpu-stage[02-07]
k40          up   infinite      1   idle smpgpu-n0
titan        up   infinite      5   idle legacy-n[126,128-131]

CLUSTER: mpi
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
opa*         up   infinite      3    mix opa-n[80,89-90]
opa*         up   infinite     47  alloc opa-n[0-9,12,15-23,32-37,39-45,60-64,72-75,81-84,86]
opa*         up   infinite     46   idle opa-n[10-11,13-14,24-31,38,46-59,65-71,76-79,85,87-88,91-95]
legacy       up   infinite     20   idle legacy-n[0-19]

CLUSTER: smp
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
smp*         up   infinite      6    mix smp-n[42,56-58,63,65]
smp*         up   infinite      3  alloc smp-n[44,46,62]
smp*         up   infinite     91   idle smp-n[24-41,43,45,47-55,59-61,64,66-123]
high-mem     up   infinite     29   idle smp-256-n[1-2],smp-512-n[1-2],smp-n[0-23],smp-nvme-n1

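A node in the alloc state is fully allocated to running jobs, a node in the mix state has some of its cores allocated and others free, and a node in the idle state is running no jobs. The asterisk next to a partition name marks the default partition for jobs on that cluster.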
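As a rough illustration of reading this listing, the sketch below sums the NODES column for rows in the idle state, grouped by cluster. The function name idle_nodes_per_cluster is made up for this example, and the input is a shortened copy of the listing above, so the script runs without access to Slurm. It assumes the default sinfo column layout shown above (NODES in column 4, STATE in column 5).

```shell
#!/bin/sh
# Sum the NODES column for rows whose STATE is "idle", grouped by cluster.
# Reads `sinfo -M ...` text on stdin. The function name is hypothetical.
idle_nodes_per_cluster() {
    awk '
        $1 == "CLUSTER:" { cluster = $2; next }   # remember the current cluster
        $5 == "idle"     { idle[cluster] += $4 }  # column 4 = NODES, column 5 = STATE
        END { for (c in idle) print c, idle[c] }
    ' | sort
}

# Shortened sample taken from the listing above. On a login node one could run
#   sinfo -M smp,gpu,mpi | idle_nodes_per_cluster
idle_nodes_per_cluster <<'EOF'
CLUSTER: gpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gtx1080*     up   infinite      5    mix gpu-stage[08-12]
gtx1080*     up   infinite     13   idle gpu-n[16-25],gpu-stage[13-15]
titanx       up   infinite      6   idle gpu-stage[02-07]

CLUSTER: smp
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
smp*         up   infinite      3  alloc smp-n[44,46,62]
smp*         up   infinite     91   idle smp-n[24-41,43,45,47-55,59-61,64,66-123]
EOF
```

A node that belongs to more than one partition is listed once per partition, so the totals can overcount on clusters with overlapping partitions.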

squeue -M <cluster> shows the list of running and queued jobs on that cluster.

The most common states for jobs in squeue are described below. See man squeue for more details.

ABBREVIATION  STATE       DESCRIPTION
CA            CANCELLED   Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD            COMPLETED   Job has terminated all processes on all nodes.
CG            COMPLETING  Job is in the process of completing. Some processes on some nodes may still be active.
F             FAILED      Job terminated with non-zero exit code or other failure condition.
PD            PENDING     Job is awaiting resource allocation.
R             RUNNING     Job currently has an allocation.
TO            TIMEOUT     Job terminated upon reaching its time limit.
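The abbreviations can be turned into a quick per-state summary. The sketch below is one possible approach, not part of the cluster's tooling: the function name count_job_states is made up, and the sample input is hypothetical. It reads one compact state code per line, which is what squeue prints with the -h (no header) and -o "%t" (compact state) options.

```shell
#!/bin/sh
# Translate compact squeue state codes (one per line on stdin) into the
# full state names and count how many jobs are in each state.
# The function name is hypothetical.
count_job_states() {
    sort | uniq -c | while read -r count code; do
        case "$code" in
            CA) name=CANCELLED ;;
            CD) name=COMPLETED ;;
            CG) name=COMPLETING ;;
            F)  name=FAILED ;;
            PD) name=PENDING ;;
            R)  name=RUNNING ;;
            TO) name=TIMEOUT ;;
            *)  name=OTHER ;;      # any line that is not one of the codes above
        esac
        echo "$code $name $count"
    done
}

# Hypothetical sample input. On the cluster one could instead pipe in
#   squeue -M smp -h -u "$USER" -o "%t"
printf 'R\nPD\nR\nCG\nPD\nR\n' | count_job_states
```

With the sample input this prints one line each for CG, PD and R, with counts 1, 2 and 3.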

To see when pending jobs are expected to start, run squeue --start. See man squeue for a complete description of the possible REASONS for pending jobs.

The scontrol show job command displays detailed information about a single job.

$ scontrol -M <cluster> show job <jobid>
  • Note: not all jobs have a definite start time.
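The scontrol show job output is a series of key=value fields, so a single field can be picked out with standard text tools. The sketch below is a minimal example under stated assumptions: the function name job_field is made up, and the sample text is an abbreviated, illustrative stand-in for real output (the job ID and name are invented). It also shows the case from the note above, where a pending job reports StartTime=Unknown.

```shell
#!/bin/sh
# Print the value of one key=value field from `scontrol show job` text on stdin.
# The function name is hypothetical. Values that contain spaces are cut short.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Abbreviated, illustrative sample. On the cluster one could instead run
#   scontrol -M <cluster> show job <jobid> | job_field JobState
sample='JobId=12345 JobName=example
   JobState=PENDING Reason=Resources Dependency=(null)
   StartTime=Unknown EndTime=Unknown'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field StartTime
```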