Slurm Workload Manager
The H2P cluster uses Slurm for batch job queuing. The sinfo -M command provides an overview of the state of the nodes within the cluster.
The -M flag for sinfo, scontrol, sbatch and scancel specifies which cluster the command applies to; several clusters can be given as a comma-separated list. Without the -M flag, all commands refer to the smp cluster.
[shs159@login0 ~]$ sinfo -M smp,gpu,mpi
CLUSTER: gpu
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
gtx1080*   up     infinite   5      mix    gpu-stage[08-12]
gtx1080*   up     infinite   13     idle   gpu-n[16-25],gpu-stage[13-15]
titanx     up     infinite   1      mix    gpu-stage01
titanx     up     infinite   6      idle   gpu-stage[02-07]
k40        up     infinite   1      idle   smpgpu-n0
titan      up     infinite   5      idle   legacy-n[126,128-131]

CLUSTER: mpi
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
opa*       up     infinite   3      mix    opa-n[80,89-90]
opa*       up     infinite   47     alloc  opa-n[0-9,12,15-23,32-37,39-45,60-64,72-75,81-84,86]
opa*       up     infinite   46     idle   opa-n[10-11,13-14,24-31,38,46-59,65-71,76-79,85,87-88,91-95]
legacy     up     infinite   20     idle   legacy-n[0-19]

CLUSTER: smp
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
smp*       up     infinite   6      mix    smp-n[42,56-58,63,65]
smp*       up     infinite   3      alloc  smp-n[44,46,62]
smp*       up     infinite   91     idle   smp-n[24-41,43,45,47-55,59-61,64,66-123]
high-mem   up     infinite   29     idle   smp-256-n[1-2],smp-512-n[1-2],smp-n[0-23],smp-nvme-n1
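A node in the alloc state is fully allocated to running jobs, a node in the mix state has some of its CPUs allocated and others free, and an idle node is running no jobs. The asterisk next to a partition name marks it as the default partition for jobs submitted to that cluster.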
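As a rough sketch of how the sinfo listing can be summarized, the helper below totals the NODES column per cluster and state. The function name count_nodes_by_state is hypothetical and not part of Slurm. It assumes the default six-column layout shown above, and a node that belongs to more than one partition would be counted once per partition.

```shell
# Sum the NODES column of sinfo output per cluster and state.
# Reads sinfo text on stdin; prints "<cluster> <state> <node count>" lines.
count_nodes_by_state() {
    awk '
        BEGIN             { cluster = "-" }          # used if no CLUSTER: line appears
        $1 == "CLUSTER:"  { cluster = $2; next }     # start of a cluster block
        $1 == "PARTITION" { next }                   # skip the column header
        NF >= 6           { total[cluster " " $5] += $4 }
        END               { for (key in total) print key, total[key] }
    ' | sort
}

# On a login node (assumes the Slurm commands are on PATH):
#   sinfo -M smp,gpu,mpi | count_nodes_by_state
```

Piping the gpu block from the listing above through this function would report the mix and idle totals for that cluster on separate lines.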
squeue -M <cluster> shows the list of running and queued jobs on that cluster.
The most common states for jobs in squeue are described below. See man squeue for more details.
ABBREVIATION | STATE | DESCRIPTION |
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes. |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
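When post-processing squeue output in a script, it can help to expand the short codes. The following is a minimal sketch covering only the codes in the table; job_state_name is a hypothetical helper, not a Slurm command, and any other code is reported as UNKNOWN.

```shell
# Translate a squeue state code (the ST column) into its full state name.
# Covers only the common codes from the table; see man squeue for the rest.
job_state_name() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN"; return 1 ;;
    esac
}

# Example: job_state_name PD
```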
To see when pending jobs are expected to start, run squeue --start. See man squeue for a complete description of the possible REASONS for pending jobs.
The scontrol command shows detailed information about a job.
$ scontrol -M <cluster> show job <jobid>
- Note: not all jobs have a definite start time.
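The scontrol show job output is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. The sketch below is one way to do that; job_field is a hypothetical helper name, and it assumes the value contains no spaces (fields such as a job name with spaces would be truncated). For a job without a definite start time, the StartTime field is typically reported as Unknown.

```shell
# Print the value of one Key=Value field from scontrol show job output.
# $1 = field name (for example JobState or StartTime); scontrol text on stdin.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        $1 == key && !found {
            sub(/^[^=]*=/, "")   # drop everything up to the first "="
            print
            found = 1            # report only the first match
        }'
}

# On a login node (assumes the Slurm commands are on PATH):
#   scontrol -M <cluster> show job <jobid> | job_field StartTime
```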