Slurm Workload Manager

The HTC cluster uses Slurm for batch job queuing. The htc partition contains 16 compute nodes and is the default partition. The sinfo command gives an overview of the state of the nodes in the cluster.

[fangping@login0a ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
htc*         up 6-00:00:00      4    mix n[410,413,417,427]
htc*         up 6-00:00:00     16  alloc n[409,411-412,414-416,418-426,428]

A node in the alloc state is fully allocated to running jobs. A node in the mix state has some of its CPUs allocated while others are idle. The asterisk next to the htc partition marks it as the default partition for all jobs.
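As a rough illustration, the node counts per state can be totalled from the sinfo output with a small shell helper. This is only a sketch: count_nodes_by_state is a hypothetical name, and it assumes the default sinfo column layout shown above, with the node count in the fourth column and the state in the fifth.

```shell
# Hypothetical helper: total the NODES column for each STATE.
# Assumes default sinfo columns (field 4 = NODES, field 5 = STATE).
# A header line, if present, is skipped because its NODES field is not numeric.
count_nodes_by_state() {
    awk '$4 ~ /^[0-9]+$/ { n[$5] += $4 } END { for (s in n) print s, n[s] }' | sort
}

# Query the live cluster only when Slurm is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -p htc | count_nodes_by_state
fi
```

For the sinfo output shown above, the helper would report alloc 16 and mix 4.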

squeue shows the list of running and queued jobs.

The most common states for jobs in squeue are described below. See man squeue for more details.

Abbreviation  State       Description
CA            CANCELLED   Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD            COMPLETED   Job has terminated all processes on all nodes.
CG            COMPLETING  Job is in the process of completing. Some processes on some nodes may still be active.
F             FAILED      Job terminated with non-zero exit code or other failure condition.
PD            PENDING     Job is awaiting resource allocation.
R             RUNNING     Job currently has an allocation.
TO            TIMEOUT     Job terminated upon reaching its time limit.
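The state abbreviations above can be used to summarize your own jobs. The following is a minimal sketch, not a site-provided tool: count_jobs_by_state is a hypothetical helper, and the squeue call assumes the standard -h (no header), -u (user) and -o "%t" (short state code) options.

```shell
# Hypothetical helper: count how many jobs are in each state.
# Expects one state abbreviation (R, PD, CG, ...) per line on standard input.
count_jobs_by_state() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Query the live queue only when Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header, -u limits output to one user, %t prints the short state.
    squeue -h -u "$USER" -o "%t" | count_jobs_by_state
fi
```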

See man squeue for a complete description of the possible REASONS for pending jobs.
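To see why your own jobs are pending, the reason can be printed next to each job ID. This is a sketch under assumptions: pending_reasons is a hypothetical helper, and the "|" separator is chosen here only because some reasons contain spaces.

```shell
# Hypothetical helper: print "<jobid> <reason>" for every pending job.
# Expects lines of the form "jobid|state|reason" on standard input.
pending_reasons() {
    awk -F'|' '$2 == "PD" { print $1, $3 }'
}

# Query the live queue only when Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # %i = job ID, %t = short state code, %r = reason for the current state.
    squeue -h -u "$USER" -o "%i|%t|%r" | pending_reasons
fi
```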

  • Note: If your job cannot be killed with the scancel command, it may be stuck in the COMPLETING state while it performs an I/O operation. Submit a ticket requesting that the node be set to the DOWN state and then brought back up.

To see when all pending jobs are expected to start, run squeue --start.
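The start-time report can also be filtered for jobs that have no estimate yet. The sketch below assumes that squeue prints N/A in the start-time field when the scheduler has not computed an estimate; jobs_without_estimate is a hypothetical helper name.

```shell
# Hypothetical helper: list the IDs of jobs with no estimated start time.
# Expects lines of the form "jobid|start_time" on standard input and
# assumes an unknown start time is reported as N/A.
jobs_without_estimate() {
    awk -F'|' '$2 == "N/A" { print $1 }'
}

# Query the live queue only when Slurm is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # %i = job ID, %S = actual or expected start time.
    squeue -h --start -u "$USER" -o "%i|%S" | jobs_without_estimate
fi
```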

The scontrol show job command displays detailed information about a job:

scontrol show job <jobid>
  • Note: Not all jobs have a definite start time.
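Because scontrol show job prints many Key=Value pairs, it can help to pull out only the fields of interest. This is a sketch under assumptions: job_fields is a hypothetical helper, JOBID is a hypothetical environment variable holding your job ID, and the helper only handles values that contain no spaces.

```shell
# Hypothetical helper: keep only the JobState, Reason and StartTime fields.
# Splits the Key=Value pairs onto separate lines, so values that
# contain spaces (for example some job names) are not handled.
job_fields() {
    tr -s ' ' '\n' | grep -E '^(JobState|Reason|StartTime)='
}

# Query the live cluster only when Slurm is installed and JOBID is set.
if command -v scontrol >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
    scontrol show job "$JOBID" | job_fields
fi
```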