Slurm Workload Manager
The HTC cluster uses Slurm for batch job queuing. The htc partition contains 16 compute nodes and is the default partition. The sinfo command gives an overview of the state of the nodes in the cluster.
[fangping@login0a ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
htc*      up    6-00:00:00     4 mix   n[410,413,417,427]
htc*      up    6-00:00:00    16 alloc n[409,411-412,414-416,418-426,428]
Nodes in the alloc state are fully allocated to one or more jobs. Nodes in the mix state have some of their CPUs allocated while others are idle. The asterisk next to the htc partition name marks it as the default partition for all jobs.
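As a rough illustration of reading the sinfo output, the sketch below sums the NODES column per STATE. The function name nodes_by_state is hypothetical, and the script assumes the default six-column layout shown above; a site that customizes the sinfo format would need different column numbers.

```shell
# Sum the NODES column of sinfo output for each STATE.
# Assumes the default column layout:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# nodes_by_state is a hypothetical helper name, not part of Slurm.
nodes_by_state() {
    # Skip the header line, then add column 4 (NODES) into a
    # per-state total keyed by column 5 (STATE).
    awk 'NR > 1 { count[$5] += $4 }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage on a login node (requires Slurm):
#   sinfo | nodes_by_state
```

Piping the sample output above through this function would report one line per state with its node total, which makes it easier to see at a glance how much of the partition is busy.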
squeue shows the list of running and queued jobs.
The most common states for jobs in squeue are described below. See man squeue for more details.
| Code | State | Description |
|------|-------|-------------|
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
See man squeue for a complete description of the possible REASONS for pending jobs.
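To get a quick summary of how many jobs are in each of the states listed above, the compact state codes can be tallied. This is a minimal sketch: jobs_by_state is a hypothetical helper name, and it assumes its input is one state code per line, which is what squeue prints with the -h (no header) option and the %t format specifier.

```shell
# Count jobs per compact state code (R, PD, CG, ...).
# Reads one state code per line from standard input.
# jobs_by_state is a hypothetical helper name, not part of Slurm.
jobs_by_state() {
    # NF skips blank lines; each remaining line increments the
    # counter for the state code in its first field.
    awk 'NF { count[$1]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage on a login node (requires Slurm):
#   squeue -h -o "%t" | jobs_by_state              # all jobs
#   squeue -h -u "$USER" -o "%t" | jobs_by_state   # only your own jobs
```

Keeping the counting separate from the squeue call means the same function works on saved output as well as on a live queue.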
To see when jobs are expected to start, run squeue --start.
- Note: not all jobs have a definite start time.
The scontrol command shows detailed information about a single job.
scontrol show job <jobid>
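The output of scontrol show job is a series of Key=Value pairs such as JobState, Reason, and StartTime. The sketch below pulls out a single field. The name job_field is hypothetical, and the approach assumes the value contains no spaces, so it is not suitable for fields such as a job name or command line that may include them.

```shell
# Print the value of one Key=Value field from scontrol show job output.
# $1 is the field name, for example JobState, Reason, or StartTime.
# job_field is a hypothetical helper name, not part of Slurm.
# Limitation: values that contain spaces are truncated at the first space.
job_field() {
    # Put each Key=Value pair on its own line, then print everything
    # after the first "=" for the first pair whose key matches.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '!found && $1 == key {
            print substr($0, length(key) + 2)
            found = 1
        }'
}

# Usage on a login node (requires Slurm):
#   scontrol show job <jobid> | job_field JobState
#   scontrol show job <jobid> | job_field StartTime
```

Checking JobState and Reason together is often the quickest way to find out why a particular job has not started yet.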