Slurm Batch Jobs

A Slurm "job" is an allocation of compute resources assigned to your CRC user account for some amount of time by the Slurm Workload Manager. Once the resources are allocated, you can use them to run commands and process data.

A "batch job" is a type of job that is specified fully by a .slurm submission script. This script contains information about the resources you need and the commands to be run once those resources are allocated to you. Unlike interactive sessions, batch jobs run unattended. Once your job is queued, you can use Slurm commands such as squeue and sacct to check its status.

Submitting a batch job

When you access the clusters, you start on a login node. Your data processing should not be run here.

[nlc60@login0b ~] : hostname
login0b.htc.sam.pitt.edu

Use a batch job to receive an allocation of compute resources and have your commands run there.
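As a safeguard, a script can check that it is running inside a job allocation, rather than on a login node, before doing any heavy work. The sketch below assumes bash and relies on the SLURM_JOB_ID environment variable, which Slurm sets for commands running inside a job; the function name require_slurm_job is hypothetical, not a CRC or Slurm tool.

```shell
#!/bin/bash
# Hypothetical guard: stop unless we are inside a Slurm job allocation.
# Slurm sets SLURM_JOB_ID for commands running in a job; on a login node,
# outside any job, the variable is normally unset.
require_slurm_job() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a Slurm job: submit this script with sbatch instead." >&2
        return 1
    fi
    echo "Running in Slurm job ${SLURM_JOB_ID}"
}
```

Placing `require_slurm_job || exit 1` near the top of a script ends it early if it is accidentally run directly on a login node.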

sbatch is the Slurm command for submitting a script, either a plain shell script or a .slurm submission script, as a batch job. Here is a simple example that submits a bash script as a batch job.

[nlc60@login0b ~] : cat hello_world.sh
#!/bin/bash
echo "hello world"
crc-job-stats.py

[nlc60@login0b ~] : sbatch hello_world.sh
Submitted batch job 946479

[nlc60@login0b ~] : cat slurm-946479.out
hello world
============================================================================
                               JOB STATISTICS
============================================================================
      SubmitTime: 2022-06-17T12:41:39
         EndTime: 2022-06-17T13:41:39
         RunTime: 00:00:00
           JobId: 946479
            TRES: cpu=1,mem=8000M,node=1,billing=1
       Partition: htc
        NodeList: htc-n0
         Command: /ihome/sam/nlc60/hello_world.sh
          StdOut: /ihome/sam/nlc60/slurm-946479.out
More information:
    - `sacct -M htc -j 946479 -S 2022-06-17T12:41:39 -E 2022-06-17T13:41:39`
   Print control:
    - List of all possible fields: `sacct --helpformat`
    - Add `--format=<field1,field2,etc>` with fields of interest
============================================================================

Note the call to the crc-job-stats.py wrapper script. It is a useful addition to any script submitted to Slurm, because its output shows you how to find more information if something goes wrong.

 

Specifying job requirements 

You can add more detail to the requirements of your job by providing sbatch arguments. For example:

[nlc60@login0b ~] : sbatch --nodes=1 --time=0-00:01:00 --ntasks-per-node=1 --qos=short --cluster=smp --partition=smp hello_world.sh

The benefit of using a .slurm submission script is that these arguments are more easily read and reproduced.

Converting the above example from hello_world.sh to hello_world.slurm gives:

[nlc60@login0b ~] : cat hello_world.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=0-00:01:00
#SBATCH --ntasks-per-node=1
#SBATCH --qos=short
#SBATCH --cluster=smp
#SBATCH --partition=smp
echo "hello world"
crc-job-stats.py

The Slurm submission script specifies the additional arguments on lines starting with '#SBATCH'.

sbatch stops processing further #SBATCH directives once it reaches the first line in the script that is neither a comment nor whitespace.
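To make this rule concrete, the sketch below prints the #SBATCH lines that would be honored and skips any that appear after the first command. It assumes bash and standard awk; sbatch_directives is a hypothetical, simplified illustration of the rule, not sbatch's actual parser.

```shell
#!/bin/bash
# Hypothetical, simplified emulation of the rule above: print the #SBATCH
# lines that come before the first line that is neither a comment nor blank.
sbatch_directives() {
    awk '
        NF == 0    { next }          # blank or whitespace-only line: keep going
        /^#SBATCH/ { print; next }   # directive seen before any command: honored
        /^#/       { next }          # other comments, including the #! line
        { exit }                     # first command: stop reading directives
    ' "$1"
}

# Example: the --qos line comes after a command, so it is ignored.
example="$(mktemp)"
cat > "$example" <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=0-00:01:00
echo "hello world"
#SBATCH --qos=short
EOF
sbatch_directives "$example"   # prints only the --nodes and --time lines
rm -f "$example"
```

In practice this means all #SBATCH lines belong at the top of the script, directly under the shebang line.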

Below is a subset of sbatch arguments that can be used to specify a job on the cluster. You do not need to include them all.

Please refer to the Slurm documentation on sbatch for more options.

| Argument | Description | Format |
|---|---|---|
| Common Arguments | | |
| --job-name | The job name. This appears when you check the job's status with squeue. | Something descriptive enough to easily identify and distinguish jobs. Default is the JobID. |
| --nodes | Number of nodes to be used. Usually 1; jobs on the MPI cluster require a minimum of 2. | Default is 1. |
| --ntasks, --ntasks-per-node | --ntasks sets the total number of tasks to launch; --ntasks-per-node sets the number of tasks launched on each node. | Default is 1. |
| --cluster | The cluster that the job will run on. | smp, mpi, gpu, htc |
| --partition | The partition of the cluster that the job will run on. | See Node Configurations |
| --time | The maximum walltime required for the job. | days-hh:mm:ss |
| --qos | The Quality of Service to be used. The default is normal. You need to specify `long` if the walltime is greater than 3 days. | short, normal, long |
| --error | File to redirect standard error to. | Full path, or a filename to be written to the working directory |
| --mem | Memory limit per compute node. | Memory in MB |
| User Notification | | |
| --mail-user | Email address for notifications. | PittID@pitt.edu |
| --mail-type | Conditions for sending notifications. | END for when the job finishes, FAIL for when the job fails while running |
| Use-case / Cluster Specific Arguments | | |
| --cpus-per-task | Advise the Slurm controller that ensuing job steps will require ncpus processors per task. Use this to facilitate multithreading. | |
| --gres | Specify usage of a generic resource. On the GPU cluster this indicates the number of cards needed, and it is required when submitting to the GPU cluster. | --gres=gpu:1 for 1 card |
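The --time format and the 3-day threshold for the long QOS can be checked before submitting. This is a minimal sketch assuming bash and the full days-hh:mm:ss format; the helper names are hypothetical, and they encode only the rule stated in the table (long above 3 days, otherwise the default normal), not the limits of the short QOS.

```shell
#!/bin/bash
# Hypothetical helpers, assuming the full days-hh:mm:ss walltime format.

# Convert a walltime such as 3-12:00:00 into seconds.
walltime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# Rule from the table: "long" when the walltime is greater than 3 days,
# otherwise the default "normal". Limits for "short" are not covered.
suggest_qos() {
    if [ "$(walltime_to_seconds "$1")" -gt $(( 3 * 86400 )) ]; then
        echo long
    else
        echo normal
    fi
}

suggest_qos 0-00:01:00   # normal
suggest_qos 3-00:00:00   # normal (exactly 3 days is not greater than 3 days)
suggest_qos 4-00:00:00   # long
```

Options given on the sbatch command line take precedence over #SBATCH lines in the script, so the result could be passed as, for example, `sbatch --time=4-00:00:00 --qos=$(suggest_qos 4-00:00:00) job.slurm`, where job.slurm is a placeholder name.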

 

Running the .slurm submission script:

[nlc60@login0b ~] : sbatch hello_world.slurm
Submitted batch job 6041483 on cluster smp

[nlc60@login0b ~] : cat slurm-6041483.out
hello world
============================================================================
                               JOB STATISTICS
============================================================================
      SubmitTime: 2022-06-17T13:19:29
         EndTime: 2022-06-17T13:20:30
         RunTime: 00:00:00
           JobId: 6041483
            TRES: cpu=1,mem=4018M,node=1
       Partition: smp
        NodeList: smp-n26
         Command: /ihome/sam/nlc60/hello_world.slurm
          StdOut: /ihome/sam/nlc60/slurm-6041483.out
More information:
    - `sacct -M smp -j 6041483 -S 2022-06-17T13:19:29 -E 2022-06-17T13:20:30`
   Print control:
    - List of all possible fields: `sacct --helpformat`
    - Add `--format=<field1,field2,etc>` with fields of interest
============================================================================

 

A more complex example

Below is a more abstract template that loads modules from the module system, copies inputs to scratch space, runs the job, and copies outputs back.

#!/bin/bash
#SBATCH --job-name=<job_name>
#SBATCH --nodes=<number of nodes>
#SBATCH --ntasks-per-node=<tasks per node>
#SBATCH --cluster=<cluster name>
#SBATCH --partition=<partition>
#SBATCH --mail-user=<user_ID>@pitt.edu
#SBATCH --mail-type=END,FAIL
#SBATCH --time=<days-HH:MM:SS>
#SBATCH --qos=<qos>

# Start from a clean environment, then load the software the job needs
module purge
module load module1 module2

# Stage the inputs in scratch space and work from there
cp <inputs> "$SLURM_SCRATCH"
cd "$SLURM_SCRATCH"

# When the script exits, copy everything in scratch back to the
# directory the job was submitted from
run_on_exit() {
    cp -r "$SLURM_SCRATCH"/* "$SLURM_SUBMIT_DIR"
}
trap run_on_exit EXIT

srun <job executable with parameters>

crc-job-stats.py

cp <outputs> "$SLURM_SUBMIT_DIR"

 

Specify the interpreter

A shebang (#!) line must be present. The shebang line can call any shell or scripting language available on the cluster.

For example, #!/bin/bash, #!/bin/tcsh, #!/usr/bin/env python or #!/usr/bin/env perl.

 

The sbatch arguments are provided in a batch script by preceding them with #SBATCH. The resource-specific arguments (ntasks, mem, nodes, time) set the limits of your job's resources.

 

Module loading 

You'll need to specify which modules your job requires. See the module system page for more details on searching for available software.

It's always a good idea to perform a `module purge` first to make sure the environment is clean.

 

Input Handling

After you load your modules, you can automate any other setup you need to adjust the job's running environment or get your input ready. 

In the example above, this means copying the input data to the scratch location and setting up a trap that copies temporary files and intermediate outputs back to the submission directory when the script exits.
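The scratch-and-trap pattern can be tried without submitting a job. This sketch assumes bash and uses temporary directories from mktemp as stand-ins for SLURM_SCRATCH and SLURM_SUBMIT_DIR, which the cluster provides inside a real job; the file names and the failing step are hypothetical. It shows the EXIT trap copying an intermediate file back even though the job step fails before finishing.

```shell
#!/bin/bash
# Local stand-ins for the directories available inside a real job
# (assumption: running on any machine, so we create temporary ones).
SLURM_SUBMIT_DIR="$(mktemp -d)"
SLURM_SCRATCH="$(mktemp -d)"
echo "input data" > "$SLURM_SUBMIT_DIR/input.txt"

# The "job" runs in a subshell so that its EXIT trap fires when it ends.
(
    cp "$SLURM_SUBMIT_DIR/input.txt" "$SLURM_SCRATCH"
    cd "$SLURM_SCRATCH" || exit 1

    run_on_exit() {
        cp -r "$SLURM_SCRATCH"/* "$SLURM_SUBMIT_DIR"
    }
    trap run_on_exit EXIT

    echo "partial result" > intermediate.txt   # written before the failure
    false || exit 1                             # stand-in for a failing job step
    echo "final result" > final.txt             # never reached
) || echo "job ended with status $?"

ls "$SLURM_SUBMIT_DIR"   # input.txt and intermediate.txt, but no final.txt
```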

By default, the working directory of your job is the directory from which the batch script was submitted. You can use the sbatch argument `--chdir` to adjust this.

 

Start Parallel Job with srun

srun starts your job's executable as a job step. It also takes the --nodes, --ntasks-per-node and --cpus-per-task arguments, which let each job step change the resources it uses, but these cannot exceed the resources requested with sbatch.

 

Report Job Statistics 

Add a call to the `crc-job-stats.py` wrapper script to display statistics for your job. 

 

Output Handling

Automate any manipulation of the job's output files at the end of the script.

 

Interacting with your Job after Submission

scancel - Cancel a job, job array, or job step.

squeue - View information about jobs in the queue. 
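When a job is submitted with --cluster, squeue and scancel need to be pointed at the same cluster, which both accept through -M (--clusters), as sacct does in the job statistics output above. The sketch below assumes bash; followup_commands is a hypothetical helper that only prints the commands, based on sbatch's confirmation message, so it can be tried without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: build follow-up commands from sbatch's confirmation
# line, e.g. "Submitted batch job 6041483 on cluster smp".
followup_commands() {
    local job_id cluster
    job_id="$(echo "$1" | awk '{ print $4 }')"
    cluster="$(echo "$1" | awk '{ print $7 }')"   # empty if no cluster was named
    if [ -n "$cluster" ]; then
        echo "squeue -M $cluster -j $job_id"
        echo "scancel -M $cluster $job_id"
    else
        echo "squeue -j $job_id"
        echo "scancel $job_id"
    fi
}

followup_commands "Submitted batch job 6041483 on cluster smp"
# squeue -M smp -j 6041483
# scancel -M smp 6041483
```

In scripts, sbatch's --parsable option, which prints just the job ID (followed by the cluster name after a semicolon when one is named), may be simpler to parse than the full message.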

 

FAQ

 

Q: Where can I find more specific examples of these batch scripts?

A: Example jobs using commonly loaded modules can be found in

/ihome/crc/how_to_run

For users performing NGS analyses on HTC, see Dr. Fangping Mu's extensive notes on this page

 

Q: I'm confused by the interaction between nodes, tasks, and cpus-per-task. How does that work?

A: Slurm is very explicit in how one requests resources. 

A node is a compute node in the cluster. For example, the default partition of the SMP cluster has 100 nodes.

The term task in this context can be thought of as a “process”, and is related to the number of CPUs/cores you request.

Say you've specified --ntasks=16.

In the first case, this is an HTC, SMP, or GPU job with 16 independent processes. The implicit configuration here is --nodes=1 and --ntasks=16, so --ntasks-per-node=16.

A second case is that you are running a multi-process program on MPI (--nodes=< a number 2 or greater >).

--ntasks=16 alone means your program will create a maximum of 16 processes, but you don't care how the cores are distributed across the nodes.

You can use --ntasks-per-node to be more specific about how many tasks you want running on each of the nodes you're requesting.

A third case is that you have a multithreaded program: a single task that can use multiple cores.

--ntasks=1, --cpus-per-task=16

Here --cpus-per-task specifies that the single task is completed by multithreading across 16 cores.

On HTC, SMP, and GPU, individual tasks cannot be split across multiple compute nodes, so requesting a number of CPUs with --cpus-per-task flag will always result in all your CPUs allocated on the same compute node.
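All three cases request 16 cores in total; what differs is how the cores are grouped. The arithmetic is total cores = nodes × tasks per node × CPUs per task. In the small bash sketch below, total_cpus is a hypothetical helper (not a Slurm command), and the MPI layout of 8 tasks on each of 2 nodes is just one possible choice.

```shell
#!/bin/bash
# Hypothetical helper: total CPU cores requested by a job, computed as
# nodes x tasks per node x CPUs per task (CPUs per task defaults to 1).
total_cpus() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="${3:-1}"
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

total_cpus 1 16      # 16: one node running 16 independent tasks
total_cpus 2 8       # 16: MPI job with 8 tasks on each of 2 nodes
total_cpus 1 1 16    # 16: one multithreaded task using 16 cores
```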

 

Q: Slurm doesn't seem to be taking my .bashrc additions into account. How do I make my job use them?

A: Slurm does not source ~/.bashrc or ~/.profile by default. If your job requires environment settings specified in these files, include

source ~/.bashrc

in your job submission script after any `module load ...` commands.
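If some of your jobs may run where ~/.bashrc does not exist, a guarded version avoids an error message. This is a minimal bash sketch; the function name load_bashrc and the variable SAMPLE_SETTING are hypothetical, and the demonstration points HOME at a temporary directory so it does not touch your real files.

```shell
#!/bin/bash
# Source ~/.bashrc only if it exists. In a job script, call this after any
# `module load ...` lines, as recommended above.
load_bashrc() {
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
}

# Demonstration only: use a throwaway HOME with a one-line .bashrc.
HOME="$(mktemp -d)"
echo 'export SAMPLE_SETTING=from_bashrc' > "$HOME/.bashrc"
load_bashrc
echo "$SAMPLE_SETTING"   # from_bashrc
```

Some .bashrc files stop early for non-interactive shells (for example with a line such as `[[ $- != *i* ]] && return`); settings placed after such a line will not be loaded in a batch job.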

 

Q: I want to submit a job, but I am unsure what allocation I am drawing resources from. How do I check this?

A: This can be determined with the `sacctmgr` (slurm account manager) command. 

sacctmgr show associations onlydefaults | grep USERNAME

Where USERNAME is your CRC username. 

The output shows, for each cluster (first column), which allocation (second column, referred to as GROUPNAME below) is charged by default.

If your user account is associated with multiple PI compute resource allocations, you can run the command above without the `onlydefaults` argument to list all of them, and then specify which one your job will charge with the '-A' or '--account=' argument followed by the group name.

#SBATCH --account=GROUPNAME # Charge GROUPNAME instead of the default.

 

Q: My analyses require that I run the same job on many different input files, or the same input with many different sets of parameters. Is there a better way to submit collections of jobs?

A: Yes, please see this documentation on submitting multiple jobs.
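One common approach is a Slurm job array (for example #SBATCH --array=1-3), where Slurm runs the same script several times and sets SLURM_ARRAY_TASK_ID to a different index in each copy. The bash sketch below shows only the input-selection step; input_for_task and the input file names are hypothetical, and the documentation on submitting multiple jobs remains the reference for the details.

```shell
#!/bin/bash
# Hypothetical helper for a job array: print line N of a file listing inputs,
# where N is the array index that Slurm puts in SLURM_ARRAY_TASK_ID.
# With "#SBATCH --array=1-3", three copies of the script would run with
# SLURM_ARRAY_TASK_ID set to 1, 2 and 3.
input_for_task() {
    sed -n "${2}p" "$1"
}

# Demonstration with a hypothetical list of three input files.
input_list="$(mktemp)"
printf 'input_A.dat\ninput_B.dat\ninput_C.dat\n' > "$input_list"

# Outside a real array job the variable is unset, so fall back to 2 here.
task_id="${SLURM_ARRAY_TASK_ID:-2}"
input_for_task "$input_list" "$task_id"   # input_B.dat
```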