New GPU PREEMPT Partition Announcement

Dear CRC Users,

Our GPU cluster's A100_MULTI partition currently hosts 10 high-powered compute nodes, each equipped with 4 A100 GPUs. This setup is tailored for the most demanding multi-node, multi-GPU workloads and requires a minimum of 2 nodes (8 GPUs) per job. Despite its robust capabilities, the partition sometimes sits idle because only a limited number of research groups have jobs that meet the minimum node/GPU requirements.

To maximize resource utilization, we're excited to introduce PREEMPT—a dynamic new partition open to all Pitt CRC users. PREEMPT uses the same high-performance nodes as the A100_MULTI partition but imposes no minimum node/GPU constraints.

It is important to note that jobs running on PREEMPT are subject to preemption: they may be terminated and requeued whenever a job meeting A100_MULTI's minimum requirements is submitted. Preempted jobs are placed back in the queue and resume once resources become available. Please be aware that jobs run on PREEMPT incur no charges and are not deducted from your group's allocation. To direct your job to the PREEMPT partition, simply specify "preempt" for the "--partition" parameter in your submission script.
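Because preempted jobs are requeued and restarted from the beginning, long-running workloads benefit from periodic checkpointing. A minimal sketch of a preemption-tolerant job body is shown below; `--requeue`, `--signal`, and the bash `trap` are standard Slurm/bash features, but `my_checkpoint_command` and `my_application` are placeholders you would replace with your own application's commands, and the exact grace period before termination depends on the cluster's preemption configuration:

```shell
#SBATCH --requeue                        # allow Slurm to requeue this job after preemption
#SBATCH --signal=B:TERM@60               # ask Slurm to signal the batch shell ~60 s before the kill

# Trap the termination signal so the job can save its state before exiting.
# "my_checkpoint_command" is a placeholder for your application's own
# checkpoint mechanism.
trap 'my_checkpoint_command; exit 143' TERM

# Run the workload in the background and wait on it, so the trap can fire
# while the application is still running.
srun my_application &
wait
```

On restart, your application would then reload the latest checkpoint if one exists; how that is done is entirely application-specific.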

Below is a template submission script for the PREEMPT partition on the GPU cluster:

#!/bin/bash
#SBATCH --job-name=my_awesome_gpu_job
#SBATCH --cluster=gpu
#SBATCH --partition=preempt
#SBATCH --nodes=1                        # node count
#SBATCH --ntasks-per-node=1              # total number of tasks per node
#SBATCH --cpus-per-task=16               # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=256G                       # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1                     # number of gpus per node
#SBATCH --time=3-00:00:00                # total run time limit (DD-HH:MM:SS)
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

# Load any required modules and launch your application here
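Assuming the template above is saved as a file (the name `job.slurm` below is just an example), it can be submitted and monitored with the standard Slurm commands:

```shell
# Submit the script; the cluster and partition are taken from the
# #SBATCH directives inside the file
sbatch job.slurm

# List your jobs on the gpu cluster, including any that were
# preempted and requeued
squeue -M gpu -u $USER
```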

If you experience any difficulties or need help tailoring your job script, please submit a support ticket.

Thank you,

The CRC Team