CRC implements cgroup-based resource management

Prior to the maintenance, jobs scheduled on a single node in the SMP or HTC cluster shared that node's pool of memory. For example, one of the AMD EPYC 9374F nodes in the smp partition of the SMP cluster provides up to 768 GB of RAM for all jobs running on that node. Even a single-core job on this node could address up to the maximum memory without explicitly specifying a memory requirement in its job submission script. Consequently, some users would encounter OUT OF MEMORY errors even though they had explicitly requested sufficient memory: other jobs running on the same node were addressing more memory than expected, because the node was not configured to enforce runtime memory usage limits.

To mitigate such scenarios, we have implemented cgroup-based resource management within the CRC ecosystem. This means that the memory-per-core (Mem/Core) limit is now enforced for each job. A job that requests a single core without explicitly specifying memory is allocated the default amount, roughly equal to the node's total memory divided by its total cores. You can use the computational resource page to estimate how many cores to request in order to obtain the memory you need. Alternatively, if you only require a single core but need more memory, use the --mem option in your job submission script:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cluster=smp
#SBATCH --partition=smp
#SBATCH --mem=64g
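
The default per-core allocation described above can be estimated with simple arithmetic. The core count below is an assumed example for illustration, not the actual specification of the smp nodes:

```shell
# Illustration only: assumes a node with 768 GB of RAM and 64 cores
# (the core count is a hypothetical example value).
total_mem_gb=768
total_cores=64
echo "Default allocation: $(( total_mem_gb / total_cores )) GB per core"
echo "A 4-core job gets roughly $(( 4 * total_mem_gb / total_cores )) GB"
```

Under these assumed numbers, a single-core job would default to 12 GB, so a job needing 48 GB could either request 4 cores or request --mem=48g directly.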

After the maintenance, the opposite problem can occur: jobs that did not explicitly and accurately request the memory they need may fail, and fail without an error message in the output log. Jobs that do request memory accurately now run without the risk of an errant job impacting them.

If jobs that ran successfully prior to the maintenance are now crashing, you can use the following command to check whether you need to request more memory:

sacct -u $USER --starttime 2024-01-15 --format=JobID,Account,partition,state%20,time,elapsed,ncpus,nodelist

where the value of the --starttime parameter is in year-month-day format. If your output looks like the following, your job hit the OUT OF MEMORY error and you need to request more memory, either by scaling up the number of requested cores or by explicitly requesting the required memory for your job:


JobID           Account  Partition           State  Timelimit    Elapsed      NCPUS        NodeList
------------ ---------- ---------- --------------- ---------- ---------- ---------- ---------------
13228919       kwong           smp   OUT_OF_MEMORY   03:00:00   00:13:42          1        smp-n214
13228919.ba+   kwong                 OUT_OF_MEMORY              00:13:42          1        smp-n214
13228919.ex+   kwong                 OUT_OF_MEMORY              00:13:42          1        smp-n214
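
To see how much memory a job actually used compared to what it requested, sacct can also report the ReqMem and MaxRSS fields (both are standard sacct columns; adjust the date to your own time window):

```shell
# Peak memory used (MaxRSS, reported per job step) vs. requested
# memory (ReqMem) for recent jobs.
sacct -u $USER --starttime 2024-01-15 \
      --format=JobID,ReqMem,MaxRSS,State%20
```

If MaxRSS approaches or exceeds ReqMem, increase the memory request for the next run.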

Cgroup-based resource management also applies to the GPU cluster. Previously, a job allocated on a GPU node could see all of the GPUs on that node. This setup led to instances where a job assigned a single GPU could address other GPUs, impacting the execution of other jobs. Now, each job will only see the GPUs that it was assigned and cannot use resources beyond what the batch scheduler allocates. As an example, to request 3 GPUs for your AI/ML training, you will need to explicitly specify this value in your job submission script:

#SBATCH --gres=gpu:3
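
A minimal sketch of such a script follows; the cluster name is illustrative, so check the CRC documentation for the actual values on your system:

```shell
#!/bin/bash
#SBATCH --cluster=gpu        # example cluster name (assumption)
#SBATCH --gres=gpu:3         # request 3 GPUs

# Under cgroup enforcement, only the 3 assigned GPUs are visible here:
nvidia-smi -L
echo "Assigned devices: $CUDA_VISIBLE_DEVICES"
```

Running nvidia-smi -L inside the job is a quick way to confirm that only the assigned GPUs are visible.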

Resource management based on cgroups extends beyond GPUs to other resources, such as CPUs and memory, on the GPU cluster. If you know the CPU and/or memory requirements of your program or job, it is crucial to specify them in your submission script so that SLURM allocates the necessary resources. Otherwise, SLURM assigns default values for CPUs and memory, and your program or job is constrained to those defaults, which can lead to the OUT OF MEMORY errors described earlier.
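
For example, a GPU-cluster job that also needs several CPU cores and extra memory might request everything explicitly rather than relying on the defaults (the specific values here are illustrative):

```shell
#SBATCH --cluster=gpu        # example cluster name (assumption)
#SBATCH --gres=gpu:1         # one GPU
#SBATCH --cpus-per-task=4    # explicit CPU request
#SBATCH --mem=32g            # explicit memory request
```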

In conclusion, CRC has now implemented cgroup-based resource management within our computing ecosystem. This is good practice: process isolation prevents jobs running on shared resources from impacting each other, and each user is guaranteed the resources they requested without concern about other jobs exceeding their share.

Tuesday, January 23, 2024