H2P Cluster

H2P (Hail to Pitt!) is the main computational cluster at the Center for Research Computing (CRC). This document presents all the information you need to use H2P.

Here is the table of contents: 

  • Access to H2P
    • Off-campus access (setting up the VPN)
    • On-campus access
  • Node configuration
  • Application environment
    • Installed packages
  • Slurm Workload Manager
    • Slurm jobs
    • Service unit
    • Fair share and priority
    • Example batch scripts
    • PBS to Slurm commands
  • CRC wrappers

Access to H2P

CRC computational resources are housed off campus at the University’s main data center. CRC clusters are firewalled, so you cannot access them directly from off campus.

Off-campus access

If you are off campus, the cluster is accessible securely from anywhere in the world via a Virtual Private Network (VPN), a service provided by CSSD. The VPN requires client software to run on your system, and several alternatives are available to cover almost all systems and configurations.

Windows and Mac

Download and install Pulse VPN and follow the setup steps shown in the screenshots below:

[Pulse VPN setup screenshots: pulse1.png through pulse9.png]

Linux

VPNC is a command-line VPN client that may be the most convenient way to connect for some Linux users.

Most distributions provide prebuilt packages, or you can download the source and build it yourself.

Once installed, download the configuration file here (requires login) and move the file to /etc/vpnc/pitt.conf. Then:

  • Edit the file, replacing YourPittUsername_HERE and YourPittPassword_HERE with your Pitt username and password.
  • Run sudo vpnc pitt to connect; to stop, run sudo vpnc-disconnect.
    • vpnc-disconnect kills only the most recent vpnc instance.
    • Kill all instances with sudo killall vpnc.
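
For reference, the downloaded pitt.conf follows the standard vpnc configuration file format. A rough sketch with placeholder values (the actual gateway and group entries come from the file you download above) looks like this:

IPSec gateway <gateway address from the downloaded file>
IPSec ID <group ID from the downloaded file>
IPSec secret <group secret from the downloaded file>
Xauth username YourPittUsername_HERE
Xauth password YourPittPassword_HERE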

On-campus access

To use CRC resources, users must first have a valid Pitt ID and formally request an account. Once you have valid login credentials, the clusters can be accessed via SSH. For example, to connect to H2P:

$ ssh pittID@h2p.crc.pitt.edu

Your username is your PittID and your password is the same as your campus-wide Pitt password.

Windows

Download and install Xming or PuTTY (use the Windows MSI installer package), then connect to h2p.crc.pitt.edu with your Pitt credentials.

Mac and Linux

Open your favorite terminal emulator and connect with ssh as shown above.
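
For example, replacing pittID with your own username (the -X flag is optional and enables X11 forwarding for graphical tools such as sview):

$ ssh -X pittID@h2p.crc.pitt.edu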

Node configuration

The H2P cluster is divided into three clusters:

  1. Shared Memory Parallel (smp): Meant for single node jobs.
  2. Graphics Processing Unit (gpu): The GPU partition, made up of Titan, Titan X, K40, and GTX 1080 nodes.
  3. Distributed Memory (mpi): The multi-node partition. Meant for massively parallel Message Passing Interface (MPI) jobs.
  • cluster= smp (default)
    • partition= smp (default)
      • 100 nodes of 24-core Xeon Gold 6126 2.60 GHz (Skylake)
      • 192 GB RAM
      • 256 GB SSD & 500 GB SSD
      • 10GigE
    • partition= high-mem
      • 29 nodes of 12-core Xeon E5-2643v4 3.40 GHz (Broadwell)
      • 256 GB RAM and 512 GB RAM
      • 256 GB SSD & 1 TB SSD
      • 10GigE
  • cluster= gpu
    • Make sure to ask for a GPU! (--gres=gpu:N, where N is the number of GPUs you need; see the example request after this list)
    • partition= gtx1080 (default)
      • 10 nodes with 4 GTX1080Ti
      • 8 nodes with 4 GTX1080
    • partition= titanx
      • 7 nodes with 4 Titan X
    • partition= k40
      • 1 node with 2 K40
    • partition= titan
      • 5 nodes with 4 Titan
  • cluster= mpi
    • partition= opa (default)
      • 96 nodes of 28-core Intel Xeon E5-2690 2.60 GHz (Broadwell)
      • 64 GB RAM/node
      • 256 GB SSD
      • 100 Gb Omni-Path
    • partition= ib
      • 32 nodes of 20-core Intel Xeon E5-2660 2.60 GHz (Haswell)
      • 128 GB RAM/node
      • 56 Gb FDR InfiniBand
    • partition= legacy (nodes moved over from Frank)
      • 88 nodes of 16-core Intel Xeon E5-2650 2.60 GHz
      • 64 GB RAM/node
      • 56 Gb FDR InfiniBand
      • Use --constraint=<feature>, where <feature> can be ivy, sandy, or interlagos
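
As referenced above, a minimal sketch of the #SBATCH lines for a single-GPU job on the default gtx1080 partition might look like the following (the job name and walltime are placeholders):

#!/bin/bash
#SBATCH --job-name=gpu-example   # placeholder job name
#SBATCH --cluster=gpu
#SBATCH --partition=gtx1080
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --ntasks-per-node=1
#SBATCH --time=0-01:00:00        # 1 hour of walltime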

    Application environment

    Cluster administrators use Lmod to provide optimized builds of commonly used software. Applications are made available to users through the Lmod module environment commands. No modules are loaded by default when you log in.

    Installed packages

    To find the module you are looking for, simply use the spider command that Lmod offers:

    $ module spider intel
    

    You will find two versions of the Intel compilers. Check which version is compatible with your code and load that one. There are also other modules with “intel” in the module name. You can list every module whose name contains the word “intel” with:

    $ module -r spider '.*intel.*'

    If you need the MPI version of the Intel compilers, load the intel-mpi module (for example, intel-mpi/2017.3.196).

    For safety, Lmod allows only one version of a package to be loaded at a time. So if a user does:

    $ module load intel/2011.12.361
    $ module load intel/2017.1.132

    The intel 2011 version is loaded first; loading the intel 2017 version then automatically unloads the 2011 version.

    To unload a module, a user simply does

    $ module unload package1 package2 ...

    To unload all modules, a user simply does

    $ module purge

    In the example below, I have loaded the intel and intel-mpi modules as prerequisites to load the vasp package. The executables, such as vasp_std, vasp_gam, and vasp_ncl, are now in my PATH.

    [shs159@login1 ~]$ module load vasp
    Lmod has detected the following error: These module(s) exist but cannot be loaded as
    requested: "vasp"
    
     Try: "module spider vasp" to see how to load the module(s).
    
    [shs159@login1 ~]$ module load intel/2017.1.132
    [shs159@login1 ~]$ module load intel-mpi/2017.1.132
    [shs159@login1 ~]$ module load vasp
    vasp vasp/5.4.1 vasp-vtst vasp-vtst/5.4.1
    [shs159@login1 ~]$ module load vasp/5.4.1
    [shs159@login1 ~]$ vasp_
    vasp_gam vasp_ncl vasp_std
    

    You can check which modules are “loaded” in your environment by using the command module list

    [shs159@login1 ~]$ module list
    
    Currently Loaded Modules:
     1) intel/2017.1.132 2) intel-mpi/2017.1.132 3) vasp/5.4.1

    Slurm Workload Manager

    The H2P cluster uses Slurm for batch job queuing. The sinfo -M command provides an overview of the state of the nodes within the cluster. The -M flag for sinfo, scontrol, sbatch, and scancel specifies which cluster the command refers to. By default, without the -M flag, all commands refer to the smp cluster.

    [shs159@login0 ~]$ sinfo -M smp,gpu,mpi
    CLUSTER: gpu
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gtx1080*     up   infinite      8    mix gpu-stage[08-15]
    titanx       up   infinite      6    mix gpu-stage[01-04,06-07]
    titanx       up   infinite      1   idle gpu-stage05
    k40          up   infinite      1   idle smpgpu-n0
    titan        up   infinite      1  down* n384
    
    CLUSTER: mpi
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    opa*         up   infinite      3    mix opa-n[80,89-90]
    opa*         up   infinite     47  alloc opa-n[0-9,12,15-23,32-37,39-45,60-64,72-75,81-84,86]
    opa*         up   infinite     46   idle opa-n[10-11,13-14,24-31,38,46-59,65-71,76-79,85,87-88,91-95]
    legacy       up   infinite     20   idle legacy-n[0-19]
    
    CLUSTER: smp
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    smp*         up   infinite      6    mix smp-n[42,56-58,63,65]
    smp*         up   infinite      3  alloc smp-n[44,46,62]
    smp*         up   infinite     91   idle smp-n[24-41,43,45,47-55,59-61,64,66-123]
    high-mem     up   infinite     29   idle smp-256-n[1-2],smp-512-n[1-2],smp-n[0-23],smp-nvme-n1
    

    Nodes in the alloc state have jobs running on them. The asterisk next to a partition name marks the default partition for jobs submitted to that cluster.

    squeue -M shows the list of running and queued jobs.

    The most common states for jobs in squeue are described below. See man squeue for more details.

    ABBREVIATION  STATE       DESCRIPTION
    CA            CANCELLED   Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.
    CD            COMPLETED   Job has terminated all processes on all nodes.
    CG            COMPLETING  Job is in the process of completing. Some processes on some nodes may still be active.
    F             FAILED      Job terminated with a non-zero exit code or other failure condition.
    PD            PENDING     Job is awaiting resource allocation.
    R             RUNNING     Job currently has an allocation.
    TO            TIMEOUT     Job terminated upon reaching its time limit.

    To see when jobs are expected to start, run squeue --start. See man squeue for a complete description of the possible REASON codes for pending jobs.

    scontrol shows detailed information about a specific job:

    $ scontrol -M <cluster> show job <jobid>
    • Note: not all jobs have a definite start time.

    Slurm Jobs

    The three most important commands in Slurm are sbatch, srun, and scancel. sbatch is used to submit a job script, like the one below (called example.sbatch), to the queue. srun is used to run parallel jobs on compute nodes. Jobs can be canceled with scancel.

    #!/bin/bash
    #SBATCH --job-name=<job_name>
    #SBATCH --nodes=<number of nodes> #number of nodes requested
    #SBATCH --ntasks-per-node=1
    #SBATCH --cluster=mpi # mpi, gpu and smp are available in H2P
    #SBATCH --partition=<partition> # available: smp, high-mem, opa, gtx1080, titanx, k40
    #SBATCH --mail-user=<user_ID>@pitt.edu #send email to this address if ...
    #SBATCH --mail-type=END,FAIL # ... job ends or fails
    #SBATCH --time=6-00:00:00 # 6 days walltime in dd-hh:mm:ss format
    #SBATCH --qos=long # required if walltime is greater than 3 days
    module purge #make sure the modules environment is sane
    module load intel/2017.1.132 intel-mpi/2017.1.132 fhiaims/160328_3
    cp <inputs> $SLURM_SCRATCH # Copy inputs to scratch
    cd $SLURM_SCRATCH
    # Set a trap to copy any temp files you may need
    run_on_exit(){
     cp -r $SLURM_SCRATCH/* $SLURM_SUBMIT_DIR
    }
    trap run_on_exit EXIT 
    srun <job executable with parameters> # Run the job
    crc-job-stats.py # gives stats of job, wall time, etc.
    cp <outputs> $SLURM_SUBMIT_DIR # Copy outputs to submit directory
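
    The script above (example.sbatch) could then be submitted and monitored, for example, like this; the #SBATCH --cluster line inside the script routes it to the mpi cluster:

    $ sbatch example.sbatch        # submit the job script
    $ squeue -M mpi -u $USER       # check the state of your jobs
    $ scancel -M mpi <jobid>       # cancel the job if needed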

    Service Units

    One service unit (SU) is approximately equal to 1 core hour of computing. The charge is calculated based on 2 factors:

    • Number of cores requested
    • RAM requested

    We charge whichever of the two factors is larger. If you request the default RAM (Requested Cores * Total RAM / Total Cores), you are charged exactly the same amount for cores and RAM. Additionally, each of the partitions (see Clusters and Partitions) has a scaling factor that compensates for the cost of the hardware itself. Some examples (using a scale factor of 1):

    • 12 core job, default RAM, 2 hours: 24 SUs
    • 1 core job, all RAM on 12 core machine, 4 hours: 48 SUs
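
    As a rough sketch (with hypothetical node numbers, not an official CRC formula), the charge works out to scale factor * max(cores requested, RAM requested expressed in core-equivalents) * hours. For the second example above:

    # Hypothetical: 1 core, all the RAM on a 12-core, 192 GB node, 4 hours, scale factor 1
    cores=1; ram_gb=192; total_cores=12; total_ram_gb=192; hours=4; scale=1
    ram_cores=$(( ram_gb * total_cores / total_ram_gb ))     # RAM in core-equivalents (12)
    charged=$(( cores > ram_cores ? cores : ram_cores ))     # take the larger of the two factors
    echo $(( scale * charged * hours ))                      # prints 48 (SUs)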

    Below, I break down H2P into clusters (numbered) and partitions (nested beneath each cluster) and provide the scaling factors discussed in Service Units.

    1. smp [default] (Shared Multiprocessing): Single node (OpenMP) and single core jobs
      1. smp [default]: 0.8
      2. high-mem: 1
    2. gpu (Graphics Processing Units): Single- or multi-GPU processing
      1. gtx1080 [default]: 1 per GPU card
      2. titanx: 3 per GPU card
      3. titan: 0.333 per GPU card
      4. k40: 6 per GPU card
    3. mpi (Message Passing Interface): Multi-Node Jobs
      1. opa [default]: 1
      2. legacy: 0.333
      3. ib: 1

    You can simply look at this information by running scontrol -M smp,mpi,gpu show partition.

    Fair Share and Priority

    Slurm allows all groups to have equal opportunity to run calculations. If we consider the entire computing resource as a piece of pie, each group gets an equal piece of the pie. That group’s piece is then distributed equally to each user in the group. This concept is called “Fair Share”. The fair share is a multiplicative factor in computing a job’s “Priority”. At Pitt, we use a multi-factor priority system which includes:

    • Age – How long has the job been in the queue
    • Fair share – Has everyone had an equal opportunity to compute
    • Job size – Large core counts with short wall times are prioritized
    • Quality of service (QOS) – Shorter wall times are better
      • short (max: 24 hours) – QOS factor multiplied by 2
      • normal (max: 3 days) – QOS factor multiplied by 1
      • long (max: 6 days) – QOS factor multiplied by 0

    To compute a job's priority, Slurm sums all of the individual weighted factors. Jobs with higher priority are scheduled to run first.
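
    As an illustration only (the weights and factor values below are invented, not CRC's actual configuration), the priority is effectively a weighted sum of the factors listed above:

    # Invented weights and normalized factor values; the QOS multiplier comes from the list above
    age=0.30; fairshare=0.75; jobsize=0.10; qos=2
    echo "1000*$age + 10000*$fairshare + 1000*$jobsize + 1000*$qos" | bc   # prints 9900.00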

    Basic Commands

    • sinfo – Quick view of hardware allocated and free
    • smap – More visual version of sinfo using ncurses
    • sview – Graphical interface of hardware (requires X11).
    • sbatch <job_script> – Submit a job file
    • squeue – View all running jobs
    • squeue -u <user> – View particular <user>’s jobs (could be you)
    • sshare – View fairshare information
    • sprio – View queued jobs’ priorities
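
    For example, to look at your own jobs and your group's fair-share standing on a specific cluster:

    $ squeue -M smp -u $USER     # your jobs on the smp cluster
    $ sshare -M smp              # fair-share information for your account
    $ sprio -M smp               # priorities of queued jobs on the smp cluster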

     

    PBS to Slurm commands

    If you are a PBS/Torque user migrating to Slurm, the following table shows equivalent commands and options for PBS and Slurm job scripts.

    Command          PBS/Torque                                             Slurm
    Job submission   qsub job_script                                        sbatch job_script
    Job submission   qsub -q queue -l nodes=1:ppn=16 -l mem=64g job_script  sbatch --partition=queue --nodes=1 --ntasks-per-node=16 --mem=64g job_script
    Node count       -l nodes=count                                         --nodes=count
    Cores per node   -l ppn=count                                           --ntasks-per-node=count
    Memory size      -l mem=16384                                           --mem=16g
    Job name         -N name                                                --job-name=name

    The sbatch arguments here are the minimal subset required to accurately specify a job on the H2P cluster. Please refer to man sbatch for more options.

    SBATCH ARGUMENT     DESCRIPTION
    --nodes             Maximum number of nodes to be used by each job step.
    --tasks-per-node    Number of tasks to be launched per node.
    --cpus-per-task     Advise the Slurm controller that ensuing job steps will require this many processors per task.
    --error             File to which standard error is redirected.
    --job-name          The job name.
    --time              Total time required for the job, in days-hh:mm:ss format.
    --cluster           Cluster to submit the job to. smp, mpi, and gpu are the available clusters on H2P.
    --partition         Partition to submit the job to: smp and high-mem on the smp cluster; opa, ib, and legacy on the mpi cluster; gtx1080, titan, titanx, and k40 on the gpu cluster.

    srun also takes the --nodes, --tasks-per-node, and --cpus-per-task arguments, which allow each job step to change the resources it uses, but they cannot exceed those given to sbatch. The above arguments can be provided in a batch script by preceding them with #SBATCH. Note that the shebang (#!) line must be present. The shebang line can call any shell or scripting language available on the cluster, for example #!/usr/bin/env bash.
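
    For instance (the program names are placeholders), job steps inside a batch script can each use part of the allocation requested from sbatch:

    #SBATCH --ntasks=4
    srun --ntasks=2 ./program_a   # first job step uses 2 of the 4 allocated tasks
    srun --ntasks=4 ./program_b   # second job step uses the full allocation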

    Slurm is very explicit in how one requests cores and nodes. While extremely powerful, the three flags --nodes, --ntasks, and --cpus-per-task can be a bit confusing at first.

    --ntasks vs. --cpus-per-task

    The term “task” in this context can be thought of as a “process”. A multi-process program (e.g. MPI) therefore consists of multiple tasks, which are requested with the --ntasks flag. A multithreaded program consists of a single task, which can in turn use multiple CPUs; those CPUs are requested with the --cpus-per-task flag. Individual tasks cannot be split across multiple compute nodes, so requesting CPUs with --cpus-per-task always results in all of your CPUs being allocated on the same compute node.
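
    For example (the resource counts are illustrative), an MPI job requests tasks, while a multithreaded job requests CPUs for a single task:

    # MPI program: 56 processes spread across 2 nodes
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=28

    # Multithreaded (e.g. OpenMP) program: 1 task with 12 CPUs, all on one node
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=12
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK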

    CRC wrappers

    The CRC team wanted to make your lives a little easier, so we wrote some scripts to save you time. The scripts are written in Python and Perl, and all of them accept -h and --help to print usage information.

    What’s Available?

    • crc-sinfo.py: Will show you an overview of the current hardware status.
    • crc-squeue.py: Look at your jobs in a convenient way
      • crc-squeue.py --start: Show approximate start time for your jobs, won’t show if you hit association limits
      • crc-squeue.py --watch: Watch your jobs as they progress (updates every 10 seconds)
      • crc-squeue.py --all: Show all the jobs on the cluster
    • crc-scancel.py: Cancel the job with the given JobID
    • crc-usage.py: Show your usage on each cluster
      • For now, this will only show your primary group. Try groups | cut -d' ' -f1 to find your primary group.
    • crc-interactive.py: Run interactive jobs on the cluster
      • To submit an interactive job, you should use the CRC wrapper:

        crc-interactive.py --smp --time=1 --num-cpus=2

        would give you an interactive job for 1 hour on SMP with 2 processors. When the interactive job starts, you will notice that you are no longer on a login node, but rather one of the compute nodes.

        [shs159@smp-n2 ˜]$

        Try crc-interactive.py -h for more details.

    • crc-job-stats.py
      • This script is meant to be added at the bottom of your Slurm scripts (after srun) to print statistics about your job, such as wall time.