EEMCS-HPC Specific resources (gpu), features and partitions

The generic consumable resources are:

  • gpu[:pascal/turing/ampere/lovelace][:amount] (currently we only have Pascal-, Turing-, Ampere- and Lovelace-based GPUs). For example, to request one GPU of any type:
#SBATCH --gres=gpu:1 

Keep in mind that for GPU jobs you need to load the module for the required CUDA version!

Once you request one or more GPUs, the scheduler sets the environment variable CUDA_VISIBLE_DEVICES for your job.

It points to the GPU(s) assigned to your job; use only those and no others!
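
A job script that ties these points together might look like the sketch below. The module name cuda/12.1 is only a placeholder (check `module avail` for the versions actually installed), and count_assigned_gpus is a hypothetical helper, not something the cluster provides.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1

# Placeholder module name: replace it with a CUDA version listed by `module avail`.
CUDA_MODULE="cuda/12.1"

# Load the CUDA module when the `module` command is available in this shell.
if command -v module >/dev/null 2>&1; then
    module load "$CUDA_MODULE"
fi

# Hypothetical helper: print how many GPU ids are listed in CUDA_VISIBLE_DEVICES.
count_assigned_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        printf '%s\n' "$CUDA_VISIBLE_DEVICES" | awk -F',' '{ print NF }'
    fi
}

echo "Assigned GPUs: ${CUDA_VISIBLE_DEVICES:-none} (count: $(count_assigned_gpus))"
```

Printing the variable at the start of a job makes it easy to confirm in the job log that the program only sees the GPUs it was given.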

Some GPU boards are fitted with NVLink modules, which let you use two GPUs together and so double the available GPU memory and computing power. If you request two GPUs with NVLink, you need to force socket binding using the following option:

#SBATCH --sockets-per-node=1 
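
To check from inside a job whether NVLink is present, the output of `nvidia-smi topo -m` can be inspected: it marks NVLink connections as NV1, NV2, and so on. The sketch below assumes nvidia-smi is installed on the compute node; has_nvlink is a hypothetical helper. Depending on how the node is configured, the matrix may list every GPU in the node rather than only the assigned ones.

```shell
#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH --sockets-per-node=1

# Hypothetical helper: succeed if the topology matrix read from stdin
# contains an NVLink entry (NV followed by a number).
has_nvlink() {
    grep -Eq '(^|[[:space:]])NV[0-9]+([[:space:]]|$)'
}

# Only query the topology when nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi topo -m | has_nvlink; then
        echo "NVLink entry found in the GPU topology"
    else
        echo "No NVLink entry found in the GPU topology"
    fi
fi
```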

The available features are:

  • amd/intel refers to the manufacturer of the CPUs
  • avx/avx2/avx512 refers to the AVX, AVX2 and AVX-512 instruction sets, available in the newer nodes (needed by, for example, Keras and TensorFlow v1.6 and above)
  • tesla refers to the Tesla family cards (ctit080)
  • geforce refers to the GeForce family cards (ctit[081-083])
  • quadro refers to the Quadro family cards (ctit[084-085])
  • p100 refers to the specific Tesla P100 model (ctit080)
  • titan-x refers to the specific Titan-X model (ctit[081-082])
  • gtx-1080ti refers to the specific GeForce GTX-1080Ti model (ctit083)
  • rtx-6000 refers to the specific Quadro RTX-6000 model (ctit[084-085])
  • rtx-2080ti refers to the specific GeForce RTX-2080Ti model (ctit088)
  • titan-xp refers to the specific Titan-Xp model (ctit088)
  • a40 refers to the specific Tesla A40 model (ctit086, ctit090..094)
  • a100 refers to the specific Tesla A100 model (ctit089)
  • l40 refers to the specific Tesla L40 model (hpc-node01..08)

For example, to restrict a job to A40 GPUs:

#SBATCH --constraint=a40 
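
SLURM also accepts several features in one constraint, where `|` means OR and `&` means AND. The sketch below accepts either an A40 or an A100 node. It also shows a hypothetical helper, nodes_with_feature, that filters `sinfo` output (node name followed by its feature list) to find the nodes advertising a given feature.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="a40|a100"

# Hypothetical helper: read "<node> <comma-separated features>" lines from
# stdin and print the nodes whose feature list contains the feature in $1.
nodes_with_feature() {
    awk -v want="$1" '{
        n = split($2, feats, ",")
        for (i = 1; i <= n; i++) {
            if (feats[i] == want) { print $1; break }
        }
    }'
}

# List the nodes that advertise the a40 feature (one line per node).
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -N -o "%N %f" | nodes_with_feature a40 | sort -u
fi
```

The `sort -u` is there because node-oriented `sinfo` output repeats a node once for each partition it belongs to.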

The HPC/SLURM cluster contains multiple common partitions:

Partition name   Nodes                              Available to
main             ctit080..91                        All
debug            all                                admin
dmb              ctit084..085, 88, 92, hpc-node07   eemcs-dmb
ram              ctit086, ctit089                   eemcs-ram
bdsi             ctit087                            bms-bdsi
mia              ctit090..91,93..94,hpc-node05      eemcs-mia
am               hpc-node01..04                     eemcs-(dmmp/macs/mast/mia/mms/sor/stat)
mia-pof          hpc-node06                         eemcs-mia & tnw-pof
students         hpc-node08                         eemcs-students

The main partition is the default partition and is used to submit a job to any of the nodes. The debug partition is for testing purposes only. Access to the other partitions is limited to their funders during the first year of investment; these nodes can be reached using the funder's partition.

Including multiple partitions is also possible. For example:

  • EEMCS-DMB members
#SBATCH --partition=main,dmb
  • EEMCS-MIA members
#SBATCH --partition=main,am,mia
  • EEMCS-Students
#SBATCH --partition=main,students
  • etc.
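
When several partitions are listed, SLURM starts the job in whichever of them can run it earliest. A minimal sketch of a job that records where it ended up, using the standard SLURM environment variables SLURM_JOB_PARTITION and SLURMD_NODENAME; report_placement is a hypothetical helper:

```shell
#!/bin/bash
#SBATCH --partition=main,dmb
#SBATCH --gres=gpu:1

# Hypothetical helper: print the partition and node the job was placed on,
# falling back to "unknown" when run outside of a SLURM job.
report_placement() {
    echo "partition=${SLURM_JOB_PARTITION:-unknown} node=${SLURMD_NODENAME:-unknown}"
}

report_placement
```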

See the EEMCS-HPC Hardware page for all the partition definitions.