Slurm is a highly configurable open-source workload manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality needed by demanding HPC centers. More complex configurations rely upon a database for archiving accounting records, managing resource limits, and supporting sophisticated scheduling algorithms.

Slurm consists of a worker daemon (slurmd) running on each compute node and a central controller daemon (slurmctld) running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications.
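For example, you can check that the controller is responding and see the state of the compute nodes with the standard Slurm client commands (assuming the client tools are on your path):

    scontrol ping   # reports whether the primary/backup slurmctld is UP or DOWN
    sinfo           # lists partitions and the state of their nodes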

This cluster also supports node features (--constraint) and generic consumable resources (--gres); usage examples follow the lists below.

The available features are:

  • amd/intel, refers to the manufacturer of the CPUs
  • avx/avx2, refers to the AVX and AVX2 instruction sets, available on the newer nodes (needed by newer software, e.g. Keras and TensorFlow v1.6 and above)
  • tesla, refers to the Tesla family of cards (ctit080)
  • geforce, refers to the GeForce family of cards (ctit[081-083])
  • quadro, refers to the Quadro family of cards (ctit[084-085])
  • p100, refers to the specific Tesla P100 model (ctit080)
  • titan-x, refers to the specific Titan-X model (ctit[081-082])
  • gtx-1080ti, refers to the specific GeForce GTX 1080 Ti model (ctit083)
  • rtx-6000, refers to the specific Quadro RTX 6000 model (ctit[084-085])
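For example, a job that should run only on Intel nodes supporting AVX2 could request those features in its batch script (a minimal sketch; the job name and executable are placeholders):

    #!/bin/bash
    #SBATCH --job-name=avx2-test        # placeholder job name
    #SBATCH --constraint="intel&avx2"   # require both the intel and avx2 features
    srun ./my_program                   # placeholder executable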

The generic consumable resources are:

  • gpu[:pascal/turing/ampere][:amount] (currently only Pascal-, Turing-, and Ampere-based GPUs are available)

Keep in mind that for GPU jobs you need to load the module of the required CUDA version!
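A GPU job could then look as follows (a sketch: the CUDA module name and version are examples, so check module avail for the versions installed here; the executable is a placeholder):

    #!/bin/bash
    #SBATCH --gres=gpu:pascal:2   # request two Pascal-generation GPUs
    module load cuda/10.1         # example; load the CUDA version your software requires
    srun ./my_gpu_program         # placeholder executable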

The HPC/SLURM cluster contains multiple partitions:

Partition name    Nodes             Details                                 Available to
main              all                                                       All
r930              caserta                                                   All
r730              ctit080           2x Tesla P100                           All
t630              ctit081..83       2x Titan-X / 4x Titan-X / 4x 1080-Ti    All
gpu_p100          ctit080           2x Tesla P100                           All
gpu_titan-x       ctit081..82       2x Titan-X / 4x Titan-X                 All
gpu_1080-ti       ctit083           4x 1080-Ti                              All
gpu_rtx-6000      ctit084..85       4x Quadro RTX 6000                      eemcs/dmb
gpu_a40           ctit086,90,91     4x Tesla A40                            eemcs/ram-b3care or eemcs/aa
gpu_a100          ctit089           4x Tesla A100                           eemcs/ram
debug             all                                                       admin
dmb               ctit084..85,88                                            eemcs/dmb
ram               ctit089                                                   eemcs/ram
ram_b3care        ctit086                                                   eemcs/ram-b3care
bdsi              ctit087                                                   bms/bdsi

The main partition is the default; jobs submitted without an explicit partition can run on any of the nodes.

To use only specific machine models, you can use one of the following partitions:

  • r930, r730, t630, t640.

To use only machines containing certain GPUs, you can use one of the following partitions (see the example after this list):

  • gpu_p100, gpu_1080-ti, gpu_titan-x, gpu_rtx-6000, gpu_a40, gpu_t4, gpu_a100.
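For example, to run a job on the machines with Tesla P100 cards (assuming myjob.sh is your batch script):

    sbatch -p gpu_p100 --gres=gpu:1 myjob.sh   # submit to the gpu_p100 partition, requesting one GPU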

Access to the following partitions is limited during the first year of investment:

  • ram_b3care, until March 1, 2022
  • bdsi, until March 1, 2022
  • aa, until March 1, 2023
  • dmb, until March 1, 2023
  • ram, until March 1, 2023

The debug partition is for testing purposes only.

Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job which can be scheduled. These numbers can be obtained by running the scontrol show config command on korenvliet.
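For example, to list the relevant limits (the exact values are site configuration, so check the live output):

    scontrol show config | grep -Ei 'maxjobcount|maxstepcount'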

sbatch is used to submit a job script for later execution. The script typically contains one task or, if required, multiple srun commands to launch parallel tasks. See the sbatch and srun wiki page for more details.
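A minimal job script with two job steps could look like this (a sketch; resource values and program names are placeholders):

    #!/bin/bash
    #SBATCH --ntasks=4        # number of tasks available to the steps
    #SBATCH --time=00:10:00   # ten-minute time limit

    srun ./preprocess         # first job step
    srun ./compute            # second job step

Submit it with: sbatch myjob.sh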

To monitor your jobs and their progress, you can use the Slurm dashboard page or command-line tools such as squeue and scontrol.
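For example:

    squeue -u $USER             # show your own pending and running jobs
    scontrol show job <jobid>   # detailed information for one job (<jobid> from squeue)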