Slurm is a highly configurable open-source workload manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality needed by demanding HPC centers. More complex configurations rely upon a database for archiving accounting records, managing resource limits, and supporting sophisticated scheduling algorithms.
As depicted in the above picture, Slurm consists of a worker daemon (slurmd) running on each compute node and a central controller daemon (slurmctld) running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications.
This cluster also supports features (--constraint) and/or generic consumable resources (--gres).
The available features are:
The generic consumable resources are:
Keep in mind that for GPUs you need to load the module for the required CUDA version!
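For example, to run a command on a node with a particular feature and one GPU, with a CUDA module loaded first, you could do something like the sketch below; the feature name and the module name are placeholders, so check the lists above and the output of module avail for the actual values on this cluster:

```bash
# Load the CUDA version your code needs (module name/version is a placeholder)
module load nvidia/cuda-11.7

# Ask for a node with a specific feature and one GPU, then show the allocated GPU
# ("some_feature" is a placeholder for one of the available features)
srun --constraint=some_feature --gres=gpu:1 nvidia-smi
```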
The HPC/SLURM cluster contains multiple partitions:
| Partition name | Nodes | Details | Available to |
|----------------|-------|---------|--------------|
| main | all | | All |
| r930 | caserta | | All |
| r730 | ctit080 | 2x Tesla P100 | All |
| t630 | ctit081..83 | 2x Titan-X / 4x Titan-X / 4x 1080-Ti | All |
| gpu_p100 | ctit080 | 2x Tesla P100 | All |
| gpu_titan-x | ctit081..82 | 2x Titan-X / 4x Titan-X | All |
| gpu_1080-ti | ctit083 | 4x 1080-Ti | All |
| gpu_rtx-6000 | ctit084..85 | 4x Quadro RTX6000 | eemcs/dmb |
| gpu_a40 | ctit086,90,91 | 4x Tesla A40 | eemcs/ram/b3care or eemcs/aa |
| gpu_a100 | ctit089 | 4x Tesla A100 | eemcs/ram |
| debug | all | | admin |
| dmb | ctit084..085, 88 | | eemcs/dmb |
| ram | ctit089 | | |
| ram_b3care | ctit086 | | eemcs/ram-b3care |
| bdsi | ctit087 | | bms/bdsi |
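To check which partitions are available to your account, together with their nodes and current state, you can query Slurm directly, for example:

```bash
# One summary line per partition (availability, time limit, node counts)
sinfo -s

# Per-node details for a single partition (gpu_1080-ti is just an example from the table above)
sinfo -p gpu_1080-ti -N -l
```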
The main partition is the default partition and can be used to submit a job to any of the nodes.
To use only specific models of machines, you can use one of the machine-model partitions listed above (r930, r730, t630).
To use only the machines containing certain GPUs, you can use one of the gpu_* partitions (gpu_p100, gpu_titan-x, gpu_1080-ti, gpu_rtx-6000, gpu_a40, gpu_a100); see the example below.
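For example, the following submits a job script to one of the GPU partitions (my_job.sh and the GPU count are placeholders):

```bash
# Submit to the gpu_1080-ti partition and request one GPU for the job
sbatch --partition=gpu_1080-ti --gres=gpu:1 my_job.sh

# The same with the short option for the partition
sbatch -p gpu_1080-ti --gres=gpu:1 my_job.sh
```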
Access to the group-restricted partitions listed in the table above is limited during the first year of investment.
The debug partition is for testing purposes only.
Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job that can be scheduled. These numbers can be obtained with the scontrol show config command on korenvliet.
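A quick way to look these limits up is to filter the configuration output; MaxJobCount and MaxStepCount are the standard Slurm configuration parameters for these two limits:

```bash
# Show the scheduler limits relevant for submitting jobs
scontrol show config | grep -E 'MaxJobCount|MaxStepCount'
```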
sbatch is used to submit a job script for later execution. The script will typically contain a single task or, if required, multiple srun commands to launch parallel tasks. See the sbatch and srun wiki page for more details.
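A minimal job script could look like the sketch below; the job name, partition, time limit, and the program being launched are placeholders to adapt to your own job:

```bash
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --partition=main          # default partition; replace if you need a specific one
#SBATCH --ntasks=1                # a single task
#SBATCH --time=00:10:00           # wall-clock time limit
#SBATCH --output=slurm-%j.out     # output file, %j expands to the job id

# Launch the task; ./my_program is a placeholder for your own executable
srun ./my_program
```

Submit it with sbatch followed by the script name; sbatch prints the job id on success.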
To monitor your jobs and their progress you can use the Slurm dashboard page or the available command line tools such as squeue and scontrol.
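For example:

```bash
# Show your own pending and running jobs
squeue -u $USER

# Show detailed information about a single job (12345 is a placeholder job id)
scontrol show job 12345
```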