===== Slurm HPC Scheduler =====

Slurm is a highly configurable open-source workload manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality required by demanding HPC centers. More complex configurations rely on a database for archiving accounting records, managing resource limits, and supporting sophisticated scheduling algorithms.

==== Architecture ====

{{ :wiki:hpc:slurm_arch.gif?nolink&800 |}}

As depicted in the picture above, Slurm consists of a worker daemon (slurmd) running on each compute node and a central controller daemon (slurmctld) running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications.

==== Features & Generic Consumable Resources ====

This cluster also supports features (--constraint) and/or generic consumable resources (--gres).

The available features are:
  * **amd**/**intel**, refers to the manufacturer of the node's CPUs
  * **avx**/**avx2**, refers to the AVX and AVX2 instruction sets, available on the newer nodes and required by some software (for example **Keras** and **Tensorflow** v1.6 and above)
  * **tesla**, refers to the Tesla family of cards (ctit080)
  * **geforce**, refers to the GeForce family of cards (ctit[081-083])
  * **quadro**, refers to the Quadro family of cards (ctit[084-085])
  * **p100**, refers to the specific Tesla P100 model (ctit080)
  * **titan-x**, refers to the specific Titan-X model (ctit[081-082])
  * **gtx-1080ti**, refers to the specific GeForce GTX 1080 Ti model (ctit083)
  * **rtx-6000**, refers to the specific Quadro RTX 6000 model (ctit[084-085])

The generic consumable resources are:
  * **gpu[:pascal/turing/ampere][:amount]** (currently only Pascal-, Turing-, and Ampere-based GPUs are available)

[[wiki:software:start#optional_software|Keep in mind that for GPU jobs you need to load the module of the required CUDA version!]]

==== Partitions ====

The HPC/SLURM cluster contains multiple partitions:

| **Partition name** | **Nodes** | **Details** | **Available to** |
| main | all | | All |
| r930 | caserta | | All |
| r730 | ctit080 | 2x Tesla P100 | All |
| t630 | ctit081..83 | 2x Titan-X / 4x Titan-X / 4x 1080-Ti | All |
| gpu_p100 | ctit080 | 2x Tesla P100 | All |
| gpu_titan-x | ctit081..82 | 2x Titan-X / 4x Titan-X | All |
| gpu_1080-ti | ctit083 | 4x 1080-Ti | All |
| gpu_rtx-6000 | ctit084..85 | 4x Quadro RTX6000 | eemcs/dmb |
| gpu_a40 | ctit086,90,91 | 4x Tesla A40 | eemcs/ram/b3care or eemcs/aa |
| gpu_a100 | ctit089 | 4x Tesla A100 | eemcs/ram |
| debug | all | | admin |
| dmb | ctit084..085, 88 | | eemcs/dmb |
| ram | ctit089 | | eemcs/ram |
| ram_b3care | ctit086 | | eemcs/ram-b3care |
| bdsi | ctit087 | | bms/bdsi |

The **main** partition is the default partition and can be used to submit a job to any of the nodes.

To restrict a job to a specific machine model, use one of the following partitions:
  * **r930**, **r730**, **t630**, **t640**

To restrict a job to machines containing certain GPUs, use one of the following partitions:
  * **gpu_p100**, **gpu_1080-ti**, **gpu_titan-x**, **gpu_rtx-6000**, **gpu_a40**, **gpu_t4**, **gpu_a100**

Access to the following partitions is limited during the first year of the investment:
  * **ram_b3care**, until 2022-march-1
  * **bdsi**, until 2022-march-1
  * **aa**, until 2023-march-1
  * **dmb**, until 2023-march-1
  * **ram**, until 2023-march-1

The **debug** partition is for testing purposes only.
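
For illustration, the sketch below combines a partition, a feature constraint from the section above, and a GPU generic resource in a single interactive request. The chosen partition, feature, and GPU count are example values only and should be adjusted to the hardware your job actually needs:

<code bash>
# Sketch: start an interactive shell on a node in the main partition,
# restricted to avx2-capable hardware, with one GPU allocated via --gres.
# The GPU type (pascal/turing/ampere) is optional; "--gres=gpu:1" requests any available GPU.
srun --partition=main --constraint=avx2 --gres=gpu:1 --pty bash -i
</code>

The same //--partition//, //--constraint//, and //--gres// options can also be given as #SBATCH directives in a batch script.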
==== Submitting Jobs ====

Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job that can be scheduled. These numbers can be obtained with the //scontrol show config// command on **korenvliet**.

**sbatch** is used to submit a job script for later execution. The script typically contains one task or, if required, multiple **srun** commands to launch parallel tasks. See the **[[wiki:slurm:sbatch]]** and **[[wiki:slurm:srun]]** wiki pages for more details.

==== Monitoring Slurm ====

To monitor jobs and their progress, you can use the **[[http://korenvliet.ewi.utwente.nl/slurm/|slurm dashboard page]]** or command line tools such as **squeue** and **scontrol**.
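
As a concrete starting point for the submission workflow described above, a minimal batch script for **sbatch** might look like the sketch below. The job name, resource requests, module name, and script path are placeholders; check //module avail// for the CUDA modules actually installed on this cluster:

<code bash>
#!/bin/bash
#SBATCH --job-name=example-job        # placeholder job name
#SBATCH --partition=main              # default partition; use a gpu_* partition for GPU jobs
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1                  # request one GPU; omit for CPU-only jobs
#SBATCH --output=slurm-%j.out         # %j expands to the job ID

# GPU jobs need the module of the required CUDA version (see the software page);
# "cuda" here is a placeholder module name.
module load cuda

srun python my_script.py              # my_script.py is a placeholder for your own program
</code>

Submit the script with //sbatch jobscript.sh//; on success, sbatch prints the job ID.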
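
From the command line, roughly the same information as on the dashboard can be retrieved as follows (<jobid> stands for the ID printed by **sbatch**):

<code bash>
squeue -u $USER            # list your own pending and running jobs
scontrol show job <jobid>  # detailed state and resource information for one job
scancel <jobid>            # cancel a job that is no longer needed
</code>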