===== Slurm HPC Scheduler =====

Slurm is a highly configurable open-source workload manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality required by demanding HPC centers. More complex configurations rely on a database for archiving accounting records, managing resource limits, and supporting sophisticated scheduling algorithms.

==== Architecture ====

{{ :wiki:hpc:slurm_arch.gif?nolink&800 |}}

As depicted in the picture above, Slurm consists of a worker daemon (slurmd) running on each compute node and a central controller daemon (slurmctld) running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications.

==== Features & Generic Consumable Resources ====

This cluster also supports features (--constraint) and/or generic consumable resources (--gres).

The available features are:
  * **amd**/**intel**, refers to the manufacturer of the node's CPUs
  * **avx**/**avx2**, refers to the AVX and AVX2 instruction sets, available on the newer nodes and required by some software (for example **Keras** and **Tensorflow** v1.6 and above)
  * **tesla**, refers to the Tesla family of cards (ctit080)
  * **geforce**, refers to the GeForce family of cards (ctit[081-083])
  * **quadro**, refers to the Quadro family of cards (ctit[084-085])
  * **p100**, refers to the specific Tesla P100 model (ctit080)
  * **titan-x**, refers to the specific Titan-X model (ctit[081-082])
  * **gtx-1080ti**, refers to the specific GeForce GTX 1080 Ti model (ctit083)
  * **rtx-6000**, refers to the specific Quadro RTX 6000 model (ctit[084-085])

The generic consumable resources are:
  * **gpu[:pascal/turing/ampere][:amount]** (currently only Pascal-, Turing-, and Ampere-based GPUs are available)

[[wiki:software:start#optional_software|Keep in mind that for GPU jobs you need to load the module of the required CUDA version!]]

==== Partitions ====

The HPC/SLURM cluster contains multiple partitions:

| **Partition name** | **Nodes** | **Details** | **Available to** |
| main | all | | All |
| r930 | caserta | | All |
| r730 | ctit080 | 2x Tesla P100 | All |
| t630 | ctit081..83 | 2x Titan-X / 4x Titan-X / 4x 1080-Ti | All |
| gpu_p100 | ctit080 | 2x Tesla P100 | All |
| gpu_titan-x | ctit081..82 | 2x Titan-X / 4x Titan-X | All |
| gpu_1080-ti | ctit083 | 4x 1080-Ti | All |
| gpu_rtx-6000 | ctit084..85 | 4x Quadro RTX6000 | eemcs/dmb |
| gpu_a40 | ctit086,90,91 | 4x Tesla A40 | eemcs/ram/b3care or eemcs/aa |
| gpu_a100 | ctit089 | 4x Tesla A100 | eemcs/ram |
| debug | all | | admin |
| dmb | ctit084..085, 88 | | eemcs/dmb |
| ram | ctit089 | | eemcs/ram |
| ram_b3care | ctit086 | | eemcs/ram-b3care |
| bdsi | ctit087 | | bms/bdsi |

The **main** partition is the default partition and can be used to submit a job to any of the nodes.

To restrict a job to a specific machine model, use one of the following partitions:
  * **r930**, **r730**, **t630**, **t640**

To restrict a job to machines containing certain GPUs, use one of the following partitions:
  * **gpu_p100**, **gpu_1080-ti**, **gpu_titan-x**, **gpu_rtx-6000**, **gpu_a40**, **gpu_t4**, **gpu_a100**

Access to the following partitions is limited during the first year of the investment:
  * **ram_b3care**, until 2022-march-1
  * **bdsi**, until 2022-march-1
  * **aa**, until 2023-march-1
  * **dmb**, until 2023-march-1
  * **ram**, until 2023-march-1

The **debug** partition is for testing purposes only.
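
For illustration, the sketch below combines a partition, a feature constraint from the section above, and a GPU generic resource in a single interactive request. The chosen partition, feature, and GPU count are example values only and should be adjusted to the hardware your job actually needs:

<code bash>
# Sketch: start an interactive shell on a node in the main partition,
# restricted to avx2-capable hardware, with one GPU allocated via --gres.
# The GPU type (pascal/turing/ampere) is optional; "--gres=gpu:1" requests any available GPU.
srun --partition=main --constraint=avx2 --gres=gpu:1 --pty bash -i
</code>

The same //--partition//, //--constraint//, and //--gres// options can also be given as #SBATCH directives in a batch script.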
==== Submitting Jobs ====

Before submitting jobs, please note the maximum number of jobs and the maximum number of job steps per job that can be scheduled. These numbers can be obtained with the //scontrol show config// command on **korenvliet**.

**sbatch** is used to submit a job script for later execution. The script typically contains one task or, if required, multiple **srun** commands to launch parallel tasks. See the **[[wiki:slurm:sbatch]]** and **[[wiki:slurm:srun]]** wiki pages for more details.

==== Monitoring Slurm ====

To monitor jobs and their progress, you can use the **[[http://korenvliet.ewi.utwente.nl/slurm/|slurm dashboard page]]** or command line tools such as **squeue** and **scontrol**.
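
As a concrete starting point for the submission workflow described above, a minimal batch script for **sbatch** might look like the sketch below. The job name, resource requests, module name, and script path are placeholders; check //module avail// for the CUDA modules actually installed on this cluster:

<code bash>
#!/bin/bash
#SBATCH --job-name=example-job        # placeholder job name
#SBATCH --partition=main              # default partition; use a gpu_* partition for GPU jobs
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1                  # request one GPU; omit for CPU-only jobs
#SBATCH --output=slurm-%j.out         # %j expands to the job ID

# GPU jobs need the module of the required CUDA version (see the software page);
# "cuda" here is a placeholder module name.
module load cuda

srun python my_script.py              # my_script.py is a placeholder for your own program
</code>

Submit the script with //sbatch jobscript.sh//; on success, sbatch prints the job ID.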
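
From the command line, roughly the same information as on the dashboard can be retrieved as follows (<jobid> stands for the ID printed by **sbatch**):

<code bash>
squeue -u $USER            # list your own pending and running jobs
scontrol show job <jobid>  # detailed state and resource information for one job
scancel <jobid>            # cancel a job that is no longer needed
</code>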