====== EEMCS-HPC Specific resources (gpu), features and partitions ======

===== Generic Resources =====

The generic consumable resources on the HPC cluster are:
  * **gpu[:family]:amount**, where the family is optional and, if given, is one of the family names listed below.
  * others are not supported yet.

To request a GPU, add the following line to your sbatch script:
<code bash>
#SBATCH --gres=gpu:1 
</code>

To select a specific GPU family, add one of the following family names to the gpu request:
  * pascal
  * turing
  * ampere
  * lovelace
  * blackwell

For example, to request a GPU of the lovelace family, change the line in your sbatch script to:
<code bash>
#SBATCH --gres=gpu:lovelace:1 
</code>
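
As a further illustration of the **gpu[:family]:amount** syntax, the sketch below asks for two GPUs of one family. The family and the amount are arbitrary picks for the example; whether a single node actually offers two GPUs of that family is an assumption to check against the hardware page.
<code bash>
# Two GPUs of the ampere family (family and amount chosen only as an example)
#SBATCH --gres=gpu:ampere:2
</code>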


[[:eemcs-hpc:software#optional_software|Keep in mind that to use GPUs you need to load the module of the required CUDA version!]]

Once you request one or more GPUs, the scheduler sets the environment variable **CUDA_VISIBLE_DEVICES** for you.

It points to the GPU(s) assigned to your job. Use only those GPUs and no others!
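
A minimal job script along these lines might look as follows. It is only a sketch: the job name and the module name are hypothetical placeholders, so look up the real CUDA module name on the software page or with ''module avail''.
<code bash>
#!/bin/bash
#SBATCH --job-name=gpu-check      # hypothetical job name
#SBATCH --gres=gpu:1              # one GPU of any family

# Hypothetical module name; list the real ones with "module avail".
module load nvidia/cuda-12.4

# The scheduler sets this variable to the GPU(s) assigned to this job.
echo "Assigned GPU(s): ${CUDA_VISIBLE_DEVICES}"
</code>

CUDA applications read **CUDA_VISIBLE_DEVICES** on their own, so a program started from this script normally needs no extra device selection.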

==== nvlink ====
Some GPU boards are fitted with NVLink modules, which let two linked GPUs work together so that a job can use the memory and computing power of both.
If you request two GPUs with NVLink, you need to force socket binding using the following option:
<code bash>
#SBATCH --sockets-per-node=1 
</code>
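
Put together, a two-GPU request could be sketched as below. It assumes that the two GPUs handed out are in fact an NVLink pair, which depends on the node; the ''nvidia-smi topo -m'' call is added here only as a way to check this from inside the job.
<code bash>
#SBATCH --gres=gpu:2              # two GPUs, assumed to be an NVLink pair
#SBATCH --sockets-per-node=1      # socket binding, as described above

# Print the GPU interconnect matrix; NVLink connections appear as NV# entries.
nvidia-smi topo -m
</code>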

===== features (constraint) =====

The available features are:
  * **amd**/**intel**, refers to the CPU manufacturer
  * **avx**/**avx2**/**avx512**, refers to the AVX, AVX2 and AVX-512 instruction sets, available on the newer nodes (required by, for example, **Keras** and **Tensorflow** v1.6 and above)
  * **tesla**, refers to the Tesla family cards (ctit080)
  * **geforce**, refers to the GeForce family cards (ctit[081-083])
  * **quadro**, refers to the Quadro family cards (ctit[084-085])
  * **p100**, refers to the specific Tesla P100 model (ctit080)
  * **titan-x**, refers to the specific Titan-X model (ctit[081-082])
  * **gtx-1080ti**, refers to the specific GeForce gtx-1080ti model (ctit083)
  * **rtx-6000**, refers to the specific Quadro rtx-6000 model (ctit[084-085])
  * **rtx-2080ti**, refers to the specific GeForce rtx-2080ti model (ctit088)
  * **titan-xp**, refers to the specific Titan-Xp model (ctit088)
  * **a40**, refers to the specific Tesla A40 model (ctit[086,090-094])
  * **a100**, refers to the specific Tesla A100 model (ctit089)
  * **l40**, refers to the specific Tesla L40 model (hpc-node[01-12,14])
  * **l40s**, refers to the specific Tesla L40s model (hpc-node[16-17])
  * **rtx6000pro**, refers to the specific Rtx 6000 Pro model (hpc-node15)
For example, to allow only A40 GPUs:
<code bash>
#SBATCH --constraint=a40 
</code>

Hint: use the ampersand (&, AND) and pipe (|, OR) symbols to combine features.
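
Two sketches of combined constraints, using feature names from the list above. The combinations themselves are picked only for illustration, and a job carries a single constraint expression, so these are alternatives rather than lines for one script. The quotes keep the symbols from being interpreted by a shell when the same option is given on the command line.
<code bash>
# AND: an Intel node that also supports AVX2
#SBATCH --constraint="intel&avx2"
</code>
<code bash>
# OR: a node with either A40 or L40 GPUs, plus a request for one GPU
#SBATCH --constraint="a40|l40"
#SBATCH --gres=gpu:1
</code>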

===== partitions =====

The HPC/SLURM cluster contains multiple common partitions:
| **Partition name** | **Nodes** | **Details** | **Available to** |
| main | ctit[080-094],caserta,hpc-node[01-07] | | All |
| debug | ctit[080-094],caserta,hpc-node[01-12,14-19] | | admin |

As well as multiple additional partitions:
| **Partition name** | **Nodes** | **Details** | **Available to** |
| am | hpc-node[01-04] | | eemcs-(dmmp/macs/mast/mia/mms/sor/stat) |
| bdsi | ctit087 | | bms-bdsi |
| bmpi | hpc-node[15-16] | gpu | tnw-bmpi |
| bss | hpc-node15 | cpu | eemcs-bss |
| dmb | ctit[084-085,092],hpc-node[07,09] | | eemcs-dmb |
| mia | ctit[090-091,093-094],hpc-node05 | | eemcs-mia |
| mia-pof | hpc-node06 | | eemcs-mia & tnw-pof |
| tfe | hpc-node[16-19] | cpu | et-(efd/msm/te/tcs/htt/gm/cmmm) |
| ps | hpc-node[11-12,14] | | eemcs-ps |
| ram | ctit[086,089] | | eemcs-ram |
| students | hpc-node08 | | eemcs-students |

The **main** partition is the default partition; a job submitted without a partition option can run on any of its nodes.
The **debug** partition is for testing purposes only.

Access to the additional partitions is limited to their funders during the first year of investment; funders reach these nodes through their own partition.

A job can also list multiple partitions.
For example:
  * EEMCS-DMB members
<code bash>
#SBATCH --partition=main,dmb
</code>
  * EEMCS-MIA members
<code bash>
#SBATCH --partition=main,am,mia
</code>
  * EEMCS-Students
<code bash>
#SBATCH --partition=main,students
</code>
  * etc.

See the **[[eemcs-hpc:hardware|EEMCS-HPC Hardware page]]** for all the partition definitions.
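
The partitions, features and generic resources can also be queried on the cluster itself with the standard Slurm tools. The commands below are a sketch of one way to do this; the node name is taken from the tables above purely as an example.
<code bash>
# Partition, node list, features and generic resources per group of nodes
sinfo -o "%P %N %f %G"

# Full record of a single node, including its features and gres
scontrol show node ctit086
</code>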