===== Useful admin tips for the EEMCS-HPC Cluster =====

In order to use the additional Slurm scripts, load the following module:

  module load slurm/utils

==== Users/Accounts ====

A User is a single person; an Account is a group of users, or even a group of groups.

=== Activating Users ===

  - Add the user to the NIS database. The user will be added to the default cluster account **ctit**.
  - Move the user to the correct cluster account (read: research or student group). For this you can use the following script:

  sacctmgr-move-user ctit

=== Adjusting the Quality of Service (QOS) ===

Every account already has a default QOS defined. In some cases a different QOS is required. To modify the quality of service, use the following commands:

  # default QOS
  sudo sacctmgr modify user where name= set DefaultQOS=
  # additional QOS (temporary)
  sudo sacctmgr modify user where name= set QOS+=
  # remove additional QOS (temporary)
  sudo sacctmgr modify user where name= set QOS-=

=== Creating new Accounts ===

To create a new account (read: research group):

  sudo sacctmgr create account name= parent= fairshare=1

The fair share factor is the amount of investment in K€.

=== Dumping the user database ===

In order to recover or get an overview of all the activated accounts, a **dump configuration file** can be generated using the following command:

  sudo sacctmgr-dump

This will create a ctit_.cfg file containing all the accounts/users and their priority factor structure.

==== GPU monitoring ====

GPUs can be monitored using the tools supplied in the module nvidia/nvtop:

  module load nvidia/nvtop

To monitor the GPUs on a specific node, use one of the following commands:

  nvidia-smi-node
  nvtop-node

To show the GPUs assigned to a job, use the following command:

  scontrol show job -d

//Note: look for GRES=gpu(IDX:...)//

==== Utilization ====

Utilization reports can be generated using one of the following commands:

  sreport cluster AccountUtilization cluster=ctit start=1/1/21 end=12/31/21 > Utilisation_2021
  sreport cluster AccountUtilizationByUser cluster account= start=2020-03-25 end=2020-03-25
  sreport cluster AccountUtilizationByUser cluster user= start=2020-03-25 end=2020-03-25

==== Maintenance ====

=== Creating a maintenance reservation ===

To create a maintenance reservation, use the following command:

  scontrol create reservation starttime=2022-03-23T08:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL

=== Terminating running jobs ===

  squeue -ho %A -t R | xargs -n 1 scancel

=== Stopping the Slurm daemons ===

  scontrol shutdown

=== Undraining a node ===

  sudo scontrol update NodeName= State=DOWN Reason="undraining"
  sudo scontrol update NodeName= State=RESUME

=== System serial number ===

  sudo dmidecode -s system-serial-number

==== Powersaving ====

To keep power usage low, compute nodes that are not being used will power down after a certain amount of time. These settings are located in the //slurm.conf// file. See //SuspendTime// for the actual time. To disable this functionality, change the following line:

  #SuspendExcParts=debug

to:

  SuspendExcParts=debug
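
Before editing, it can help to see which power-saving settings are currently present in the configuration file, including commented-out ones. The following is a minimal sketch, not part of the cluster tooling: ''show_suspend_settings'' is a hypothetical helper name, and the default path ''/etc/slurm/slurm.conf'' is an assumption that may differ on this cluster.

```shell
# Hypothetical helper: list every Suspend* setting in a slurm.conf file,
# including lines that are commented out with a leading '#'.
# The default path is an assumption; pass the real location as the first argument.
show_suspend_settings() {
    conf="${1:-/etc/slurm/slurm.conf}"
    grep -E '^[[:space:]]*#?[[:space:]]*Suspend[A-Za-z]*=' "$conf"
}
```

Called as ''show_suspend_settings /path/to/slurm.conf'', it prints the matching lines unchanged. On a running controller, ''scontrol show config | grep -i suspend'' shows the values that Slurm has actually loaded, which is useful for confirming that a change has taken effect.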