===== Useful admin tips for the EEMCS-HPC Cluster =====
To use the additional Slurm scripts, load the following module:
module load slurm/utils
==== Users/Accounts ====
A user is a single person; an account is a group of users, or even a group of groups.
=== Activating Users ===
- Add the user to the NIS database; the user is then added to the default cluster account **ctit**.
- Move the user to the correct cluster account (that is, the research or student group).
For this you can use the following script:
sacctmgr-move-user ctit
=== Adjusting the Quality of Service (QOS) ===
Every account already has a default QOS defined.
In some cases a different QOS is required. To modify the QOS, use one of the following commands:
# default QOS
sudo sacctmgr modify user where name= set DefaultQOS=
# additional QOS (temporary)
sudo sacctmgr modify user where name= set QOS+=
# remove additional QOS (temporary)
sudo sacctmgr modify user where name= set QOS-=
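The three commands above differ only in the setting they change, so they can be wrapped in a small dry-run helper. This is a minimal sketch and not part of the slurm/utils module: the function name ''qos_cmd'', the user ''alice'' and the QOS ''gpu-long'' are hypothetical. It only prints the command, so it can be reviewed before it is run.

```shell
#!/bin/sh
# qos_cmd ACTION USER QOS  (hypothetical helper, prints the command only)
#   ACTION: default | add | remove
qos_cmd() {
    action=$1 user=$2 qos=$3
    case "$action" in
        default) setting="DefaultQOS=$qos" ;;
        add)     setting="QOS+=$qos" ;;
        remove)  setting="QOS-=$qos" ;;
        *) echo "usage: qos_cmd default|add|remove USER QOS" >&2; return 1 ;;
    esac
    echo "sudo sacctmgr modify user where name=$user set $setting"
}

# Example with a made-up user and QOS name:
qos_cmd add alice gpu-long
```

Afterwards the result can probably be checked with something like ''sacctmgr show assoc user=alice format=User,Account,DefaultQOS,QOS''.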
=== Creating new Accounts ===
To create a new account (that is, a research group), use the following command:
sudo sacctmgr create account name= parent= fairshare=1
The fairshare factor is the amount of investment in k€.
=== dump user database ===
To recover or get an overview of all activated accounts, a **dump configuration file** can be generated using the following command:
sudo sacctmgr-dump
This will create a ctit_.cfg file containing all the accounts/users and their priority factor structure.
==== GPU monitoring ====
GPUs can be monitored using the tools supplied in the module nvidia/nvtop:
module load nvidia/nvtop
To monitor the GPUs on a specific node, use one of the following commands:
nvidia-smi-node
nvtop-node
To show the GPUs assigned to a job, use the following command:
scontrol show job -d
//Note: look for GRES=gpu(IDX:...)//
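The note above can be turned into a small filter, so that the GRES field does not have to be found by eye. This is a sketch only: the function name ''job_gres'' is hypothetical and the sample line is made up; the exact layout of the ''scontrol show job -d'' output may differ between Slurm versions.

```shell
#!/bin/sh
# Print only the GRES=... fields from "scontrol show job -d" output on stdin.
job_gres() {
    grep -o 'GRES=[^ ]*'
}

# Made-up sample line, for illustration only; on the cluster use:
#   scontrol show job -d JOBID | job_gres
echo '     Nodes=node01 CPU_IDs=0-3 Mem=8192 GRES=gpu(IDX:0-1)' | job_gres
```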
==== Utilization ====
Utilization reports can be generated using one of the following commands:
sreport cluster AccountUtilization cluster=ctit start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster account= start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster user= start=2020-03-25 end=2020-03-25
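For yearly reports the date range can be generated instead of typed. The following is a minimal sketch with a hypothetical function name, ''year_report_cmd''. It assumes that a date without a time is read as midnight at the start of that day, which is why the end date is 1 January of the following year. The ''-t hours'' option asks sreport to show hours instead of the default minutes. The command is only printed, not run.

```shell
#!/bin/sh
# Print an sreport command covering one full calendar year.
# The end date is 1 January of the following year, on the assumption that
# a date without a time means midnight at the start of that day.
year_report_cmd() {
    year=$1
    next=$((year + 1))
    echo "sreport -t hours cluster AccountUtilizationByUser cluster=ctit start=${year}-01-01 end=${next}-01-01"
}

year_report_cmd 2021
```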
==== Maintenance ====
=== create maintenance reservation ===
To create a maintenance reservation, use the following command:
scontrol create reservation starttime=2022-03-23T08:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL
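Giving the reservation an explicit name makes it easier to inspect and remove later. The sketch below prints the three commands involved without running them. The function name ''maint_cmds'' and the reservation name ''maint_test'' are hypothetical; the remaining options are taken from the command above.

```shell
#!/bin/sh
# Print the commands for a named maintenance reservation (dry run).
# NAME is a hypothetical reservation name; START uses YYYY-MM-DDTHH:MM:SS.
maint_cmds() {
    name=$1 start=$2 minutes=$3
    echo "scontrol create reservation reservationname=$name starttime=$start duration=$minutes user=root flags=maint,ignore_jobs nodes=ALL"
    echo "scontrol show reservation $name"
    echo "scontrol delete reservationname=$name"
}

maint_cmds maint_test 2022-03-23T08:00:00 480
```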
=== terminate running jobs ===
squeue -ho %A -t R | xargs -n 1 scancel
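Because this cancels every running job, it can help to preview the commands first. The following sketch, with a hypothetical function name ''preview_cancel'', prints one ''scancel'' line per job ID instead of executing it. It assumes GNU xargs for the ''-r'' option, and the job IDs in the example are made up.

```shell
#!/bin/sh
# Turn a list of job IDs on stdin into scancel commands without running them.
# -r (GNU xargs) skips the command entirely when the input is empty.
preview_cancel() {
    xargs -r -n 1 echo scancel
}

# Made-up job IDs; on the cluster use: squeue -ho %A -t R | preview_cancel
printf '101\n102\n' | preview_cancel
```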
=== stopping slurm daemons ===
scontrol shutdown
=== undraining a node ===
sudo scontrol update NodeName= State=DOWN Reason="undraining"
sudo scontrol update NodeName= State=RESUME
=== system serial number ===
sudo dmidecode -s system-serial-number
==== Powersaving ====
To keep power usage low, compute nodes that are not in use power down after a certain amount of time.
These settings are located in the //slurm.conf// file; see //SuspendTime// for the actual idle time.
To disable this functionality for a partition (here //debug//), change the following line:
#SuspendExcParts=debug
to:
SuspendExcParts=debug
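The edit can also be scripted. The sketch below uncomments the line with ''sed'' and runs on a temporary sample file only. The function name ''enable_suspend_exclude'' and the ''SuspendTime=600'' value are made up for illustration, and the real location of //slurm.conf// depends on the installation.

```shell
#!/bin/sh
# Print the given config file with the SuspendExcParts line uncommented.
# The input file itself is left unchanged.
enable_suspend_exclude() {
    sed 's/^#SuspendExcParts=/SuspendExcParts=/' "$1"
}

# Demonstration on a temporary sample file with made-up values.
sample=$(mktemp)
printf 'SuspendTime=600\n#SuspendExcParts=debug\n' > "$sample"
enable_suspend_exclude "$sample"
rm -f "$sample"
```

After changing the real file, the daemons have to re-read the configuration, for example with ''scontrol reconfigure''. The active values can be checked with ''scontrol show config | grep -i suspend''.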