===== Useful admin tips for the EEMCS-HPC Cluster =====
==== Users/Accounts ====
A **User** is a single person; an **Account** is a group of users, or even a group of groups.
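The resulting hierarchy of accounts and users can be inspected at any time with the association tree (read-only, no changes are made):
sacctmgr show associations tree format=account,user,fairshare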
=== Activating Users ===
* (hpc-storage3) Add the user to the NIS database; the user will then be added to the default cluster account **ctit**.
To use the additional Slurm helper scripts, load the following module:
module load slurm/utils
* (hpc-head1/2) Move the user to the correct cluster account (i.e. the research or student group).
For this you can use the following script (see the sketch after this list for the equivalent plain sacctmgr commands):
sacctmgr-move-user
or
sacctmgr-move-user ctit
* (ad.utwente.nl) Add the user to the mailing list: EEMCS-Hpc-Cluster-Users (EEMCS-IDS-Cluster-Users)
* Send the requesting user an explanation of how to get started.
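The local sacctmgr-move-user wrapper hides the underlying bookkeeping; for reference, a rough equivalent with plain sacctmgr could look like the sketch below, assuming a hypothetical user //jdoe// who moves from the default **ctit** account into a hypothetical research-group account //dmb//:
# add the user to the target account and make it the default
sudo sacctmgr add user jdoe account=dmb
sudo sacctmgr modify user where name=jdoe set DefaultAccount=dmb
# remove the old association with the default ctit account
sudo sacctmgr remove user name=jdoe account=ctit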
=== Adjusting the Quality of Service (QOS) ===
Every account already has a default QOS defined.
In some cases a different QOS is required; to modify the Quality of Service, use one of the following commands:
# default QOS
sudo sacctmgr modify user where name= set DefaultQOS=
# additional QOS (temporary)
sudo sacctmgr modify user where name= set QOS+=
# remove additional QOS (temporary)
sudo sacctmgr modify user where name= set QOS-=
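A filled-in example, with a hypothetical user //jdoe// and a hypothetical QOS named //gpu-long//, followed by a check of the result:
sudo sacctmgr modify user where name=jdoe set QOS+=gpu-long
sacctmgr show user name=jdoe withassoc format=user,account,defaultqos,qos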
=== Creating new Accounts ===
To create a new account (i.e. a research group):
sudo sacctmgr create account name= parent= fairshare=1
The fairshare factor is the amount of investment in k€.
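For example, creating a hypothetical research-group account //dmb// under the top-level //ctit// account for a 50 k€ investment, and verifying it afterwards:
sudo sacctmgr create account name=dmb parent=ctit fairshare=50
sacctmgr show account name=dmb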
=== dump user database ===
To recover or to get an overview of all activated accounts, a **dump configuration file** can be generated using the following command:
sudo sacctmgr-dump
This will create a ctit_.cfg file containing all the accounts/users and their priority factor structure.
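The sacctmgr-dump wrapper presumably builds on the standard dump function of sacctmgr; the plain command, with a hypothetical file name, would be:
sudo sacctmgr dump ctit file=ctit_backup.cfg
Such a file can later be read back with the sacctmgr load function.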
==== Cleanup /local ====
If the **/local** folder on a node fills up, first check which jobs are running on that specific node:
squeue --nodelist=
Then remove all folders within /local that do not belong to any of the running jobs.
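A minimal sketch, assuming a hypothetical node //node01//: list the jobs and their owners on the node, then inspect /local on that node and compare the folder owners against the job owners before deleting anything:
squeue --nodelist=node01 --format="%A %u %j"
ssh node01 'ls -lh /local'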
==== GPU monitoring ====
GPUs can be monitored using the tools supplied by the nvidia/nvtop module:
module load nvidia/nvtop
To monitor the GPUs on a specific node, use one of the following commands:
nvidia-smi-node
nvtop-node
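These are local wrapper scripts; they presumably run the corresponding tool on the given node, roughly equivalent to (with a hypothetical node name):
ssh node01 nvidia-smi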
To show the GPU(s) assigned to a job, use the following command:
scontrol show job -d
//Note: look for GRES=gpu(IDX:...)//
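A filled-in example for a hypothetical job ID 123456, filtering for the GRES line:
scontrol show job -d 123456 | grep -i gres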
To show the history of a job, use the following command:
sacct --format=User,Account,Partition,State%25,ExitCode,AllocTRES%60 -j
//Note: more fields can be shown, see sacct --helpformat//
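The same hypothetical job ID filled in:
sacct --format=User,Account,Partition,State%25,ExitCode,AllocTRES%60 -j 123456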
==== Utilization ====
CPU utilization reports can be generated using one of the following commands:
sreport cluster AccountUtilization cluster=ctit -P start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit account= start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit user= start=2020-03-25 end=2020-03-25
GPU utilization reports can be generated using one of the following commands:
sreport cluster AccountUtilization cluster=ctit -P --tres="gres/gpu" start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" account= start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" user= start=2020-03-25 end=2020-03-25
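A filled-in example, with a hypothetical account //dmb// and user //jdoe//, and the output reported in hours instead of the default minutes:
sreport cluster AccountUtilizationByUser cluster=ctit account=dmb start=2023-01-01 end=2023-12-31 -t hours
sreport cluster AccountUtilizationByUser cluster=ctit user=jdoe --tres="gres/gpu" start=2023-01-01 end=2023-12-31 -t hours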
==== Maintenance ====
=== create maintenance reservation ===
To create a maintenance reservation, use the following command:
scontrol create reservation starttime=2022-03-23T08:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL
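Existing reservations can be listed, and removed again once the maintenance is finished, with:
scontrol show reservation
sudo scontrol delete ReservationName=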
=== terminate running jobs ===
squeue -ho %A -t R | xargs -n 1 scancel
=== stopping slurm daemons ===
scontrol shutdown
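Assuming a systemd-based installation, the daemons can be started again after the maintenance with:
# on the head node(s)
sudo systemctl start slurmctld
# on each compute node
sudo systemctl start slurmd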
=== undraining a node ===
sudo scontrol update NodeName= State=DOWN Reason="undraining"
sudo scontrol update NodeName= State=RESUME
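Which nodes are currently drained or down, and for what reason, can be listed with:
sinfo -R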
=== system serial number ===
sudo dmidecode -s system-serial-number
==== Powersaving ====
To reduce power usage, compute nodes that are not in use will power down after a certain amount of time.
These settings are located in the //slurm.conf// file; see the //SuspendTime// parameter for the actual idle time.
To disable this functionality for a partition (in this case //debug//), change the following line:
#SuspendExcParts=debug
to:
SuspendExcParts=debug
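The relevant part of //slurm.conf// could then look roughly like the sketch below; the values and script paths are illustrative, check the actual file:
# seconds of idle time before a node is powered down
SuspendTime=3600
# site-specific scripts that actually power nodes down and up (paths are assumptions)
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
# partitions excluded from power saving
SuspendExcParts=debug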
To temporarily power up a node, execute the following command:
sudo scontrol update node= state=POWER_UP
or power it down with:
sudo scontrol update node= state=POWER_DOWN
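Whether a node is currently powered down can be seen with sinfo; powered-down nodes have a ~ appended to their state (for example //idle~//):
sinfo -N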