Useful admin tips for the EEMCS-HPC Cluster
Users/Accounts
A User is a single person; an Account is a group of users, or even a group of groups.
Activating Users
- (hpc-storage3) Add the user to the NIS database; the user will be added to the default cluster account ctit
In order to use additional slurm scripts, load the following module :
module load slurm/utils
- (hpc-head1/2) move the user to the correct cluster account (read: research or student group)
For this you can use the following script :
sacctmgr-move-user <groupname> <username>
or
sacctmgr-move-user ctit <groupname> <username>
- (ad.utwente.nl) add them to the mailing list : EEMCS-Hpc-Cluster-Users (EEMCS-IDS-Cluster-Users)
- send the requesting user an explanation of how to get started.
Adjust the Quality of Service (QOS)
Every account will already have a default QOS defined.
In some cases a different QOS is required. To modify the Quality of Service, use the following commands :
# default QOS
sudo sacctmgr modify user where name=<username> set DefaultQOS=<defaultQOSname>
# additional QOS (temporary)
sudo sacctmgr modify user where name=<username> set QOS+=<additionalQOSname>
# remove additional QOS (temporary)
sudo sacctmgr modify user where name=<username> set QOS-=<additionalQOSname>
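To verify the result, the user's account and QOS settings can be listed afterwards (the field selection below is just one possible choice) :
sacctmgr show user <username> withassoc format=User,Account,DefaultQOS,QOS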
Creating new Accounts
Creating new accounts (read: research groups) :
sudo sacctmgr create account name=<groupname> parent=<faculty> fairshare=1
The fair share factor is the amount of investment in K€ !
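For example (hypothetical group and faculty names), a research group dacs under the eemcs faculty account that invested 25 K€ would be created with :
sudo sacctmgr create account name=dacs parent=eemcs fairshare=25
The resulting share tree can then be inspected with :
sshare -a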
dump user database
In order to recover or get an overview of all the activated accounts, a dump configuration file can be generated using the following command :
sudo sacctmgr-dump
This will create a ctit_<date>.cfg file containing all the accounts/users and their priority factor structure.
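Should the account structure ever need to be restored, such a dump can in principle be loaded back with sacctmgr; test this carefully before relying on it for recovery :
sudo sacctmgr load ctit_<date>.cfg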
Cleanup /local
If the /local folder gets clogged, check what jobs are running on that specific node :
squeue --nodelist=<node-name>
Remove all unrelated folders within the /local folder.
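A minimal sketch (assuming /local contains per-user scratch folders and that it is run on the node itself) to list folders whose owners currently have no running jobs there; review the output before deleting anything :
squeue --nodelist=$(hostname -s) -ho %u | sort -u > /tmp/active_users
for d in /local/*; do
  owner=$(stat -c %U "$d")
  grep -qx "$owner" /tmp/active_users || echo "candidate for removal: $d (owner $owner)"
done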
GPU monitoring
GPUs can be monitored using the tools supplied in the module nvidia/nvtop :
module load nvidia/nvtop
To monitor the GPUs on a specific node, use one of the following commands :
nvidia-smi-node <nodename>
nvtop-node <nodename>
To show the GPU(s) assigned to a job, use the following command :
scontrol show job <jobid> -d
Note : look for GRES=gpu(IDX:…)
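For illustration (values are made up), the per-node allocation lines in that output look roughly like :
Nodes=<node_name> CPU_IDs=0-3 Mem=16384 GRES=gpu(IDX:0)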
To show the history of a job, use the following command :
sacct --format=User,Account,Partition,State%25,ExitCode,AllocTRES%60 -j <jobid>
Note : more parameters can be monitored, see : sacct --helpformat
Utilization
CPU utilization reports can be generated using one of the following commands :
sreport cluster AccountUtilization cluster=ctit -P start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit account=<account_name> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit user=<user_name> start=2020-03-25 end=2020-03-25
GPU utilization reports can be generated using one of the following commands :
sreport cluster AccountUtilization cluster=ctit -P --tres="gres/gpu" start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" account=<account_name> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" user=<user_name> start=2020-03-25 end=2020-03-25
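By default sreport reports time in minutes; the -t option switches the unit, for example to hours :
sreport -t Hours cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" account=<account_name> start=2020-03-25 end=2020-03-25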
Maintenance
create maintenance reservation
To create a maintenance reservation use the following command :
scontrol create reservation starttime=2022-03-23T8:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL
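Existing reservations can be listed, and removed again after the maintenance, with :
scontrol show reservation
sudo scontrol delete ReservationName=<reservation_name>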
terminate running jobs
squeue -ho %A -t R | xargs -n 1 scancel
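The same pattern can be limited to a single node if only part of the cluster is affected :
squeue -ho %A -t R --nodelist=<node_name> | xargs -n 1 scancel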
stopping slurm daemons
scontrol shutdown
undraining a node
sudo scontrol update NodeName=<node_name> State=DOWN Reason="undraining"
sudo scontrol update NodeName=<node_name> State=RESUME
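To see why a node was drained or marked down in the first place, the stored reason can be listed with :
sinfo -R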
system serial number
sudo dmidecode -s system-serial-number
Powersaving
To keep the power usage at a lower level, compute nodes that are not in use will power down after a certain amount of time. These definitions are located in the slurm.conf file; see the SuspendTime parameter for the actual time. To disable this functionality, change the following line:
#SuspendExcParts=debug
to:
SuspendExcParts=debug
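Changes to slurm.conf only take effect once the Slurm daemons re-read the configuration, for example via :
sudo scontrol reconfigure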
To power up a node temporarily, execute the following command:
sudo scontrol update node=<nodelist> state=POWER_UP
or power it down with :
sudo scontrol update node=<nodelist> state=POWER_DOWN
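In recent Slurm versions, powered-down nodes are shown in sinfo with a ~ suffix on their state (e.g. idle~), so the current power state can be checked with :
sinfo -o "%N %T"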