A user is a single person; an account is a group of users, or even a group of groups.
To use the additional Slurm scripts, load the following module:
module load slurm/utils
To move a user to a different account, you can use the following script:
sacctmgr-move-user <groupname> <username>
or
sacctmgr-move-user ctit <groupname> <username>
Every account already has a default QOS defined.
In some cases a different QOS is required. To modify the quality of service, use one of the following commands:
# default QOS
sudo sacctmgr modify user where name=<username> set DefaultQOS=<defaultQOSname>
# additional QOS (temporary)
sudo sacctmgr modify user where name=<username> set QOS+=<additionalQOSname>
# remove additional QOS (temporary)
sudo sacctmgr modify user where name=<username> set QOS-=<additionalQOSname>
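The three QOS commands differ only in the `set` clause, so a small wrapper can reduce typing mistakes. The sketch below is an illustration only: `qos_cmd` is a hypothetical helper, not part of the slurm/utils module, and it only prints the command so it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# qos_cmd (hypothetical helper): print the sacctmgr command for a QOS change.
# Nothing is executed; copy the printed line once it looks right.
qos_cmd() {
    local action=${1:-} user=${2:-} qos=${3:-}
    if [ -z "$user" ] || [ -z "$qos" ]; then
        echo "usage: qos_cmd default|add|remove <username> <qosname>" >&2
        return 2
    fi
    case "$action" in
        default) echo "sudo sacctmgr modify user where name=$user set DefaultQOS=$qos" ;;
        add)     echo "sudo sacctmgr modify user where name=$user set QOS+=$qos" ;;
        remove)  echo "sudo sacctmgr modify user where name=$user set QOS-=$qos" ;;
        *)       echo "unknown action: $action" >&2; return 2 ;;
    esac
}
```

For example, `qos_cmd add <username> <additionalQOSname>` prints the "additional QOS" command, and `qos_cmd remove <username> <additionalQOSname>` prints the matching command to withdraw it again.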
Creating new accounts (read: research groups):
sudo sacctmgr create account name=<groupname> parent=<faculty> fairshare=1
The fair-share factor is the amount of investment in k€.
To recover or get an overview of all activated accounts, a dump configuration file can be generated using the following command:
sudo sacctmgr-dump
This will create a ctit_<date>.cfg file containing all the accounts/users and their priority factor structure.
If the /local folder gets clogged, check which jobs are running on that specific node:
squeue --nodelist=<node-name>
Remove all folders within the /local folder that are unrelated to those jobs.
GPUs can be monitored using the tools supplied in the module nvidia/nvtop:
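Finding the unrelated folders by hand is error-prone, so the check can be sketched as a script. This is a sketch under a loud assumption: it treats a folder as "related" when its owner still has a job on the node, which may not match how /local is laid out on this cluster. `stale_dirs` is a hypothetical helper and it only lists candidates; it deletes nothing.

```shell
#!/usr/bin/env bash
# stale_dirs (hypothetical helper): list sub-folders of a directory whose owner
# is NOT in the given list of users. Assumption: ownership identifies the job.
# usage: stale_dirs <directory> <active-user>...
stale_dirs() {
    local dir=${1:-}
    [ -d "$dir" ] || { echo "usage: stale_dirs <directory> <active-user>..." >&2; return 2; }
    shift
    local active=" $* "          # space-padded so whole names can be matched
    local entry owner
    for entry in "$dir"/*/; do
        [ -d "$entry" ] || continue                      # empty directory: glob did not match
        owner=$(ls -ld -- "$entry" | awk '{print $3}')   # third column of ls -l is the owner
        case "$active" in
            *" $owner "*) ;;                             # owner still has a job: keep
            *) printf '%s\n' "${entry%/}" ;;             # candidate for removal
        esac
    done
}
```

On the node itself, the list of active users could come from squeue, for example `stale_dirs /local $(squeue -h -o %u --nodelist=<node-name> | sort -u)`. Review the printed folders before removing anything.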
module load nvidia/nvtop
To monitor the GPUs on a specific node, use one of the following commands:
nvidia-smi-node <nodename>
nvtop-node <nodename>
To show the GPUs assigned to a job, use the following command:
scontrol show job <jobid> -d
Note: look for GRES=gpu(IDX:…)
To show the history of a job, use the following command:
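Because the detailed job output is long, the GPU index can also be filtered out with standard tools. The sketch below assumes the field looks like the `gpu(IDX:…)` pattern mentioned in the note, optionally with a GPU type after `gpu`; `gpu_idx` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpu_idx (hypothetical helper): read `scontrol show job <jobid> -d` output on
# stdin and print the index part of every gpu...(IDX:...) field, one per line.
gpu_idx() {
    grep -o 'gpu[^ ]*(IDX:[^)]*)' | sed 's/.*(IDX:\(.*\))/\1/'
}
```

Usage would be `scontrol show job <jobid> -d | gpu_idx`. If nothing is printed, the job has no GPU index in its detailed output, or the output format differs from the assumption above.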
sacct --format=User,Account,Partition,State%25,ExitCode,AllocTRES%60 -j <jobid>
Note: more parameters can be monitored, see: sacct --helpformat
CPU utilization reports can be generated using one of the following commands:
sreport cluster AccountUtilization cluster=ctit -P start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit account=<account_name> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit user=<user_name> start=2020-03-25 end=2020-03-25
GPU utilization reports can be generated using one of the following commands:
sreport cluster AccountUtilization cluster=ctit -P --tres="gres/gpu" start=1/1/21 end=12/31/21 > Utilisation_2021
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" account=<account_name> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster=ctit --tres="gres/gpu" user=<user_name> start=2020-03-25 end=2020-03-25
To create a maintenance reservation, use the following command:
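The yearly account reports follow a fixed pattern, so the command line for another year can be generated rather than edited by hand. The sketch below uses a hypothetical helper, `util_report_cmd`, which prints the command for a full calendar year in the same form as the examples above; an optional second argument adds the `--tres` filter. It does not run sreport.

```shell
#!/usr/bin/env bash
# util_report_cmd (hypothetical helper): print the yearly AccountUtilization
# command for a four-digit year, optionally restricted to one TRES.
# usage: util_report_cmd <yyyy> [tres]
util_report_cmd() {
    local year=${1:-} tres=${2:-}
    case "$year" in
        [0-9][0-9][0-9][0-9]) ;;
        *) echo "usage: util_report_cmd <yyyy> [tres]" >&2; return 2 ;;
    esac
    local yy=${year#??}          # last two digits, as in start=1/1/21
    local cmd="sreport cluster AccountUtilization cluster=ctit -P"
    if [ -n "$tres" ]; then
        cmd="$cmd --tres=\"$tres\""
    fi
    echo "$cmd start=1/1/$yy end=12/31/$yy > Utilisation_$year"
}
```

For example, `util_report_cmd 2021` prints the CPU report command and `util_report_cmd 2021 gres/gpu` prints the GPU variant. Both redirect to the same file name, as in the original commands, so rename one of them if both reports are needed.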
scontrol create reservation starttime=2022-03-23T8:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL
squeue -ho %A -t R | xargs -n 1 scancel
scontrol shutdown
sudo scontrol update NodeName=<node_name> State=DOWN Reason="undraining"
sudo scontrol update NodeName=<node_name> State=RESUME
sudo dmidecode -s system-serial-number
To keep power usage low, compute nodes that are not being used power down after a certain amount of time. These settings are located in the slurm.conf file; see SuspendTime for the actual time. To disable this functionality, change the following line:
#SuspendExcParts=debug
to:
SuspendExcParts=debug
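The same change can be previewed with sed before the file is touched. This is a minimal sketch: `uncomment_suspend_exc` is a hypothetical helper, the location of slurm.conf is not given here and must be supplied, and the edited configuration is written to standard output instead of being changed in place.

```shell
#!/usr/bin/env bash
# uncomment_suspend_exc (hypothetical helper): print a copy of the given config
# with the "#SuspendExcParts=debug" line uncommented. The file is not modified.
# usage: uncomment_suspend_exc <path-to-slurm.conf>
uncomment_suspend_exc() {
    local conf=${1:-}
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    sed 's/^#[[:space:]]*\(SuspendExcParts=debug\)[[:space:]]*$/\1/' "$conf"
}
```

Comparing the output with the original, for example `uncomment_suspend_exc <path-to-slurm.conf> | diff <path-to-slurm.conf> -`, shows exactly which line would change. After the real file is edited, the Slurm daemons still have to reread the configuration, for example with `scontrol reconfigure`.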
To power up a node temporarily, execute the following command:
sudo scontrol update node=<nodelist> state=POWER_UP
or power it down with:
sudo scontrol update node=<nodelist> state=POWER_DOWN