Slurm admin guide
This is a short guide for Slurm administrators.
Accounting
To manage access and priority on the cluster, Slurm accounting is used to adjust priority, set limits, and grant rights to the cluster's resources. The account structure follows the faculties and their research groups (and, optionally, projects). Resource usage is also tracked for statistics and debugging purposes.
Fair Share
Groups that have invested in the EEMCS-HPC cluster get a higher priority factor in proportion to their investment. The factor remains valid for as long as the purchased hardware is in use. For example, an investment of €30K results in a factor of 30x. The admins track the purchases and the availability of the purchased hardware.
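The investment example can be sketched as a small calculation. This is a minimal sketch that assumes the factor scales linearly (one unit per €1000, so that €30K gives 30) and that it is applied through the fairshare value of the account; neither assumption is confirmed by this guide. The helper names fairshare_factor and fairshare_cmd are hypothetical, and the command is printed for review instead of being executed.

```shell
#!/bin/sh
# Hypothetical helper: derive a priority factor from an investment in euros.
# Assumption: one unit of factor per EUR 1000 (EUR 30000 -> 30).
fairshare_factor() {
    # Integer division; amounts below EUR 1000 yield 0.
    echo $(( $1 / 1000 ))
}

# Print (do not run) the sacctmgr command that would set the factor
# as the fairshare value of the given account.
fairshare_cmd() {
    account=$1
    investment=$2
    echo "sacctmgr modify account where name=$account set fairshare=$(fairshare_factor "$investment")"
}

# Example with the dmb account and a EUR 30000 investment.
fairshare_cmd dmb 30000
```

Printing the command keeps the sketch safe to run on a machine without Slurm and lets an admin check the value before applying it.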
Reports
Use the Slurm sreport utility to generate reports:
sreport cluster AccountUtilizationByUser cluster user=<username> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilizationByUser cluster account=<research-group or faculty> start=2020-03-25 end=2020-03-25
sreport cluster AccountUtilization cluster=ctit start=1/1/21 end=12/31/21 > Utilisation_2021
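A yearly utilisation report like the last command above could be generated for any year with a small helper. This is only a sketch: the function name yearly_report_cmd is hypothetical, it reuses the MM/DD/YY date form from the example, and it prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the sreport command for one calendar year.
yearly_report_cmd() {
    cluster=$1
    year=$2
    # Two-digit year, e.g. 2021 -> 21, 2008 -> 08.
    yy=$(printf '%02d' $(( year % 100 )))
    echo "sreport cluster AccountUtilization cluster=$cluster start=1/1/$yy end=12/31/$yy"
}

# Example: the 2021 report for the ctit cluster.
yearly_report_cmd ctit 2021
```

The printed command can be redirected to a file such as Utilisation_2021 once it has been checked.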
Modify Accounts or Users
Use the Slurm sacctmgr utility to add, modify, or remove accounts or users:
sacctmgr create account name=<account-name> parent=<parent-account-name> fairshare=1
sacctmgr --immediate add user <user-name> account=<account-name>
sacctmgr modify user where name=<user-name> set DefaultQOS='<default-qos>'
sacctmgr modify user where name=<user-name> set QOS+='<additional-qos>'
sacctmgr show user name=<user-name>
Reservations
For high-priority workloads, reservations can be created. Users need to specify the reservation name when submitting jobs (for example with sbatch --reservation=test) to use them.
scontrol create reservationname=test start=now duration=1 user=user1 partition=gpu tres=cpu=24,node=4
Reservations can also be used to block users from running jobs during maintenance windows.
scontrol create reservation starttime=2022-03-23T8:00:00 duration=480 user=root flags=maint,ignore_jobs nodes=ALL
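The duration in the maintenance command is given in minutes, so 480 corresponds to an 8-hour window. A minimal sketch of a helper that does this conversion is shown below; the name maint_reservation_cmd is hypothetical, and the command is printed for review instead of being executed.

```shell
#!/bin/sh
# Hypothetical helper: print a maintenance reservation command.
# scontrol reads a bare duration number as minutes, so hours are converted.
maint_reservation_cmd() {
    start=$1    # start time, e.g. 2022-03-23T08:00:00
    hours=$2    # length of the maintenance window in hours
    echo "scontrol create reservation starttime=$start duration=$(( hours * 60 )) user=root flags=maint,ignore_jobs nodes=ALL"
}

# Example: an 8-hour window, which gives duration=480.
maint_reservation_cmd 2022-03-23T08:00:00 8
```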
Unresponsive Nodes
A node can fail for unforeseen reasons, such as software bugs, reached limits, or other unknown causes. Slurm then places the node in a state that blocks jobs from running on it until the problem is solved. Check the slurmctld logs on the head node, the slurmd logs on the compute node(s), and the syslog file.
To undrain a node, use the following commands:
sudo scontrol update NodeName=node10 State=DOWN Reason="undraining"
sudo scontrol update NodeName=node10 State=RESUME
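When several nodes are affected, the same two commands can be produced per node. The sketch below is an illustration only: the function name undrain_cmds and the node name node11 are hypothetical, and the commands are printed so they can be checked before use. The reason Slurm recorded for each down or drained node can be listed beforehand with sinfo -R.

```shell
#!/bin/sh
# Hypothetical helper: print the two undrain commands for each node given.
undrain_cmds() {
    for node in "$@"; do
        echo "sudo scontrol update NodeName=$node State=DOWN Reason=\"undraining\""
        echo "sudo scontrol update NodeName=$node State=RESUME"
    done
}

# Example: two nodes, four printed commands.
undrain_cmds node10 node11
```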
Slurm wrappers/scripts
sacctmgr-dump
This small wrapper dumps the account database to the screen and to a dated file.
sacctmgr-dump
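The source of the wrapper is not reproduced in this guide. As a rough idea of what such a wrapper might do, the sketch below builds a dated file name and prints a sacctmgr dump command followed by a cat of the result. The function names, the file name pattern, and the use of ctit as the cluster name are assumptions for illustration, not the actual implementation.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual sacctmgr-dump wrapper.

# Build a dated file name; the date defaults to today (YYYY-MM-DD).
dump_file_name() {
    cluster=$1
    day=${2:-$(date +%Y-%m-%d)}
    echo "sacctmgr-dump-$cluster-$day.cfg"
}

# Print the commands that would write the dump to the dated file
# and then show it on screen.
dump_cmds() {
    file=$(dump_file_name "$1" "$2")
    echo "sacctmgr dump $1 file=$file"
    echo "cat $file"
}

# Example with a fixed date so the file name is predictable.
dump_cmds ctit 2022-03-23
```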
sacctmgr-move-user
This small wrapper helps admins move users between accounts. By default, all users land in the ctit account. To allocate the limited resources and keep track of usage correctly, users need to be placed in the correct accounts. These accounts follow the faculties and research groups (and, optionally, projects) in a hierarchy.
sacctmgr-move-user <accountfrom> <accountto> <username>
For example:
sacctmgr-move-user ctit dmb laanstragj
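The wrapper's source is not shown in this guide. A move could plausibly consist of three sacctmgr steps: add the user to the new account, make that account the default, and delete the association with the old account. The sketch below prints these steps for review; the function name move_user_cmds and the exact sequence are assumptions, not the wrapper's actual implementation.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual sacctmgr-move-user wrapper.
# Argument order matches the wrapper: <accountfrom> <accountto> <username>.
move_user_cmds() {
    from=$1
    to=$2
    user=$3
    # 1. Associate the user with the new account.
    echo "sacctmgr --immediate add user $user account=$to"
    # 2. Make the new account the user's default.
    echo "sacctmgr --immediate modify user where name=$user set DefaultAccount=$to"
    # 3. Delete the association with the old account.
    echo "sacctmgr --immediate delete user name=$user account=$from"
}

# Example: the same move as above, from ctit to dmb.
move_user_cmds ctit dmb laanstragj
```

Adding the new association before deleting the old one means the user always belongs to at least one account during the move.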