(EEMCS) HPC Cluster
Introduction
One of the clusters at the University is the (EEMCS) HPC Cluster. This cluster, funded by the DSI research institute (formerly known as CTIT), started in 2017 as a joint operation of several research groups working on deep learning / AI methods. Over the years more groups joined and the cluster was expanded with several more nodes containing multi CPU/GPU combinations. The current participating faculties are the BMS, EEMCS, TNW and ET faculties of the University of Twente.
This HPC cluster is a collection of many separate servers (computers), called compute nodes, which are connected via a fast interconnect.
There may be different types of nodes for different types of tasks. The HPC cluster listed on this wiki has
- a headnode or login node, where users log in
- multiple data nodes
- CPU compute nodes
- “fat” CPU compute node with 2TB of memory
- CPU+GPU compute nodes (on these nodes computations can run both on CPU cores and on Graphics Processing Unit (GPU) cards)
All cluster nodes have the same components as a laptop or desktop: CPU cores, memory and disk space. The difference between a personal computer and a cluster node is in the quantity, quality and power of the components.
For more information on the list of used hardware for this cluster, see the EEMCS-HPC Hardware page.
This cluster is based on the Slurm scheduler 21.08.5 running on Ubuntu 22.04 LTS.
Login Nodes
You can connect to one of the headnodes : hpc-head1.ewi.utwente.nl or hpc-head2.ewi.utwente.nl
See the connection info page on how to connect.
Do NOT log in to the compute nodes, either directly or through ssh; log in ONLY on one of the head nodes!
Slurm Scheduler
To monitor the jobs and progress you can use the EEMCS-HPC Slurm dashboard page or the available command line tools like squeue or scontrol.
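As a minimal sketch of the command line tools mentioned above, the two helper functions below wrap standard Slurm commands. The function names my_jobs and job_details are made up for this example; only squeue and scontrol themselves are Slurm tools.

```shell
# Hypothetical helper names; squeue and scontrol are the actual Slurm tools.

# List your own pending and running jobs.
my_jobs() {
    squeue --user="$(id -un)"
}

# Show the full scheduler record (state, nodes, time limit, ...) of one job.
# Usage: job_details <jobid>
job_details() {
    scontrol show job "$1"
}
```

Run these on a head node, for example `my_jobs` followed by `job_details <jobid>` for a job number taken from the squeue output.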
To use specific resources please check the EEMCS-HPC features, resources and partitions page.
See the Slurm/HPC scheduler info page for more information.
Maintenance
Upcoming maintenance :
- 23 Oct 2024 software updates
- xx Nov 2024 network/storage
During the maintenance day, the whole cluster will go offline.
Alternatives
For smaller experiments and interactive jobs, please try other resources or external providers.
Access
Who has access?
- Members of the following groups have a raised priority :
  - BMS
  - EEMCS
  - EEMCS-Students
  - TNW
- Employees, students and guests who are not members are allowed as well, but get a basic priority factor.
To get access, you need an AD account of the University of Twente. All students and employees have such an account, and one can be arranged for external persons. To get your AD account enabled for this cluster, contact one of the contact persons.
Partitions
Access to the group partitions listed below is limited to the funders during the first year of investment; the funded nodes can be reached through these partitions.
The HPC/SLURM cluster contains multiple common partitions :
Partition name | available to |
main | All (default) |
dmb | eemcs-dmb |
ram | eemcs-ram |
bdsi | bms-bdsi |
mia | eemcs-mia |
am | eemcs-(dmmp/macs/mast/mia/mms/sor/stat) |
mia-pof | eemcs-mia & tnw-pof |
students | eemcs-students |
* For now the students partition is only for course-related work; students doing BSc and/or MSc work get access to the partition of the related research group.
See the partition option on the EEMCS-HPC specifics page for how to select these partitions.
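A minimal sketch of selecting a partition with standard Slurm options, assuming your account is allowed on the partition you name. The function names and the job script argument are placeholders for this example; cluster-specific options are described on the EEMCS-HPC specifics page.

```shell
# Hypothetical helper names; sinfo and sbatch are the actual Slurm tools.

# Show a summary (node counts and state) of one partition; defaults to "main".
# Usage: partition_summary [partition]
partition_summary() {
    sinfo --summarize --partition="${1:-main}"
}

# Submit a job script to a specific partition instead of the default one.
# Usage: submit_to <partition> <jobscript>
submit_to() {
    sbatch --partition="$1" "$2"
}
```

Inside a job script the same choice can be written as a `#SBATCH --partition=<name>` line.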
HPC Priority
The participating groups have invested in the HPC cluster and therefore have a higher priority than groups that do not participate.
To gain more priority, your group can invest in the HPC cluster. Depending on the kind of investment, this will result in :
- Sole usage of the purchased compute node(s) for the time span of approx. one year.
- A higher priority factor related to the total amount of investment; retirement of the hardware (after roughly 8 years) will reduce the priority factor.
- Participation in a higher Quality of Service (QOS).
This combination will guarantee more priority and calculation time on the cluster.
Please consult one of the contact persons listed below for this.
Contact persons
- Geert Jan Laanstra (EEMCS-DMB/SCS)
- Jan Flokstra (EEMCS-DMB/HMI)
- t.b.d.
- Frederik Reenders (LISA-ITO, only for calamities)
Admin Page
See the EEMCS-HPC Admin page for more information.
Credentials
Accounts
For staff, the username is typically your family name followed by your initials; for students it is your student number prefixed with “s”; guest account names start with “x”.
DSI Computing Lab does not store your password and is unable to reset it. If you require password assistance, please visit the ICTS/LISA Servicedesk.
Mailing list
For the HPC/SLURM cluster, two mailing lists have been created :
- EEMCS-Hpc-Cluster-Users (all the users)
- EEMCS-Hpc-Cluster-Managers (all the managers)
Connecting to the cluster
Access to DSI Computing Lab resources is provided via secure shell (SSH) login.
Most Unix-like operating systems (macOS, Linux, etc.) provide an ssh utility by default, which can be used by typing the command ssh in a terminal window.
You can connect to one of the headnodes : hpc-head1.ewi.utwente.nl or hpc-head2.ewi.utwente.nl
See the connecting page for more information.
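A minimal sketch of an SSH login to one of the two head nodes named above. The function names are made up for this example, and the username you pass in is your own AD account name.

```shell
# Hypothetical helper names; the host names are the two head nodes of the cluster.

# Build the SSH target for head node 1 (default) or 2.
# Usage: hpc_target <username> [1|2]
hpc_target() {
    printf '%s@hpc-head%s.ewi.utwente.nl\n' "$1" "${2:-1}"
}

# Open a login shell on the chosen head node.
# Usage: hpc_login <username> [1|2]
hpc_login() {
    ssh "$(hpc_target "$@")"
}
```

For example, `hpc_login <username> 2` connects to hpc-head2.ewi.utwente.nl; typing `ssh <username>@hpc-head2.ewi.utwente.nl` directly does the same.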
Setting up
Software
The cluster machines run Ubuntu Server 22.04 LTS; some basic packages from the repositories have been installed. Additional software is available using module files.
See the EEMCS-HPC Software page for more information.
Storage
The following folders are available :
- Network wide personal folder :
  - /home/<username> Home folder : This is where you store your code and project related materials; a “small” amount of data is allowed within your home folder.
    - Don’t keep data for long periods; remove bad results as soon as possible.
    - If it contains a static dataset, it should be moved to /deepstore/datasets/……
- Network wide global folder :
  - /deepstore/datasets Dataset folder : This is the location for mainly static and/or large data(sets).
    - Datasets should be stored here, preferably not in your user folder.
    - New folders are available on request.
- Network wide global folder :
  - /deepstore/projects Projects folder : Shared directories for projects, shared work area.
    - New folders are available on request.
- Local scratch folder :
  - /local/<jobid> (preferred) or /local/<username> or /local/<projectname> Scratch folder : Use this space to store intermediate data during a job run to speed up processing and reduce network traffic.
    - At the start of your job, you can create a local folder and store temporary data there.
    - At the end of the job, you should remove your data and the folders you created.
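The scratch folder workflow described above could look roughly like the sketch below. The function names and the SCRATCH_ROOT variable are assumptions made for this example; SLURM_JOB_ID is the variable Slurm sets inside a running job.

```shell
# SCRATCH_ROOT is a hypothetical override, useful for trying this outside the
# cluster; on a compute node the default /local is the scratch location.

# Create the per-job scratch folder and print its path.
# Falls back to the user name when not running inside a Slurm job.
make_scratch() {
    scratch_dir="${SCRATCH_ROOT:-/local}/${SLURM_JOB_ID:-$(id -un)}"
    mkdir -p "$scratch_dir" && printf '%s\n' "$scratch_dir"
}

# Remove the scratch folder again; call this at the end of the job.
# Usage: remove_scratch <path>
remove_scratch() {
    [ -n "$1" ] && rm -rf -- "$1"
}

# Typical use inside a job script:
#   scratch=$(make_scratch)
#   trap 'remove_scratch "$scratch"' EXIT
```

Registering the cleanup with trap means the folder is also removed when the job script stops early because of an error.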
Quota
A quota is active on the /home/<username> folder; this means we limit the amount of data in your personal folder.
- Below 1TB : This is fine. Keep your data size below this threshold and clean up where possible!
- Between 1TB and 2TB : You will get a warning once your folder exceeds 1TB. A grace period of 4 weeks then applies, after which writing will be blocked.
- Over 2TB : Writing is blocked; you will have to remove data.
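The three quota bands can be expressed as a small check, sketched below. It assumes 1TB means 1024^4 bytes (the wiki does not say whether the quota counts in powers of 1000 or 1024) and it does not model the 4 week grace period; the function name quota_state is made up for this example.

```shell
# Assumption: 1TB is taken as 1024^4 bytes.
TB=$((1024 * 1024 * 1024 * 1024))

# Classify a home folder size (in bytes) into the three quota bands.
# Usage: quota_state <bytes>   (prints ok, warning or blocked)
quota_state() {
    if [ "$1" -le "$TB" ]; then
        echo ok
    elif [ "$1" -le $((2 * TB)) ]; then
        echo warning
    else
        echo blocked
    fi
}

# Rough check of your own home folder (GNU du, size in bytes):
#   quota_state "$(du -sb "$HOME" | cut -f1)"
```

The du figure is only an approximation of what the quota system counts, so treat the result as an early warning rather than an exact reading.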
Submitting Jobs
Batch Jobs
Slurm sbatch is used to submit a job script for later execution.
The script will typically contain the scheduler parameters, setup commands and the processing task(s) or (if required) multiple Slurm srun commands to launch parallel tasks.
See the Slurm sbatch and Slurm srun wiki pages for more details.
Before submitting jobs, please note the maximum number of jobs and resources allowed by your account's Quality of Service (QOS).
These numbers can be obtained from the QOS tab on the EEMCS-HPC Slurm dashboard page.
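A minimal sketch of a batch job, assuming the default main partition and small placeholder resource requests; adjust every value to your own job and QOS limits. The snippet writes a job script named example_job.sh (a made-up name) that contains scheduler parameters, a place for setup commands and one srun task.

```shell
# Write a small example job script; all values in it are placeholders.
cat > example_job.sh <<'EOF'
#!/bin/bash
# Name shown in squeue
#SBATCH --job-name=example
# Default partition, available to all users
#SBATCH --partition=main
# Wall-clock time limit (hh:mm:ss)
#SBATCH --time=00:10:00
# One task with two CPU cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
# Memory for the job
#SBATCH --mem=4G
# Log file; %j is replaced by the job number
#SBATCH --output=slurm-%j.out

# Setup commands (for example loading modules) go here.

# Processing task, launched through srun.
srun hostname
EOF

# Submit it from a head node with:
#   sbatch example_job.sh
```

After submission sbatch prints the job number, which is the number used by the monitoring commands.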
Interactive Jobs
It is possible to request an interactive job, in which you can run small experiments. Use this only for a short time (max 1 hour).
For this you can use the additional Slurm sinteractive command.
Monitoring Jobs
The following commands are located in the software module monitor/node; you should load this module beforehand. Check the Monitoring Computenodes page for more information.
During the job
You can monitor your jobs using the following tools :
- realtime CPU monitor : top-node <nodename>
- realtime GPU monitor : nvtop-node <nodename>
- snapshot GPU monitor : nvidia-smi-node <nodename>
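The monitors above need a node name. A minimal sketch of looking it up from a job number with standard squeue options is given below; the function names are made up for this example, and top-node is assumed to be on the path once the monitor/node module is loaded.

```shell
# Hypothetical helper names; squeue is the Slurm tool, top-node comes from
# the monitor/node module.

# Print the node(s) a job is running on.
# Usage: job_nodes <jobid>
job_nodes() {
    squeue --noheader --jobs="$1" --format=%N
}

# Start the realtime CPU monitor on the node of a running job.
# Usage: watch_job_cpu <jobid>
watch_job_cpu() {
    module load monitor/node
    top-node "$(job_nodes "$1")"
}
```

As written this only works for a job that runs on a single node, because squeue prints a compressed node list for multi-node jobs.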
After the job
When your job is finished you can check the :
- content of your job's logfile
- job's efficiency using : seff <jobnumber>
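Both checks can be combined in one small helper, sketched below. It assumes the job wrote to Slurm's default log name slurm-<jobnumber>.out and that you run it from the directory the job was submitted from; the function name job_report is made up for this example.

```shell
# Assumption: default Slurm log name (slurm-<jobnumber>.out) in the current directory.
# Usage: job_report <jobnumber>
job_report() {
    tail -n 20 "slurm-$1.out"   # last lines of the job's logfile
    seff "$1"                   # efficiency summary of the finished job
}
```

If your job script sets its own output file name, pass that file to tail instead.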