====== (EEMCS) HPC Cluster ======

===== Introduction =====

One of the clusters at the University is the (EEMCS) HPC Cluster. This cluster, funded by the DSI research institute (formerly known as CTIT), started in 2017 as a joint operation of several research groups to work on deep learning / AI methods. Over the years more groups joined and the cluster was expanded with several more nodes containing multi CPU/GPU combinations. The current participating faculties are the BMS, EEMCS, TNW and ET faculties of the University of Twente.

This HPC cluster is a collection of many separate servers (computers), called compute nodes, which are connected via a fast interconnect. There may be different types of nodes for different types of tasks. The HPC cluster listed on this wiki has
  * a headnode or login node, where users log in
  * multiple data nodes
  * CPU compute nodes
  * a "fat" CPU compute node with 2TB of memory
  * CPU+GPU compute nodes (on these nodes computations can be run both on CPU cores and on Graphical Processing Unit cards)

All cluster nodes have the same components as a laptop or desktop: CPU cores, memory and disk space. The difference between a personal computer and a cluster node is in the quantity, quality and power of the components.

For more information on the hardware used in this cluster, see the **[[eemcs-hpc:hardware|EEMCS-HPC Hardware page]]**.

This cluster is based on the **[[https://slurm.schedmd.com/archive/slurm-21.08.5/|Slurm scheduler 21.08.5]]** running on Ubuntu 22.04 LTS.

===== Login Nodes =====

You can connect to one of the headnodes : **hpc-head1.ewi.utwente.nl** or **hpc-head2.ewi.utwente.nl**\\
See the **[[:connecting| connection info page]]** on how to connect.\\
''**Do NOT log in to the compute nodes, either directly or through ssh; log in ONLY on one of the head nodes !!!!**''

===== Slurm Scheduler =====

To monitor jobs and their progress you can use the **[[http://hpc-status.ewi.utwente.nl/slurm|EEMCS-HPC Slurm dashboard page]]** or the available command line tools like **squeue** or **scontrol**.

To use specific resources please check the **[[eemcs-hpc:specifics|EEMCS-HPC features, resources and partitions]]** page.

See the **[[slurm:start|Slurm/HPC scheduler info page]]** for more information.

===== Maintenance =====

Upcoming maintenance :
  * 23 Oct 2024 software updates
  * xx Nov 2024 network/storage

During the maintenance day, the whole cluster will go offline.

===== Alternatives =====

For smaller experiments and interactive jobs, please try other resources like :
  * [[https://jupyter.utwente.nl|Jupyter Lab (Utwente)]], [[https://jupyter.wiki.utwente.nl/|Jupyter Lab Wiki (Utwente)]]
or external providers like :
  * [[https://www.kaggle.com|Kaggle (External)]]
  * [[https://colab.research.google.com|Google Colab (External)]]

===== Access =====

==== Who has access? ====

  * Every member of the following groups will have a raised priority :
    * BMS
      * [[https://bdsi.bms.utwente.nl/|BMS-BDSI]]
    * EEMCS
      * [[https://www.utwente.nl/en/eemcs/dmb/|EEMCS-DMB]]
      * [[https://www.utwente.nl/en/eemcs/fmt/|EEMCS-FMT]]
      * [[https://www.utwente.nl/en/eemcs/hmi/|EEMCS-HMI]]
      * [[https://www.utwente.nl/en/eemcs/sacs/people/sort-chair/?category=macs|EEMCS-SACS-MACS]]
      * [[https://www.utwente.nl/en/eemcs/sacs/people/sort-chair/?category=mast|EEMCS-SACS-MAST]]
      * [[https://www.utwente.nl/en/eemcs/sacs/people/sort-chair/?category=mia|EEMCS-SACS-MIA]]
      * [[https://www.utwente.nl/en/eemcs/sacs/people/sort-chair/?category=mms|EEMCS-SACS-MMS]]
      * [[https://www.utwente.nl/en/eemcs/sor/|EEMCS-MOR-SOR]]
      * [[https://www.utwente.nl/en/eemcs/dmmp/|EEMCS-MOR-DMMP]]
      * [[https://www.utwente.nl/en/eemcs/stat/|EEMCS-MOR-STAT]]
      * [[https://www.utwente.nl/en/eemcs/ps/|EEMCS-PS]]
      * [[https://www.ram.eemcs.utwente.nl/|EEMCS-RAM]]
      * [[https://www.utwente.nl/en/eemcs/scs/|EEMCS-SCS]]
      * EEMCS-Students
    * TNW
      * [[https://pof.tnw.utwente.nl/|TNW-POF]]
      * [[https://www.utwente.nl/en/tnw/bmpi/|TNW-BMPI]]
  * Employees, students and guests who are not members of these groups are allowed as well, but will get a basic priority factor.

To get access, you need to have an AD account of the University of Twente. All students and employees have such an account, and one can be arranged for external persons. To get your AD account enabled for these clusters, you need to contact one of the contact persons.

==== Partitions ====

Access to some of the following partitions is limited to their funders during the first year of investment; the funders reach these resources through their own partitions.
The HPC/SLURM cluster contains multiple common partitions :

| **Partition name** | **Available to** |
| main | All (default) |
| dmb | eemcs-dmb |
| ram | eemcs-ram |
| bdsi | bms-bdsi |
| mia | eemcs-mia |
| am | eemcs-(dmmp/macs/mast/mia/mms/sor/stat) |
| mia-pof | eemcs-mia & tnw-pof |
| students | eemcs-students |

  * For now the students partition is only for course-related work; BSc and/or MSc students will have access to the partition of the related research group.

Check the [[eemcs-hpc:specifics#partitions|EEMCS-HPC specifics, partition option]] on how to select these.

===== HPC Priority =====

The participating groups have invested in the HPC cluster and therefore have a higher priority than groups that do not participate.\\
To gain more priority, your group can invest in the HPC cluster. Depending on the kind of investment this will result in :
  * Sole usage of the purchased compute node(s) for a time span of approx. one year.
  * A higher priority factor related to the total amount of investment; retirement of hardware (after roughly 8 years) will reduce the priority factor.
  * Participation in a higher Quality of Service.

This combination will guarantee more priority and calculation time on the cluster. Please consult the corresponding contact person for this :

===== Contact persons =====

  * [[https://people.utwente.nl/g.j.laanstra|Geert Jan Laanstra]] (EEMCS-DMB/SCS)
  * [[https://people.utwente.nl/jan.flokstra|Jan Flokstra]] (EEMCS-DMB/HMI)
  * t.b.d.
  * [[https://people.utwente.nl/f.reenders|Frederik Reenders]] (LISA-ITO, only for calamities)

==== Admin Page ====

See the **[[eemcs-hpc:admin|EEMCS-HPC Admin page]]** for more information.

===== Credentials =====

==== Accounts ====

For staff, the username is probably your family name followed by your initials; for students it is your student number starting with an "s"; guest accounts start with an "x".
DSI Computing Lab does not store your password and is unable to reset it. If you require password assistance, please visit the [[https://www.utwente.nl/nl/lisa/ict/servicedesk|ICTS/LISA Servicedesk]].

==== Mailing list ====

For the HPC/SLURM cluster, two mailing lists have been created :
  * EEMCS-Hpc-Cluster-Users (all the users)
  * EEMCS-Hpc-Cluster-Managers (all the managers)

===== Connecting to the cluster =====

Access to DSI Computing Lab resources is provided via secure shell (SSH) login.

Most Unix-like operating systems (Mac OS X, Linux, etc.) provide an **ssh** utility by default that can be accessed by typing the command **ssh** in a terminal window.

You can connect to one of the headnodes : **hpc-head1.ewi.utwente.nl** or **hpc-head2.ewi.utwente.nl**

See the **[[connecting|connecting page]]** for more information.

===== Setting up =====

==== Software ====

The cluster machines run Ubuntu Server 22.04 LTS; some basic packages from the repositories have been installed. Additional software is available using module files.

See the **[[eemcs-hpc:software|EEMCS-HPC Software page]]** for more information.

==== Storage ====

The following folders are available :
  * Network wide personal folder :
    * **/home/** Home folder : This is where you store your code and project related materials; a "small" amount of data is allowed within your home folder.
      * Don’t keep data for longer periods, get rid of bad results as soon as possible.
      * If it contains a static dataset, it should be moved to /deepstore/datasets/……
  * Network wide global folder :
    * **/deepstore/datasets** Dataset folder : This is the location for mainly static and/or large data(sets).
      * Datasets should be stored here, preferably not in your user folder.
      * New folders are available on request.
  * Network wide global folder :
    * **/deepstore/projects** Projects folder : Shared directories for projects, shared work area.
      * New folders are available on request.
  * Local scratch folder :
    * **/local/** (preferred) or **/local/** or **/local/** scratch folder : use this space to store intermediate data during a job run to speed up processing and reduce network traffic.
      * At the start of your job, you can create a local folder and store temporary data here.
      * At the end of the job, you should remove your data and the folders you created.

=== Quota ===

Quota is activated on the **/home/** folder; this means we limit the amount of data in your personal folder.
  * Below 1TB : This is fine, keep your data size below this threshold, clean up if possible !
  * Between 1TB and 2TB : You will get a warning once your folder exceeds 1TB. The warning comes with a grace period of 4 weeks, after which writing will be blocked.
  * Over 2TB : Writing will be blocked; you will have to remove data.

==== Submitting Jobs ====

=== Batch Jobs ===

**[[slurm:sbatch|Slurm sbatch]]** is used to submit a job script for later execution.\\
The script will typically contain the scheduler parameters, setup commands and the processing task(s) or (if required) multiple **[[slurm:srun|Slurm srun]]** commands to launch parallel tasks.\\
See the **[[slurm:sbatch|Slurm sbatch]]** and **[[slurm:srun|Slurm srun]]** wiki pages for more details.

Before submitting jobs please note the maximum number of jobs and resources related to your account's Quality Of Service (QOS).\\
These numbers can be obtained from the QOS tab on the **[[http://hpc-status.ewi.utwente.nl/slurm|EEMCS-HPC Slurm dashboard page]]**.

=== Interactive Jobs ===

It is possible to request an interactive job, within which you can run small experiments. Use this only for a short time (max 1 hour).\\
For this you can use the additional **[[slurm:sinteractive|Slurm sinteractive]]** command.

==== Monitoring Jobs ====

The following commands are located in the software module **monitor/node**; you should load it beforehand.
Check the **[[eemcs-hpc:monitor|Monitoring Computenodes]]** page for more information.

=== During the job ===

You can monitor your jobs using the
  * **[[http://hpc-status.ewi.utwente.nl/slurm|EEMCS-HPC Slurm dashboard page]]**
  * realtime CPU monitor : **top-node **
  * realtime GPU monitor : **nvtop-node **
  * snapshot GPU monitor : **nvidia-smi-node **

=== After the job ===

When your job is finished you can check the :
  * content of your job's logfile
  * job's efficiency using : **seff **
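
The two checks after a job can be combined in a small script. The following is only a minimal sketch, not an official cluster tool. It assumes that the job script sets no ''--output'' option, so the logfile has the Slurm default name ''slurm-<jobid>.out'' in the submit directory, and that **seff** takes the job ID as its argument and is on your PATH. The function names and the job ID 12345 are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def default_log_path(job_id: int, directory: str = ".") -> Path:
    """Logfile that sbatch writes by default (slurm-<jobid>.out).

    Assumption: the job script does not set its own --output option.
    """
    return Path(directory) / f"slurm-{job_id}.out"


def seff_command(job_id: int) -> list:
    """Command line for the efficiency report of one job."""
    return ["seff", str(job_id)]


def post_job_report(job_id: int, directory: str = ".", tail_lines: int = 20) -> str:
    """Return the tail of the job's logfile followed by the seff output."""
    parts = []

    log_path = default_log_path(job_id, directory)
    if log_path.is_file():
        lines = log_path.read_text(errors="replace").splitlines()
        parts.append(f"--- last lines of {log_path} ---")
        parts.extend(lines[-tail_lines:])
    else:
        parts.append(f"no logfile found at {log_path}")

    # seff is only available where the Slurm tools are installed.
    if shutil.which("seff"):
        result = subprocess.run(
            seff_command(job_id), capture_output=True, text=True
        )
        parts.append(result.stdout.strip())
    else:
        parts.append("seff not found on this machine")

    return "\n".join(parts)


if __name__ == "__main__":
    # 12345 is a placeholder; use the job ID reported by sbatch.
    print(post_job_report(12345))
```

Run it on one of the head nodes from the directory where you submitted the job. If your job script writes its log elsewhere, pass that directory or adapt ''default_log_path''.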