Monitoring

Several modules are available on the EEMCS-HPC cluster for monitoring jobs and computing hardware.

The scheduler-based utilities can be loaded with the following command:

module load slurm/utils

The compute node utilities can be loaded with the following command:

module load monitor/node

Scheduler/Jobs

Cluster dashboard

The overall status of jobs (queued/running), racks, mapping, partitions, QOS, and reservations can be viewed on the EEMCS-HPC Slurm dashboard page.

sinfo

This command is used to view partition and node information.
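
For example, to show a long, node-oriented listing of all nodes and their state, you can combine the standard -N and -l options:

sinfo -N -l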

squeue

This command is used to view job and job step information for jobs managed by the Slurm scheduler.
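
For example, to list only your own jobs, you can filter on your username:

squeue -u $USER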

scancel

This command is used to signal jobs or job steps that are under the control of the Slurm scheduler.
For example, to cancel a job, run the following command:

scancel <job-id>
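
To cancel all of your own jobs at once, you can pass your username instead of a job id:

scancel -u $USER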

scontrol

This command is used to view or modify the Slurm configuration and state.
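
For example, to show the full details of a specific job or node:

scontrol show job <job-id>
scontrol show node <node-name>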

seff

When a job has finished, its efficiency can be viewed with the seff Perl script:

seff <job-id>
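
seff reports, among other things, how efficiently the job used the CPU time and memory it was allocated, which helps in sizing future resource requests.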

Compute nodes

top-node

With this shell script you can see your CPU processes on the specified node:

top-node <node-name> [+optional top arguments]
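
For example, since extra arguments are passed through to top, you can use top's -u option to show only your own processes:

top-node <node-name> -u $USER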

nvtop-node

With this shell script you can see your GPU processes on the specified node:

nvtop-node <node-name> [+optional nvtop arguments]
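
For example, assuming the installed nvtop version supports the -d/--delay option (measured in tenths of a second), you can lower the refresh rate to once every five seconds:

nvtop-node <node-name> -d 50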

nvidia-smi-node

With this shell script you can request information about the GPUs on the specified node:

nvidia-smi-node <node-name> [+optional nvidia-smi arguments]
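
For example, extra arguments are passed through to nvidia-smi, so you can query the name, utilization, and memory usage of each GPU in CSV form:

nvidia-smi-node <node-name> --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv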