Monitoring
Several modules are available on the EEMCS-HPC cluster to monitor jobs and computing hardware.
The scheduler-based utilities can be loaded with the following command:
module load slurm/utils
The compute-node utilities can be loaded with the following command:
module load monitor/node
Scheduler/Jobs
Cluster dashboard
The overall status of jobs (queued/running), racks, mapping, partitions, QOS and reservations can be viewed by accessing the EEMCS-HPC Slurm dashboard page.
sinfo
This command is used to view partition and node information.
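For example, to show one line per node with detailed state information (standard sinfo options):
sinfo -N -l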
squeue
This command is used to view job and job step information for jobs managed by the Slurm scheduler.
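For example, to show only your own jobs:
squeue -u $USER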
scancel
This command is used to signal jobs or job steps that are under the control of the Slurm scheduler.
For example, to cancel your job, run the following command:
scancel <job-id>
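To cancel all of your jobs at once:
scancel -u $USER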
scontrol
This command is used to view or modify the Slurm configuration and state.
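For example, to show the full details of a specific job:
scontrol show job <job-id>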
seff
When a job has finished, its efficiency can be viewed with the seff Perl script:
seff <job-id>
Compute nodes
top-node
With this shell script you can see your CPU processes on the specified node:
top-node <node-name> [+optional top arguments]
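For example, to show only your own processes (using top's standard -u filter):
top-node <node-name> -u $USER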
nvtop-node
With this shell script you can see your GPU processes on the specified node:
nvtop-node <node-name> [+optional nvtop arguments]
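For example, to slow the refresh interval to one second (nvtop's -d option takes tenths of a second):
nvtop-node <node-name> -d 10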
nvidia-smi-node
With this shell script you can request information about the GPUs on the specified node:
nvidia-smi-node <node-name> [+optional nvidia-smi arguments]
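For example, to report GPU utilization and memory usage in CSV form (standard nvidia-smi query options):
nvidia-smi-node <node-name> --query-gpu=utilization.gpu,memory.used,memory.total --format=csv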