Monitoring
Several modules are available on the EEMCS-HPC cluster to monitor jobs and computing hardware.
The scheduler-based utilities can be loaded with the following command:
module load slurm/utils
The compute-node utilities can be loaded with the following command:
module load monitor/node
Scheduler/Jobs
Cluster dashboard
The overall status of jobs (queued/running), racks, mapping, partitions, QOS and reservations can be viewed by accessing the EEMCS-HPC Slurm dashboard page.
sinfo
This command is used to view partition and node information.
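For example, to show one line per node with detailed state information (standard sinfo options):
sinfo -N -l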
squeue
This command is used to view job and job step information for jobs managed by the Slurm scheduler.
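For example, to show only your own jobs:
squeue -u $USER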
scancel
This command is used to signal jobs or job steps that are under the control of the Slurm scheduler.
For example, to cancel your job, run the following command:
scancel <job-id>
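To cancel all of your jobs at once:
scancel -u $USER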
scontrol
This command is used to view or modify the Slurm configuration and state.
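For example, to show the full details of a specific job:
scontrol show job <job-id>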
seff
When a job has finished, its efficiency can be viewed with the seff Perl script:
seff <job-id>
Compute nodes
top-node
With this shell script you can see your CPU processes on the specified node:
top-node <node-name> [+optional top arguments]
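For example, to show only your own processes (using top's standard -u filter):
top-node <node-name> -u $USER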
nvtop-node
With this shell script you can see your GPU processes on the specified node:
nvtop-node <node-name> [+optional nvtop arguments]
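For example, to slow the refresh interval to one second (nvtop's -d option takes tenths of a second):
nvtop-node <node-name> -d 10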
nvidia-smi-node
With this shell script you can request information about the GPUs on the specified node:
nvidia-smi-node <node-name> [+optional nvidia-smi arguments]
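For example, to report GPU utilization and memory usage in CSV form (standard nvidia-smi query options):
nvidia-smi-node <node-name> --query-gpu=utilization.gpu,memory.used,memory.total --format=csv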