03-20, 11:30–11:40 (US/Eastern), Tsai Auditorium (CGIS S010)
An up-to-the-minute status report on the configuration and availability of HPC resources can be useful when selecting resource parameters for interactive apps. GPU node resources can run low at times, leaving few or no GPUs available for immediate use when an interactive app is launched. Providing the status of readily available GPU node resources, such as the type and number of GPUs, CPU cores, and memory in GB, helps users utilize cluster resources more efficiently and minimize job queue wait times.
Efficient selection and utilization of HPC resources are essential for minimizing job queue wait times and maximizing productivity. Selecting appropriate GPU resource parameters in interactive apps can be challenging, especially when GPU resources are limited. Without real-time information on GPU configurations and current availability, including GPU types, counts, CPU cores, and memory in GB, users may submit suboptimal resource requests that lead to longer job queue wait times. To address this challenge, we configured a Passenger app within the Open OnDemand framework that reports the real-time status of GPU compute node resources. Building upon the example Passenger app provided by the Ohio Supercomputer Center (https://github.com/OSC/ood-example-ps) and utilizing a custom script (https://github.com/tamu-edu/dor-hprc-ood-apps/tree/main/gpuavail), this status app queries the Slurm workload manager to retrieve current GPU node configurations and availability. The app displays the information in two tables:
- Configuration Table: Details the types and numbers of GPUs attached to compute nodes, along with the count of nodes sharing each GPU configuration.
- Availability Table: Lists GPU compute nodes and resources currently available for new jobs, including the GPU types and counts, CPU cores, and available memory in GB.
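For illustration, the sketch below shows the kind of Slurm queries such an app might perform: sinfo supplies the per-node GRES, CPU, and memory configuration for the Configuration Table, while `scontrol show node` exposes the configured versus allocated TRES needed for the Availability Table. This is not the gpuavail script itself; the GRES string format ("gpu:<type>:<count>") and the field parsing are simplifying assumptions that vary between sites.

```python
#!/usr/bin/env python3
"""Minimal sketch of the Slurm queries behind a GPU status app.

Not the gpuavail script (see the linked repository); it only illustrates
one way to gather data for the two tables. The GRES string format
"gpu:<type>:<count>" is an assumption and varies by site.
"""

import re
import subprocess
from collections import Counter


def run(cmd):
    """Run a Slurm command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def configuration_table():
    """Count nodes per (GPU type, GPUs/node, CPUs/node, memory GB/node)."""
    configs, seen = Counter(), set()
    # %N node, %G GRES, %c CPUs, %m memory in MB; -N prints one line per node
    for line in run(["sinfo", "-N", "-h", "-o", "%N|%G|%c|%m"]).splitlines():
        node, gres, cpus, mem = line.split("|")
        if node in seen:
            continue                      # nodes in several partitions repeat
        seen.add(node)
        for g in gres.split(","):         # a node may define several GRES types
            g = g.split("(")[0]           # drop socket affinity, e.g. "(S:0-1)"
            if not g.startswith("gpu:"):
                continue
            parts = g.split(":")          # "gpu:a100:4" or "gpu:4"
            gpu_type = parts[1] if len(parts) > 2 else "any"
            mem_gb = int(mem.rstrip("+")) // 1024
            configs[(gpu_type, int(parts[-1]), int(cpus), mem_gb)] += 1
    return configs


def tres_gpus(tres):
    """Extract the GPU count from a CfgTRES/AllocTRES string such as
    'cpu=48,mem=360000M,gres/gpu=4'; returns 0 if no GPUs are listed."""
    m = re.search(r"gres/gpu=(\d+)", tres)
    return int(m.group(1)) if m else 0


def availability_table():
    """List GPU nodes with unallocated GPUs, CPU cores, and memory in GB."""
    rows = []
    # 'scontrol show node -o' prints one space-separated key=value record per node
    for line in run(["scontrol", "show", "node", "-o"]).splitlines():
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        total_gpus = tres_gpus(fields.get("CfgTRES", ""))
        if total_gpus == 0:
            continue                      # not a GPU node
        free_gpus = total_gpus - tres_gpus(fields.get("AllocTRES", ""))
        free_cpus = int(fields.get("CPUTot", 0)) - int(fields.get("CPUAlloc", 0))
        free_mem = (int(fields.get("RealMemory", 0)) - int(fields.get("AllocMem", 0))) // 1024
        if free_gpus > 0:
            rows.append((fields["NodeName"], free_gpus, free_cpus, free_mem))
    return rows


if __name__ == "__main__":
    for (gtype, gpus, cpus, mem_gb), nodes in sorted(configuration_table().items()):
        print(f"config: {nodes} node(s) with {gpus}x {gtype}, {cpus} cores, {mem_gb} GB")
    for node, gpus, cpus, mem_gb in availability_table():
        print(f"available: {node}  {gpus} GPU(s), {cpus} cores, {mem_gb} GB free")
```

Deriving free resources from CfgTRES and AllocTRES, rather than from node state alone, means that partially allocated nodes with idle GPUs still appear as available, which mirrors the purpose of the Availability Table.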
The gpuavail Passenger app equips users with up-to-the-minute information on GPU resource availability, enabling informed decision-making when selecting resource parameters for GPU-supported interactive apps. By aligning resource requests with actual cluster availability, users can reduce job queue wait times and enhance overall cluster efficiency.