GOOD 2025

Integration of Open OnDemand with the Jobstats Job Monitoring Platform
03-19, 17:20–17:30 (US/Eastern), Tsai Auditorium (CGIS S010)

Given the ever-increasing cost of compute (especially GPUs), it is imperative that these resources are used efficiently. How can this be achieved in a simple-to-use way on a high-performance computing cluster that supports a large number of diverse researchers? Our solution is Jobstats, a job monitoring platform that integrates with Open OnDemand (OOD). This talk will provide an overview of the platform and its components, concentrating on its links with OOD. Our planned extensions for the OOD integration will be presented in the hope of receiving feedback and new ideas from attendees.


The inefficient use of compute resources is surely as old as the first Beowulf cluster. Typical causes of underutilization include jobs that over-allocate CPUs, CPU memory, or both. Such over-allocation is frequently due to a misunderstanding and in some cases is accidental. The introduction of GPUs has made the problem more pressing. Two common problems are (1) jobs that allocate GPUs but do not use them and (2) jobs that use their GPUs but at low utilization.

At Princeton, we have started to continuously monitor nodes (including the GPUs) and filesystems using various Prometheus exporters. Our integrated solution is called Jobstats. The monitoring data is combined with the workload manager (Slurm) data to produce job efficiency reports. These reports provide metadata about the job as well as CPU/GPU utilization, CPU/GPU memory usage, and job-specific notes to guide users.

The Jobstats platform includes an OOD Helper App. For a given job ID, the app generates a URL to a Grafana dashboard that shows various job-level and node-level metrics as a function of time. The job-level metrics include CPU/GPU utilization and CPU/GPU memory usage. Examples of the node-level metrics are the mean CPU frequencies, GPFS bandwidth statistics, and the number of InfiniBand errors. The dashboard is helpful for understanding why a job failed and for troubleshooting system issues. Jobstats also makes it possible to see the high-level job efficiency data on the “Completed Jobs” page of the OOD web interface. The “Active Jobs” page shows plots of CPU utilization and memory usage with a link to the detailed job statistics on Grafana.
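
As a rough illustration of how a job ID can be turned into a dashboard link, the following is a minimal sketch and not the Jobstats implementation. It relies only on Grafana's general URL conventions (a `/d/<uid>/<slug>` dashboard path, `var-` prefixed template variables, and a `from`/`to` time range in epoch milliseconds). The host name, dashboard UID, slug, and the variable names `jobid` and `cluster` are all assumptions made for the example.

```python
from urllib.parse import urlencode


def grafana_job_url(base_url, dashboard_uid, job_id, start_epoch_s, end_epoch_s, cluster):
    """Build a Grafana dashboard URL scoped to one job.

    Hypothetical layout: the slug and variable names are illustrative
    assumptions, not taken from Jobstats.
    """
    params = {
        "var-jobid": job_id,            # assumed dashboard variable name
        "var-cluster": cluster,         # assumed dashboard variable name
        "from": start_epoch_s * 1000,   # Grafana time range uses epoch milliseconds
        "to": end_epoch_s * 1000,
    }
    return f"{base_url.rstrip('/')}/d/{dashboard_uid}/job-stats?{urlencode(params)}"


# Example with placeholder values: a one-hour job on a cluster named "mycluster".
url = grafana_job_url(
    "https://grafana.example.edu", "abc123", 1234567,
    1700000000, 1700003600, "mycluster",
)
print(url)
```

Restricting the time range to the job's start and end times means the dashboard opens directly on the interval of interest rather than on a default window.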

The combination of the OOD Helper App, the Grafana interface, and the command-line tools makes it easy for users and system administrators to inspect the utilization and resource usage of individual jobs. The Jobstats platform was released in 2023. It is being used by tens of institutions throughout the world, including Princeton, Brown, and Yale.

Jonathan Halverson is the Training Lead for the Princeton Institute of Computational Science and Engineering at Princeton University.
