03-18, 14:30–14:55 (US/Eastern), Belfer Case Study Room (CGIS S020)
Adopting a new software platform that has the power to change the way teaching and research are done is rarely easy. In this talk, we explain how Princeton University Research Computing got started with Open OnDemand (OOD), what we contributed, and what we learned. Hear about our problems, solutions, and continued pain points.
OOD at Princeton started over 8 years ago as an attempt to make teaching easier. It quickly became an essential way not only to teach but also to facilitate research, for expert users and especially for beginners. In 2023, on our largest general-purpose HPC cluster, 41% of users ran at least one job through OOD.
The journey to that point involved a lot of learning about OOD but also about the needs of our user community. We now have Jupyter OOD apps that offer numerous different setups (e.g., per course, per Anaconda version) and that largely auto-detect user environments. Remote desktop via OOD uses our dedicated visualization nodes and a systemd-based scheduler that we contributed. MATLAB, Mathematica, RStudio, Stata and other apps are also available. File quota reports are available through a web interface as well.
To improve utilization of our clusters and help users optimize their jobs, we developed Jobstats, a job monitoring platform that includes OOD helper apps. One app takes a job ID and generates a URL to a Grafana dashboard that shows detailed job metrics as a function of time. These metrics include CPU/GPU utilization and CPU/GPU memory usage as well as useful node-level metrics. The dashboard is helpful for understanding why a job failed and for troubleshooting system performance issues.
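As a rough illustration of the job-ID-to-dashboard idea, the sketch below builds a Grafana link for a single job. It is not the Jobstats implementation: the host, dashboard path, and the `jobid` and `cluster` variable names are hypothetical placeholders. The only Grafana conventions relied on are the `from`/`to` query parameters (epoch milliseconds) and the `var-<name>` form for dashboard variables.

```python
from urllib.parse import urlencode

# Hypothetical values: the abstract does not give the real host or dashboard path.
GRAFANA_BASE = "https://grafana.example.edu"
DASHBOARD_PATH = "/d/jobstats/single-job-stats"


def job_dashboard_url(job_id, cluster, start_epoch_s, end_epoch_s, pad_s=60):
    """Return a dashboard URL covering the job's run time plus a small margin.

    start_epoch_s and end_epoch_s are Unix timestamps in seconds, for example
    as reported by the scheduler's accounting records.
    """
    params = {
        # Assumed dashboard variable names; a real dashboard may use others.
        "var-jobid": job_id,
        "var-cluster": cluster,
        # Grafana expects absolute time ranges in epoch milliseconds.
        "from": (start_epoch_s - pad_s) * 1000,
        "to": (end_epoch_s + pad_s) * 1000,
    }
    return f"{GRAFANA_BASE}{DASHBOARD_PATH}?{urlencode(params)}"


if __name__ == "__main__":
    print(job_dashboard_url(12345, "cluster1", 1700000000, 1700003600))
```

Padding the time window slightly on both sides keeps the first and last samples of the job visible on the dashboard.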
To meet the recent demand for GPUs for classes and training workshops, we have configured our local OOD implementation to burst to the cloud. We will discuss the various unexpected problems that arose while creating this service as well as their solutions. Feedback from instructors and students will be presented.
While OOD works well for the majority of use cases, a certain set of issues arises continually. For instance, we receive a steady stream of support tickets from users whose sessions will not start. While support staff can resolve these cases, more can be done within OOD to identify the problem and present the solution to the user in an understandable way. Other recurring issues and our efforts to address them will be presented.
Jonathan Halverson is the Training Lead for the Princeton Institute of Computational Science and Engineering at Princeton University.