GOOD 2025

Scaling up OOD - a sysadmin's perspective
03-18, 16:30–16:55 (US/Eastern), Tsai Auditorium (CGIS S010)

A walk-through of concrete issues and performance improvements
that CSC has identified while approaching a scale of hundreds of concurrent
users on Open OnDemand. The aim is to inform system administrators and developers
who operate an OOD instance about potential pitfalls and about quirks
that only become visible at a larger scale.

This technical talk will consider site-specific issues, code-specific issues in the OOD
upstream, as well as architectural impacts of using Passenger.
A general understanding of the OOD architecture, Passenger's role in it, and
Linux systems programming is beneficial.


CSC - IT Center for Science Ltd. currently hosts three supercomputers:
LUMI (#8 on the 11/2024 Top500 list) and two national supercomputers.
CSC has been using Open OnDemand since 2021, and has since deployed OOD on
all three supercomputers. The use of OOD has evolved over the years, and it
has become an integral part of our service offering for users of the HPC clusters.

The popularity of OOD among users has grown, and this has caused some interesting
performance issues. In November of 2024, the busiest of our national supercomputers,
Puhti, had around 1500 unique users logging into OOD. It is not uncommon to have
up to 200 concurrent users on Puhti's web interface, and at times there have been
issues with high system load on the web server.

This talk will go over a few major performance improvements that CSC has found
helpful in the past, as well as some future improvements that have been
identified. The talk will cover Passenger quirks as part of OOD's architecture,
the importance of running nginx_clean, as well as the impact of a fast
user-mapper executable.
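
As a rough illustration of why the speed of a user-mapper executable can matter, the sketch below times repeated invocations of an arbitrary command. It is not part of OOD and not CSC's tooling: the function name time_command, the run count, and the default command are assumptions made for this example. It only measures process start-up plus run time as seen from the caller, which is the cost that adds up if the mapper is invoked often.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=20):
    """Run cmd `runs` times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded: only the time to spawn and finish the process matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical default: pass the site's mapper path and a test user name
    # as arguments instead, e.g. `python time_mapper.py /path/to/mapper someuser`.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    d = time_command(cmd)
    print(f"median {statistics.median(d) * 1000:.1f} ms, "
          f"max {max(d) * 1000:.1f} ms over {len(d)} runs")
```

Comparing the median for an interpreted script against a compiled binary or a shell built-in gives a first impression of how much per-invocation overhead the mapper adds.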

The information will be quite technical in nature and is aimed towards operators
of an OOD instance, e.g., system administrators and developers.
The talk will include concrete suggestions for administrators on how to configure
their OOD instances to avoid pitfalls when reaching a higher user count.

Simon is a System specialist at CSC - IT Center for Science Ltd. He has been participating in the roll-out and operations of Open OnDemand on three different supercomputers since 2021, including the pre-exascale system LUMI. CSC operates OOD via a containerized deployment, which creates interesting opportunities for testing and development for OOD. CSC is an active contributor to the OOD project, including code submissions, issue reporting, and testing.