Universities and HPC

Like some (or even many?) of you, I work at a university.

In my research group, we use the university HPC for most of our model simulations, and then use our own servers to archive and host the data, since it is difficult to get enough storage on the HPC machines we use. From there, we use the Pangeo stack of tools to analyze the model results.

We use the university HPC machines because they are free (for us), resources are plentiful for researchers with funded projects, and the High Performance Research Computation group at TAMU is fantastic. But I still wonder if there are changes I could make to improve the workflow. Keep the data on the cluster? Move the data to the cloud instead of local machines? Put all my data on a desktop RAID array?

I would like to start a discussion about this sort of workflow. I’m interested to hear what others do: what works and what doesn’t? To inspire some discussion, see this twitter thread (I laughed out loud at take number 2).


Great question Rob. We are in a very similar situation at Columbia, struggling with the same questions. We now have data spread between the HPC system, a big data server, and Pangeo cloud.

Assuming a lifetime of five years for the HPC cluster and the group server, our storage cost on each system is roughly as follows:

| Platform | Storage cost ($ / TB / year) |
| --- | --- |
| Group server | 15 |
| University HPC | 50 |
| Google Cloud Storage | 250 |
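For context, the group-server figure comes from simple amortization of the hardware cost. A quick sketch, with hypothetical purchase price and capacity (the numbers below are illustrative round figures, not an actual invoice):

```python
# Amortized storage cost in $/TB/year, assuming the hardware cost is
# spread evenly over its usable capacity and a five-year lifetime.
# The price and capacity in the example are hypothetical.

def cost_per_tb_year(purchase_price_usd, capacity_tb, lifetime_years=5):
    """Up-front hardware cost divided across capacity and lifetime."""
    return purchase_price_usd / (capacity_tb * lifetime_years)

# e.g. a ~$9,000 RAID server with 120 TB of usable capacity
print(cost_per_tb_year(9000, 120))  # -> 15.0
```

Note this ignores power, cooling, and sysadmin time, which is part of why cloud storage looks so much more expensive in a naive comparison.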

The group server is by far the cheapest, which is why these are so common. However, we also found that our productivity could be quite limited on the group server. When you have 10 group members trying to churn through many TB of data all at the same time, the group server can just grind to a halt.

I believe that the university HPC represents a pretty good tradeoff. We can do data processing in parallel, with a fast filesystem, at about 1/5 of the cost per TB of cloud storage. This is especially convenient if you are already running models on the HPC and so already have data there.

There are many things that can be done in order to make the HPC server more “cloud like” and thus support Pangeo workflows better. Matt Rocklin has a great post on this topic here:
http://matthewrocklin.com/blog/work/2019/10/01/stay-on-hpc
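One concrete step in that direction is dask-jobqueue, which lets Dask request workers through the batch scheduler so interactive Pangeo sessions can burst onto the cluster. A minimal `~/.config/dask/jobqueue.yaml` might look like the following; every value is a site-specific placeholder to adapt to your own center:

```yaml
# Hypothetical dask-jobqueue configuration for a SLURM-based cluster.
# Queue name, core counts, memory, and walltime are all placeholders.
jobqueue:
  slurm:
    queue: short            # a partition for short jobs, if your site has one
    cores: 8                # cores per batch job
    processes: 4            # Dask worker processes per job
    memory: 24GB            # memory per batch job
    walltime: "01:00:00"    # short walltime for fast queue turnover
    interface: ib0          # fast network interface, if available
    local-directory: $TMPDIR
```

With this in place, creating a `dask_jobqueue.SLURMCluster()` and calling `cluster.scale()` gives an elastic pool of Dask workers inside the batch system, which is about as close to cloud-style elasticity as a traditional scheduler gets.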

The key to achieving this is to change the culture of the HPC center and get them to value the things we value in Pangeo, like web-based access, containerization, elasticity, etc. That has not been easy in my experience. Even large HPC centers are conservative. The smaller centers are even more conservative, because they don’t have the expertise to tweak their configurations, and don’t have a mandate to push the envelope on HPC technology.

Maybe some others can share their experience running Pangeo on university HPC systems.

Matt Rocklin’s blog post, and prior discussions at SciPy, were the things that got me thinking about this issue in the first place, but I forgot to link it in my post. Thanks Ryan. That should be required reading for anyone using university HPC systems.

As an update, we met with some folks from our HPC center today, and showed them our workflow. We also sent them a link to Anderson Banihirwe’s SciPy 2019 talk on making HPC clusters more on-demand, and therefore more cloud-like. They are very interested in doing this. It will be interesting to see if this is a trend.


That’s a very interesting discussion!

From the CNES (French space agency) computing center perspective, we’ve already taken the path of mixing traditional HPC workloads with more interactive and data-driven ones. Actually, we’ve got a lot more data-driven pipelines than big MPI simulations, mainly because one of our jobs is to produce usable data from our satellites’ instruments. So our current platform is designed for this in two ways:

  • Big focus on the storage component: huge available volume and bandwidth compared to the amount of processing power, or at least this is how I feel when I compare the ratio between compute and storage with other HPC platforms. Users are encouraged to leave the produced data on this storage to do science on it.
  • Don’t focus on total cluster occupancy, but instead try to ease the interactive usage of resources, as recommended by Pangeo.

We’re planning to provide, in the next few years, a datalake kind of storage solution, probably something object-store-like, with data indexing and cataloging, that would be accessible from our HPC cluster (and beyond?) and would host the data at a lower cost than traditional HPC solutions like GPFS.

But from what I can see in the French academic HPC ecosystem, we are kind of lucky in this regard (see https://medium.com/pangeo/the-2019-european-pangeo-meeting-437170263123). This is probably because we’re an organization that is not fully academic. Other centers I know have to justify their usefulness by providing occupancy statistics, for instance.

But I also see things slowly changing. The word “data” is on everyone’s lips now, and historical centers are more and more involved in exploring different ways of doing things. I’m participating in an event gathering French academic HPC actors this week, and on the first day I heard the word “datalake” in three different talks already. Once it is accepted that HPC centers have to provide a means to store and ease access to the produced data, the need to provide tools to efficiently analyse it will impose itself too.


I really enjoyed the points you brought up. I learned the word “datalake”, and was particularly interested in the European Pangeo meeting notes. I suspect that if we can make HPC systems behave in a more on-demand way, with short but intense bursts of computation, the motivation to move to the cloud will be greatly reduced for many academic researchers. It seems your HPC group is doing this, so it could be a good model to emulate. Anything you can share on this point would give our local administrators a goal to work toward.

As a side point, I was also very interested in the zarr vs. netCDF discussion in the European Pangeo meeting notes. I have found very similar things (it’s hard to get performant netCDF), and would like some more advice on getting performant zarr (though it seems like there is less room to wiggle here). We should probably make a new topic to discuss this.
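For what it’s worth, a lot of zarr tuning seems to reduce to chunk-size arithmetic: a common rule of thumb is to aim for chunks of roughly 10–100 MB, big enough to amortize per-chunk overhead but small enough to fit comfortably in worker memory. A back-of-the-envelope helper (pure Python, no zarr required; the example array shape is made up):

```python
# Back-of-the-envelope chunk sizing for a zarr (or chunked netCDF) array.
# Rule of thumb: uncompressed chunks of ~10-100 MB usually perform well.
# The example shape below is hypothetical.

def chunk_size_mb(chunk_shape, dtype_bytes=8):
    """Uncompressed size of one chunk in megabytes (default float64)."""
    n = 1
    for dim in chunk_shape:
        n *= dim
    return n * dtype_bytes / 1e6

# e.g. daily 1/4-degree global fields chunked as (time=10, lat=720, lon=1440)
print(chunk_size_mb((10, 720, 1440)))  # -> 82.944, i.e. within the sweet spot
```

The same arithmetic applies to netCDF chunking; zarr just makes the chunk layout explicit and easy to rewrite.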

There are probably several things to change to achieve the goal of interactive HPC:

  • Accept to have a lower cluster occupancy, which means accepting that cluster usage is not the only metric to take into account when considering investments.
  • Configure your queuing system for short interactive (or not) jobs. In our case, we have more than a third of our resources dedicated to jobs that last less than 1 hour. This allows fast turnover of this computing power.
  • Provide tooling that allows bursting like Dask.
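On the scheduler side, the second point can be as simple as a dedicated partition with a hard time limit, so short jobs always have somewhere to land quickly. A hypothetical `slurm.conf` fragment (partition name, node list, and limits are placeholders, not an actual site configuration):

```
# Hypothetical slurm.conf fragment: a partition reserved for short jobs.
# Node names and limits are placeholders to adapt to your own cluster.
PartitionName=interactive Nodes=node[001-050] MaxTime=01:00:00 Default=NO State=UP
```

Capping `MaxTime` at one hour is what makes the fast turnover possible: the scheduler knows every job in the partition will free its resources soon, so queue waits stay short.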

Again, I think we are lucky enough at CNES to have enough power to satisfy both interactive and non-interactive demand. We also have a majority of use cases that are compatible with that: I would say that only 10 to 20% of our jobs are operational-like, needing to produce results in a given time slot and using a significant amount of resources. The rest is either development usage (maybe 60%), so often interactive or close to it (waiting a few hours for results), or long-running production with no tight schedule (so they don’t need high priority and can run at night).

Ultimately, we should share things in https://github.com/pangeo-data/pangeo-for-hpc, but I’m not sure what I said above is worth it…

We can make a new topic, or discuss this in https://github.com/pangeo-data/benchmarking.