Like some (or even many?) of you, I work at a university.
In my research group, we use the university HPC for most of our model simulations, and then use our own servers to archive and host the data, since it is difficult to get enough storage on the HPC machines we use. From there, we use the Pangeo stack of tools to analyze the model results.
We use the university HPC machines because they are free (for us), resources are plentiful for researchers with funded projects, and the High Performance Research Computing group at TAMU is fantastic. But I still wonder if there are changes I could make to improve the workflow. Keep the data on the cluster? Move the data to the cloud instead of local machines? Put all my data on a desktop RAID array?
I would like to start a discussion about this sort of workflow. I'm interested to hear what others do: what works and what doesn't? To inspire some discussion, see this twitter thread (I laughed out loud at take number 2).
Great question Rob. We are in a very similar situation at Columbia, struggling with the same questions. We now have data spread between the HPC system, a big data server, and Pangeo cloud.
Assuming a lifetime of five years for the HPC cluster and the group server, our storage cost on each system is roughly as follows:
| platform | storage cost ($ / TB / year) |
|---|---|
| Group Server | 15 |
| University HPC | 50 |
| Google Cloud Storage | 250 |
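For anyone who wants to redo this comparison with their own numbers, the arithmetic is just simple amortization. Here is a small Python sketch; the prices and capacities in it are illustrative placeholders, not our actual invoices:

```python
# Rough sketch of the amortization behind the $/TB/year comparison above.
# The hardware prices and capacities below are placeholders, not real figures.

def owned_storage_cost(purchase_price_usd, capacity_tb, lifetime_years):
    """Amortized cost of purchased storage, in $ / TB / year."""
    return purchase_price_usd / (capacity_tb * lifetime_years)

def cloud_storage_cost(price_per_gb_month_usd):
    """Object-storage list price converted to $ / TB / year."""
    return price_per_gb_month_usd * 1000 * 12

# e.g. a 100 TB group server bought for ~$7,500, amortized over 5 years
print(owned_storage_cost(7500, 100, 5))   # ~15 $/TB/year
# e.g. standard-class object storage at ~$0.02 / GB / month
print(cloud_storage_cost(0.02))           # ~240 $/TB/year
```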
The group server is by far the cheapest, which is why these are so common. However, we also found that our productivity could be quite limited on the group server. When you have 10 group members trying to churn through many TB of data all at the same time, the group server can just grind to a halt.
I believe that the university HPC represents a pretty good tradeoff. We can do data processing in parallel, with a fast filesystem, at about 1/5 the cost per TB of cloud storage. This is especially convenient if you are already running models on the HPC and therefore have data there already.
There are many things that can be done to make an HPC system more "cloud-like" and thus better support Pangeo workflows. Matt Rocklin has a great post on this topic here:
http://matthewrocklin.com/blog/work/2019/10/01/stay-on-hpc
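One concrete building block here is dask-jobqueue, which submits Dask workers as ordinary batch jobs. A minimal sketch follows; the scheduler type, queue name, and resource requests are site-specific placeholders:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # requires the dask-jobqueue package

# Queue name, walltime, and resource requests are placeholders; adjust per site.
cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    walltime="01:00:00",
    queue="short",
)
cluster.scale(jobs=4)  # submit 4 batch jobs, each running Dask workers

client = Client(cluster)
# From here, xarray/dask computations run on the batch-scheduled workers.
```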
Beyond tooling, though, the key to achieving this is to change the culture of the HPC center and get them to value the things we value in Pangeo, like web-based access, containerization, elasticity, etc. That has not been easy in my experience. Even large HPC centers are conservative. The smaller centers are even more conservative, because they don't have the expertise to tweak their configurations, and don't have a mandate to push the envelope on HPC technology.
Maybe some others can share their experience running Pangeo on university HPC systems.
Matt Rocklin's blog post, and prior discussions at SciPy, were what got me thinking about this issue in the first place, but I forgot to link it in my post. Thanks Ryan. That post should be required reading for anyone using university HPC systems.
As an update, we met with some folks from our HPC center today, and showed them our workflow. We also sent them a link to Anderson Banihirwe's SciPy 2019 talk on making HPC clusters more on-demand, and therefore more cloud-like. They are very interested in doing this. It will be interesting to see if this is a trend.
That's a very interesting discussion!
From the CNES (French space agency) computing center perspective, we've already taken the path of mixing traditional HPC workloads with more interactive and data-driven ones. Actually, we have a lot more data-driven pipelines than big MPI simulations, mainly because one of our jobs is to produce usable data from our satellites' instruments. So our current platform is designed for this in two ways:
- A big focus on the storage component: huge available volume and bandwidth compared to the amount of processing power, or at least that is how I feel when I compare the compute-to-storage ratio with other HPC platforms. Users are encouraged to leave the produced data on this storage and do science on it there.
- Don't focus on total cluster occupancy, but instead try to ease the interactive usage of resources, as recommended by Pangeo.
We're planning to provide, in the next few years, a datalake kind of storage solution, probably an object-store-like one, with data indexing and cataloging, that would be accessible from our HPC cluster (and beyond?) and host the data at a lower cost than traditional HPC solutions like GPFS.
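If we do go the object-store route, the hope is that users could open datasets directly from the cluster with something like the sketch below; the bucket name, endpoint, and Zarr layout are purely hypothetical:

```python
import fsspec
import xarray as xr

# Hypothetical S3-compatible datalake endpoint and bucket (placeholders only).
store = fsspec.get_mapper(
    "s3://datalake-bucket/some-product.zarr",
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
    anon=False,
)

# Lazy, chunked open; the actual computation can then be farmed out to Dask workers.
ds = xr.open_zarr(store, consolidated=True)
print(ds)
```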
But from what I can see in the French academic HPC ecosystem, we are kind of lucky in this respect (see https://medium.com/pangeo/the-2019-european-pangeo-meeting-437170263123). This is probably because we're an organization that is not fully academic. Other centers I know have to justify their usefulness by providing occupancy statistics, for instance.
But I also see things slowly changing. The word "data" is on everyone's lips now, and historical centers are more and more involved in exploring different ways of doing things. I'm participating in an event gathering French academic HPC actors this week, and on the first day I already heard the word "datalake" in three different talks. Once it is accepted that HPC centers have to provide a means to store and ease access to the produced data, the need to provide tools to efficiently analyze it will impose itself too.
I really enjoyed the points you brought up. I learned the word "datalake", and was particularly interested in the European Pangeo meeting notes. I suspect that if we can make HPC systems behave in a more on-demand way, with short but intense bursts of computation, the motivation to move to the cloud will be greatly reduced for many academic researchers. It seems your HPC group is doing this, so it could be a good model to emulate. Anything you can share on this point would give our local administrators a goal to work toward.
As a side point, I was also very interested in the Zarr vs. netCDF discussion in the European Pangeo meeting notes. I have found very similar things (it's hard to get performant netCDF), and would like some more advice on getting performant Zarr (though it seems like there is less to tweak here). We should probably make a new topic to discuss this.
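In the meantime, for reference, what I have been doing so far looks roughly like the sketch below; the chunk sizes and store path are placeholders, and better choices are exactly what I'd like advice on:

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for model output; real data would come from
# something like xr.open_mfdataset("output_*.nc", chunks={...}).
ds = xr.Dataset(
    {"temp": (("time", "y", "x"), np.random.rand(365, 500, 500))},
)

# Chunk sizes are placeholders; the usual advice is chunks on the order of ~100 MB.
ds = ds.chunk({"time": 100, "y": 250, "x": 250})

# Consolidated metadata avoids many small reads when the store is opened later.
ds.to_zarr("model_output.zarr", mode="w", consolidated=True)
```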
There are probably several things to change to achieve the goal of interactive HPC:
- Accept a lower cluster occupancy, which means accepting that cluster usage is not the only metric to take into account when considering investments.
- Configure your queuing system for short interactive (or not) jobs. In our case, more than a third of our resources are dedicated to jobs that last less than 1 hour. This allows fast turnover of this computing power.
- Provide tooling that allows bursting, like Dask (see the sketch below).
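On that last point, here is a rough sketch of what bursting with Dask looks like from the user side, assuming dask-jobqueue sits on top of the batch scheduler; the scheduler class, queue name, and resource sizes are placeholders:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # swap in PBSCluster, etc. for your scheduler

# Queue name and resource requests are placeholders.
cluster = SLURMCluster(cores=24, memory="100GB", queue="interactive", walltime="01:00:00")

# Adaptive scaling: acquire workers while computations are queued and release
# the underlying batch jobs when the session goes idle.
cluster.adapt(minimum=0, maximum=200)

client = Client(cluster)
```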
Again, I think we are lucky enough at CNES to have enough capacity to satisfy both interactive and non-interactive demand. We also have a majority of use cases that are compatible with that: I would say that only 10 to 20% of our jobs are operational-like, needing to produce results in a given time slot and using a significant amount of resources. The rest is either development usage (maybe 60%), so often interactive or close to it (waiting a few hours for results), or long-running production with no tight schedule (so it doesn't need high priority and can run at night).
Ultimately, we should share things in GitHub - pangeo-data/pangeo-for-hpc: Instructions and boilerplate for running Pangeo on HPC platforms, but I'm not sure what I said above is worth it...
We can make a new topic, or discuss this in GitHub - pangeo-data/benchmarking: Benchmarking & Scaling Studies of the Pangeo Platform.
@rabernat : just stumbled on your October 2019 reply re: platform storage costs.
IMO, understanding the true costs remains a vital piece of information for strategic decisions.
Would you say these breakdowns remain the same / similar 18 months later? Is your basic perspective on these cost breakdowns and approaches the same here in 2021?
Thanks for your valuable, informed views.