Best practices to go from 1000s of NetCDF files to analyses on an HPC cluster?

Hi Shane,

Hopefully this is helpful. I’m also on the journey of understanding how to make things work better myself, and I’m thankful to have gleaned knowledge from others.

We generate and work with 6D (3D + ensemble + start date + lead time) forecasts and 5D (3D + ensemble + time) coupled ocean-atmosphere output, where single variables can be ~5 TB+, complete 6D forecast experiments are in the 50+ TB range, and 5D reanalyses are in the multi-100 TB range. :boom:

We went on a journey to figure out how to deal with this data tsunami. Luckily we found the Pangeo community and have been making progress with lots of help. In our team I work with @dougiesquire and we’ve taken what we’ve learned from our Pangeo colleagues to build an HPC workflow internal to our organisation.

Here are the main dot-points from my POV.

  1. converting from NetCDF to Zarr has really enabled our work (a minimal conversion sketch follows this list)
  2. chunking (and re-chunking depending on your calculation) matters
  3. there are still hurdles I don’t fully understand wrt horizontal scaling in some calculations
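To make point 1 (and some of point 2) concrete, here’s a minimal sketch of the kind of NetCDF-to-Zarr conversion I mean, assuming an xarray + Dask + Zarr stack. The file pattern, dimension names and chunk sizes are placeholders, not our real setup.

```python
import xarray as xr

# Lazily open the many NetCDF files as one Dask-backed dataset.
# The glob pattern and dimension names are placeholders.
ds = xr.open_mfdataset(
    "/path/to/experiment/*.nc",
    combine="by_coords",
    parallel=True,   # open files concurrently with Dask
)

# Pick chunks that suit the analysis you plan to do (point 2).
ds = ds.chunk({"time": 100, "lat": 180, "lon": 360})

# Write a single consolidated Zarr store; all later analyses read from this.
ds.to_zarr("/scratch/experiment.zarr", mode="w", consolidated=True)
```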

I’ve mainly taken my directions and absorbed knowledge from folks like @rabernat, @dougiesquire, @andersy005, @TomAugspurger, @mrocklin, @jhamman, @kaedonkers, @pbranson and others in the community.

WRT #3, we’ve had issues with horizontal scaling on our org’s new super-fast parallel storage. My very ignorant POV is that it has something fundamental to do with how Dask bundles up tasks and the timing at which workers communicate with each other. I think it may be related to https://github.com/dask/distributed/issues/2602.

I also note this image from a recent @andersy005 presentation: https://pbs.twimg.com/media/EWs6LCXUYAEqbZc?format=jpg&name=large

Our current workaround is vertical scaling: we grab the largest node we can get and run it as a single worker, with as many threads and as much RAM as that node provides. :woman_shrugging:t4:
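Concretely, that looks something like the sketch below, assuming a `dask.distributed` `LocalCluster` on the big node; the thread count and memory limit are made-up examples, not our actual node spec.

```python
from dask.distributed import Client, LocalCluster

# One fat worker on one fat node: give it all the threads and RAM available.
# The numbers below are illustrative, not our actual node spec.
cluster = LocalCluster(n_workers=1, threads_per_worker=48, memory_limit="190GB")
client = Client(cluster)
print(client)
```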

To sum up this ramble - from my POV:

  1. work on converting NetCDF to Zarr
  2. chunking (and re-chunking depending on the situation) matters for getting your calculations to finish quickly, or at all (see the re-chunking sketch after this list)
  3. there may be some underlying software engineering challenges that could be holding you/us back
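As a hypothetical example of “re-chunking depending on the situation”: re-writing a store so that time is contiguous for per-gridpoint calculations. Dimension names and chunk sizes are again placeholders.

```python
import xarray as xr

# Re-open the Zarr store lazily and re-chunk for a different access pattern,
# e.g. contiguous in time for per-gridpoint statistics. Sizes are placeholders.
ds = xr.open_zarr("/scratch/experiment.zarr")
ds = ds.chunk({"time": -1, "lat": 10, "lon": 10})  # -1 means one chunk over all of time

# Drop the chunk encoding inherited from the source store so that to_zarr
# writes using the new Dask chunks rather than the old on-disk chunks.
for name in ds.variables:
    ds[name].encoding.pop("chunks", None)

ds.to_zarr("/scratch/experiment_rechunked.zarr", mode="w", consolidated=True)
```

For huge all-to-all re-chunks this naive open-and-rewrite approach can struggle; the Pangeo `rechunker` package does the same job with bounded memory via an intermediate store.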

I very much welcome being corrected on any of this, BTW.
