Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?

Thomas_Moore · May 2, 2020, 10:57pm

Hi Shane,

Hopefully this can be helpful? I am also on the journey of understanding how to make things work better myself and I’m thankful to have gleaned knowledge from others.

We generate and work with 6D (3D + ensemble + start date + lead time) forecasts and 5D (3D + ensemble + time) coupled ocean-atmosphere output, where single variables can be ~5 TB or more, complete 6D forecast experiments are on the order of 50+ TB, and 5D reanalyses are in the multi-100 TB range.

We went on a journey to figure out how to deal with this data tsunami. Luckily we found the Pangeo community and have been making progress with lots of help. In our team I work with @dougiesquire and we’ve taken what we’ve learned from our Pangeo colleagues to build an HPC workflow internal to our organisation.

Here are the main dot-points from my POV.

1. conversion from NetCDF to zarr has really enabled our work
2. chunking (and re-chunking depending on your calculation) matters
3. there are still hurdles I don’t fully understand wrt horizontal scaling in some calculations
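To make the chunking point a bit more concrete, here is a rough back-of-the-envelope sketch in plain Python. The array shape, the two chunk layouts and the helper names (`chunk_stats`, `chunks_touched`) are all made up for illustration and are not our actual data or code. It only counts chunks and bytes, but it shows why a layout that suits maps can be painful for time series at a point.

```python
from math import ceil, prod

# Hypothetical 5D variable (ensemble, time, depth, lat, lon) stored as float32.
# These sizes are illustrative assumptions, not a real dataset.
SHAPE = {"ensemble": 10, "time": 3650, "depth": 50, "lat": 300, "lon": 360}
ITEMSIZE = 4  # bytes per float32 value

# Two hypothetical chunk layouts for the same variable.
MAP_CHUNKS = {"ensemble": 1, "time": 1, "depth": 50, "lat": 300, "lon": 360}
SERIES_CHUNKS = {"ensemble": 1, "time": 3650, "depth": 1, "lat": 30, "lon": 36}


def chunk_stats(shape, chunks, itemsize=ITEMSIZE):
    """Return (number of chunks, bytes per full chunk) for a chunk layout."""
    n_chunks = prod(ceil(shape[d] / chunks[d]) for d in shape)
    chunk_bytes = prod(min(chunks[d], shape[d]) for d in shape) * itemsize
    return n_chunks, chunk_bytes


def chunks_touched(chunks, selection):
    """Count chunks read for a contiguous selection of the given lengths.

    Assumes the selection starts on a chunk boundary in every dimension.
    """
    return prod(ceil(selection[d] / chunks[d]) for d in selection)


# A full time series at one grid point, one depth level, one ensemble member.
POINT_SERIES = {"ensemble": 1, "time": 3650, "depth": 1, "lat": 1, "lon": 1}

for name, layout in [("map", MAP_CHUNKS), ("series", SERIES_CHUNKS)]:
    n_chunks, chunk_bytes = chunk_stats(SHAPE, layout)
    touched = chunks_touched(layout, POINT_SERIES)
    print(
        f"{name}: {n_chunks} chunks of {chunk_bytes / 1e6:.1f} MB, "
        f"point time series reads {touched} chunks "
        f"({touched * chunk_bytes / 1e9:.2f} GB)"
    )
```

With these made-up numbers, the map-friendly layout has to read every time step's chunk to build one point time series, while the series-friendly layout reads a single chunk. That is the kind of mismatch that re-chunking is meant to fix.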

I’ve mainly taken my directions and absorbed knowledge from folks like @rabernat, @dougiesquire, @andersy005, @TomAugspurger, @mrocklin, @jhamman, @kaedonkers, @pbranson and others in the community.

WRT #3 we’ve had issues with horizontal scaling using our org’s new super-fast parallel storage. My very ignorant POV is that it has something fundamental to do with how Dask bundles up tasks and the timing at which the workers communicate with each other? I think it’s related to:
https://github.com/dask/distributed/issues/2602 ?

I also note this image from a recent @andersy005 presentation: https://pbs.twimg.com/media/EWs6LCXUYAEqbZc?format=jpg&name=large

Our current solution is vertical scaling using the largest node we can get as a single worker with as many threads & RAM as the largest node can provide.
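As a minimal sketch of what "one big worker" could look like in settings terms: the helper below (`single_worker_kwargs`, a hypothetical name) just builds a dictionary of keyword arguments for one worker that uses all cores and most of the RAM on a node. The 0.9 memory headroom factor and the example node size are my assumptions, not something we have tuned.

```python
def single_worker_kwargs(n_cores, total_memory_gb, memory_fraction=0.9):
    """Build settings for a single multi-threaded worker on one large node.

    memory_fraction is an assumed headroom factor that leaves some RAM
    for the operating system and anything else running on the node.
    """
    usable_gb = int(total_memory_gb * memory_fraction)
    return {
        "n_workers": 1,                    # vertical scaling: one worker only
        "threads_per_worker": n_cores,     # one thread per core on the node
        "memory_limit": f"{usable_gb}GB",  # most, but not all, of the node's RAM
    }


# Hypothetical node with 48 cores and 192 GB of RAM.
kwargs = single_worker_kwargs(n_cores=48, total_memory_gb=192)
print(kwargs)
```

These keys match the `n_workers`, `threads_per_worker` and `memory_limit` arguments of `dask.distributed.LocalCluster`, so something like `LocalCluster(**kwargs)` would be one way to use them on a single node.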

To sum up this ramble - from my POV:

1. work on converting NetCDF to zarr
2. chunking (and re-chunking depending on situation) matters to get your calculations to finish quickly or at all
3. there may be some underlying software engineering challenges that could be holding you/us back

Very much welcome being corrected on my point of view, BTW
