Best practices to go from 1000s of NetCDF files to analyses on an HPC cluster?

Thank you @aimeeb @Thomas_Moore @cgentemann for connecting me and for taking the time to share your experience and recommendations. I have a lot to digest, but perhaps a few clarifications are needed:

  1. What does vertical scaling mean in your respective cases?
  2. About zarr: would you create a single zarr array/store for the model's global dataset, or should I divide it up into geographical regions? From what you are describing, @aimeeb, it seems that you create a single array and append to it along the time dimension (a minimal sketch of what I think you mean follows below this list)? In the end I want to analyze/use my entire global model dataset.
  3. Right now I am lazily loading the netcdf files to form an xarray dataset, which I then convert to zarr with the xarray to_zarr method, setting chunks in whatever way I think should work (a sketch of my current workflow also follows below). With my probably-not-adequate cluster configuration, creating a zarr store for 1/20th of my model's geographical area has now been running for 8 hours.
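
To make sure I understand the append workflow from point 2, here is a minimal sketch of what I think you are describing, @aimeeb. The file names, store path, and chunk sizes are placeholders I made up, not your actual setup:

```python
import xarray as xr

store = "model_global.zarr"  # placeholder path for the single global store

# Initialize the store from the first file...
ds0 = xr.open_dataset("model_output_0000.nc", chunks={"time": 1})
ds0.to_zarr(store, mode="w")

# ...then append each subsequent file along the time dimension.
for path in ["model_output_0001.nc", "model_output_0002.nc"]:
    ds = xr.open_dataset(path, chunks={"time": 1})
    ds.to_zarr(store, append_dim="time")
```

Is that roughly it? And would you still keep the full global extent in that one store rather than one store per region?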
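For context on point 3, this is roughly the pattern I am running now. The dask cluster settings, file glob, dimension names, and chunk sizes below are illustrative, not my exact configuration:

```python
import xarray as xr
from dask.distributed import Client

# A local dask cluster as a stand-in for my HPC cluster configuration.
client = Client(n_workers=8, threads_per_worker=1, memory_limit="8GB")

# Lazily open the thousands of netcdf files as one dataset.
ds = xr.open_mfdataset(
    "model_output_*.nc",
    combine="by_coords",
    parallel=True,  # open the files in parallel on the workers
)

# Rechunk before writing; I suspect my chunk choices here are
# part of why the write is so slow.
ds = ds.chunk({"time": 50, "lat": 500, "lon": 500})

# Write everything out as a single zarr store.
ds.to_zarr("model_global.zarr", mode="w")
```

Does anything in that pattern stand out as an obvious mistake?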