Best practices to go from 1000s of NetCDF files to analyses on an HPC cluster?

Thank you @aimeeb @Thomas_Moore @cgentemann for connecting me and for taking the time to share your experience and recommendations. I have a lot to digest, but perhaps a few clarifications are needed:

  1. What does vertical scaling mean in your respective cases?
  2. About zarr: would you create a single zarr array/store for the model's global dataset, or should I divide it up into geographical regions? From what you are describing, @aimeeb, it seems that you create a single array and append to it along the time dimension (a minimal sketch of what I think you mean follows below this list)? In the end I want to analyze/use my entire global model dataset.
  3. Right now I am lazily loading the netcdf files to form an xarray dataset, which I then convert to zarr with the xarray to_zarr method, setting chunks in whatever way I think should work (a sketch of my current workflow also follows below). With my probably-not-adequate cluster configuration, creating a zarr store for 1/20th of my model's geographical area has now been running for 8 hours.
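
To make sure I understand the append workflow from point 2, here is a minimal sketch of what I think you are describing, @aimeeb. The file names, store path, and chunk sizes are placeholders I made up, not your actual setup:

```python
import xarray as xr

store = "model_global.zarr"  # placeholder path for the single global store

# Initialize the store from the first file...
ds0 = xr.open_dataset("model_output_0000.nc", chunks={"time": 1})
ds0.to_zarr(store, mode="w")

# ...then append each subsequent file along the time dimension.
for path in ["model_output_0001.nc", "model_output_0002.nc"]:
    ds = xr.open_dataset(path, chunks={"time": 1})
    ds.to_zarr(store, append_dim="time")
```

Is that roughly it? And would you still keep the full global extent in that one store rather than one store per region?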
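For context on point 3, this is roughly the pattern I am running now. The dask cluster settings, file glob, dimension names, and chunk sizes below are illustrative, not my exact configuration:

```python
import xarray as xr
from dask.distributed import Client

# A local dask cluster as a stand-in for my HPC cluster configuration.
client = Client(n_workers=8, threads_per_worker=1, memory_limit="8GB")

# Lazily open the thousands of netcdf files as one dataset.
ds = xr.open_mfdataset(
    "model_output_*.nc",
    combine="by_coords",
    parallel=True,  # open the files in parallel on the workers
)

# Rechunk before writing; I suspect my chunk choices here are
# part of why the write is so slow.
ds = ds.chunk({"time": 50, "lat": 500, "lon": 500})

# Write everything out as a single zarr store.
ds.to_zarr("model_global.zarr", mode="w")
```

Does anything in that pattern stand out as an obvious mistake?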