I know Zarr performs better, especially for cloud workflows, but I have a few practical concerns.
I already have large archives of NetCDF/HDF files, and converting everything to Zarr will itself take a significant amount of time and resources. (I have used VirtualiZarr on the same data and it works great, taking much less time than converting to an actual Zarr store.)
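For context, this is roughly the VirtualiZarr workflow I mean (a minimal sketch; file names are placeholders and the exact signatures may differ between VirtualiZarr versions):

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# Each virtual dataset holds byte-range references into the original file,
# not a copy of the data, so this step is fast and cheap on storage.
files = ["day1.nc", "day2.nc"]  # placeholder paths
vds_list = [open_virtual_dataset(f) for f in files]

# Concatenate the references along time and serialize them as a kerchunk sidecar
combined = xr.concat(vds_list, dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("combined_refs.json", format="json")

# Later, the whole archive opens as if it were one Zarr store
fs = fsspec.filesystem("reference", fo="combined_refs.json")
ds = xr.open_dataset(fs.get_mapper(), engine="zarr", consolidated=False)
```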
So my main questions are:
- Is it better to convert each individual NetCDF/HDF file into a separate Zarr store, or should I stack everything into a single Zarr dataset?
- I tested a single consolidated Zarr store with around 2–3 TB of gridded geostationary satellite data, and it worked very fast (see the sketch after this list).
- But what happens at much larger scales, like petabytes? Will a single Zarr store still be scalable and efficient, or does it become a bottleneck?
- And what if the data is not on a single grid, i.e. swath data from LEO satellites such as CloudSat, GPM DPR, etc.? Then we cannot use standard xarray selection like `.sel` with slices (illustrated after this list).
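For the consolidated-store test above, "very fast" means reads like this stay lazy and only fetch the chunks the slice touches (store path and variable/coordinate names here are hypothetical):

```python
import xarray as xr

# Open the consolidated store lazily; only metadata is read up front
ds = xr.open_zarr("s3://my-bucket/geo_stationary.zarr", consolidated=True)  # placeholder path

# Label-based slicing touches only the chunks intersecting the selection
# (assumes ascending lat/lon coordinates)
subset = ds["brightness_temperature"].sel(
    time=slice("2024-01-01", "2024-01-07"),
    lat=slice(0, 30),
    lon=slice(60, 100),
)
result = subset.mean("time").compute()  # data is only fetched here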
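```

And to make the last point concrete: swath data carries 2-D latitude/longitude coordinates along scan/pixel dimensions, so label-based slicing on lat/lon has no index to work with, and you fall back to masking (a sketch; the file path and variable names are made up for illustration):

```python
import xarray as xr

ds = xr.open_dataset("gpm_dpr_granule.h5")  # placeholder granule with dims like (scan, ray)

# lat/lon are 2-D auxiliary coordinates, so there is no 1-D index to slice:
# ds.sel(lat=slice(0, 30))  # fails; 'lat' is not a dimension coordinate

# Instead you mask on the 2-D coordinates, which reads every chunk the mask touches
box = ds.where(
    (ds["Latitude"] > 0) & (ds["Latitude"] < 30)
    & (ds["Longitude"] > 60) & (ds["Longitude"] < 100),
    drop=True,
)
```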
I’m trying to understand which approach is more practical and scalable in the long run.