What are the best practices for storing data in Zarr, and how scalable is Zarr?

I know Zarr performs better, especially for cloud workflows, but I have a few practical concerns.

I already have large archives of NetCDF/HDF files, and converting everything to Zarr would itself take a significant amount of time and resources. (I have used VirtualiZarr on the same data; it works great and takes much less time than converting to an actual Zarr store.)

So my main questions are:

  • Is it better to convert each individual NetCDF/HDF file into separate Zarr stores, or should I stack everything into a single Zarr dataset?

  • I tested a single consolidated Zarr store with around 2–3 TB of gridded geostationary satellite data, and it worked very fast.

  • But what happens at much larger scales, like petabytes?
    Will a single Zarr store still be scalable and efficient, or does it become a bottleneck?

  • And what if the data is not on a single grid, i.e. it comes from LEO satellites such as CloudSat, GPM DPR, etc.? Then we cannot use standard xarray label-based selection on it, such as .sel with a slice.
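
A minimal sketch of the swath case, using synthetic data and hypothetical variable names (real CloudSat or GPM DPR products use their own). With 2-D lat/lon coordinates there is no index to run .sel(lat=slice(...)) against, but a boolean mask passed to .where still selects a bounding box:

```python
import numpy as np
import xarray as xr

# Synthetic swath: 100 scans x 20 rays with 2-D latitude/longitude.
# Dimension and variable names here are assumptions for illustration.
n_scan, n_ray = 100, 20
lat = np.linspace(-60.0, 60.0, n_scan)[:, None] + np.zeros((1, n_ray))
lon = (
    np.linspace(0.0, 40.0, n_scan)[:, None]
    + np.linspace(-1.0, 1.0, n_ray)[None, :]
)
rng = np.random.default_rng(0)

swath = xr.Dataset(
    {"reflectivity": (("scan", "ray"), rng.normal(size=(n_scan, n_ray)))},
    coords={
        "lat": (("scan", "ray"), lat),
        "lon": (("scan", "ray"), lon),
    },
)


def select_box(ds, lat_min, lat_max, lon_min, lon_max):
    """Keep pixels inside a lat/lon box; pixels outside become NaN.

    drop=True trims scans and rays that contain no selected pixel.
    """
    mask = (
        (ds["lat"] >= lat_min)
        & (ds["lat"] <= lat_max)
        & (ds["lon"] >= lon_min)
        & (ds["lon"] <= lon_max)
    )
    return ds.where(mask, drop=True)


subset = select_box(swath, 0.0, 30.0, 10.0, 30.0)
print(dict(subset.sizes))
```

The mask is evaluated over the full lat/lon arrays, so on a very large swath archive this reads every coordinate chunk; it is a workaround rather than an indexed lookup.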

I’m trying to understand which approach is more practical and scalable in the long run.


Hey there @Kaboom_Official, I can’t answer all your Q’s, but a few thoughts.

Is it better to convert each individual NetCDF/HDF file into separate Zarr stores, or should I stack everything into a single Zarr dataset?

IMO a big benefit of Zarr is the ability to have a single access point for a large data cube. This way a user doesn’t have to figure out how to concat/merge all of the NetCDFs into a dataset.

(I have used VirtualiZarr on the same data; it works great and takes much less time than converting to an actual Zarr store.)

If you’re happy with your NetCDF chunking and data pipeline, VirtualiZarr should be a great fit. If you store your virtual Zarr references in Icechunk, appending is an easy and safe operation. It gives you Zarr-like performance and convenience without a total rewrite of your data. As far as scaling goes, @TomNicholas has done some tests on this, and it seems you would need an absurd number of references before you ran into issues.
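
For a rough sense of scale, here is a back-of-the-envelope sketch of how many chunks (and so how many chunk references in a virtual store, at one reference per chunk) an archive implies. The chunk sizes are assumptions picked for illustration, not numbers from the tests mentioned above.

```python
def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to hold total_bytes, rounding up."""
    return -(-total_bytes // chunk_bytes)


MB = 10**6
TB = 10**12
PB = 10**15

# Archive sizes from the question; stored chunk sizes are hypothetical.
for label, total in [("3 TB", 3 * TB), ("1 PB", PB)]:
    for chunk_mb in (1, 10, 100):
        n = chunk_count(total, chunk_mb * MB)
        print(f"{label} at {chunk_mb} MB per chunk: {n:,} chunks")
```

The count scales inversely with chunk size, so the chunking already baked into the NetCDF files largely decides how many references a petabyte-scale virtual store has to track.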

I’m sure others can chime in with more thoughts!
