OPeNDAP vs. direct file access

I think we are missing something important here. The datasets we are talking about are typically ~10 TB in size. The “reference filesystem” takes a single NetCDF file and makes it perform at a similar speed to an equivalent Zarr. That doesn’t change the fact that we would still typically have hundreds or thousands of NetCDF files. Assembling them all into a coherent datacube would require reading the “reference filesystem” for every file and then checking coordinate alignment (i.e. whatever xr.open_mfdataset does), which makes dataset initialization expensive.
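To make that cost concrete, here is a rough sketch of the assembly step (the file pattern and options are just illustrative):

```python
import xarray as xr

# Hypothetical glob over hundreds or thousands of NetCDF files.
paths = "simulations/output_*.nc"

# open_mfdataset has to open every file, read its metadata, and verify that
# the coordinates align before it can return a single datacube. A per-file
# "reference filesystem" speeds up the reads, but this per-file work still
# scales with the number of files.
ds = xr.open_mfdataset(paths, combine="by_coords", parallel=True)
```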

The alternative is creating a single 10TB NetCDF file. However, that creates another problem: writing! How do you append a new variable or new record along a dimension to such a dataset? The “reference filesystem” approach does not help with writing at all. And what if it gets corrupted? Then your whole dataset is lost.

Zarr and TileDB both solve this by breaking the dataset up into multiple objects / files with a single source of metadata. They can hold arbitrarily large arrays with a constant initialization cost. Appending to the dataset is as simple as adding more chunks to a directory.
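For example, with Zarr an append along a dimension only writes the new chunks and updates the store metadata. A minimal sketch (the file name, dimension, and store path are made up):

```python
import xarray as xr

# Hypothetical new batch of records along the "time" dimension.
ds_new = xr.open_dataset("latest_batch.nc")

# Only the new chunks are written; the existing chunks in the store are
# never touched. The store could equally be an fsspec mapping pointing at
# cloud object storage.
ds_new.to_zarr("big-datacube.zarr", append_dim="time")
```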

Regarding the chunk-alignment problem that @rsignell brought up: it was discussed at length in Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?. Our solution was the rechunker package:

https://rechunker.readthedocs.io/

Rechunker can read NetCDF sources, but it only writes Zarr, for the specific reason that parallel distributed writes to NetCDF in the cloud remain nearly impossible. In fact, distributed writing was the original motivation for Alistair to develop Zarr in the first place!
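A minimal sketch of how that looks, assuming a dataset with a single data variable named `temperature` on (time, lat, lon); the chunk sizes and paths are made up, and in the cloud the stores would be fsspec mappings rather than local paths:

```python
import xarray as xr
from rechunker import rechunk

# Hypothetical source dataset whose chunks don't suit the analysis.
source = xr.open_zarr("big-datacube.zarr")

plan = rechunk(
    source,
    target_chunks={"temperature": {"time": 1, "lat": 720, "lon": 1440}},
    max_mem="2GB",
    target_store="rechunked.zarr",
    temp_store="rechunk-tmp.zarr",
)

# The plan runs as a Dask graph; both the temporary and final copies are
# written as Zarr, which is what makes the distributed write feasible.
plan.execute()
```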

So while the reference filesystem approach is a great development, I don’t think it obviates the value of the cloud-native formats.
