OPeNDAP vs. direct file access

I think we are missing something important here. The datasets we are talking about are typically ~10 TB in size. The “reference filesystem” takes a single NetCDF file and makes it perform at a similar speed to an equivalent Zarr. That doesn’t change the fact that we would still typically have hundreds or thousands of NetCDF files. Assembling them all into a coherent datacube would require reading the “reference filesystem” for every file and then checking coordinate alignment (i.e. whatever xr.open_mfdataset does), which makes dataset initialization expensive.
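To make that cost concrete, here is a rough sketch of the assembly step (the file pattern and options are just illustrative):

```python
import xarray as xr

# Hypothetical glob over hundreds or thousands of NetCDF files.
paths = "simulations/output_*.nc"

# open_mfdataset has to open every file, read its metadata, and verify that
# the coordinates align before it can return a single datacube. A per-file
# "reference filesystem" speeds up the reads, but this per-file work still
# scales with the number of files.
ds = xr.open_mfdataset(paths, combine="by_coords", parallel=True)
```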

The alternative is creating a single 10TB NetCDF file. However, that creates another problem: writing! How do you append a new variable or new record along a dimension to such a dataset? The “reference filesystem” approach does not help with writing at all. And what if it gets corrupted? Then your whole dataset is lost.

Zarr and TileDB both solve this by breaking the dataset up into multiple objects / files with a single source of metadata. They can hold arbitrarily large arrays with a constant initialization cost. Appending to the dataset is as simple as adding more chunks to a directory.
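For example, with Zarr an append along a dimension only writes the new chunks and updates the store metadata. A minimal sketch (the file name, dimension, and store path are made up):

```python
import xarray as xr

# Hypothetical new batch of records along the "time" dimension.
ds_new = xr.open_dataset("latest_batch.nc")

# Only the new chunks are written; the existing chunks in the store are
# never touched. The store could equally be an fsspec mapping pointing at
# cloud object storage.
ds_new.to_zarr("big-datacube.zarr", append_dim="time")
```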

Regarding the chunk-alignment problem that @rsignell brought up: it was discussed at length in Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?. Our solution was the rechunker package:

https://rechunker.readthedocs.io/

Rechunker can read NetCDF sources, but it only writes Zarr, for the specific reason that parallel distributed writes to NetCDF in the cloud remain nearly impossible. In fact, distributed writing was the original motivation for Alistair to develop Zarr in the first place!
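A minimal sketch of how that looks, assuming a dataset with a single data variable named `temperature` on (time, lat, lon); the chunk sizes and paths are made up, and in the cloud the stores would be fsspec mappings rather than local paths:

```python
import xarray as xr
from rechunker import rechunk

# Hypothetical source dataset whose chunks don't suit the analysis.
source = xr.open_zarr("big-datacube.zarr")

plan = rechunk(
    source,
    target_chunks={"temperature": {"time": 1, "lat": 720, "lon": 1440}},
    max_mem="2GB",
    target_store="rechunked.zarr",
    temp_store="rechunk-tmp.zarr",
)

# The plan runs as a Dask graph; both the temporary and final copies are
# written as Zarr, which is what makes the distributed write feasible.
plan.execute()
```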

So while the reference filesystem approach is a great development, I don’t think it obviates the value of the cloud-native formats.
