Working with file-level metadata in Zarr

Hey all,

My team is examining different ways to expose NetCDF file-level metadata. Our current approach requires us to read the entire NetCDF file into memory just to access the metadata, which does not scale. What would a similar process look like if we accessed the metadata through a Zarr array instead of directly through the original NetCDF files?


Hi Shane, and welcome to the forum! This is a topic that we have obsessed over in Pangeo, so I'm happy to share what we have learned.

The goal of quickly peeking into netCDF files was indeed one of the main reasons that brought us to experiment with Zarr. You will find that what you want to do is trivial with Zarr + Xarray.

I would recommend just converting your data to Zarr and playing around with it:

import xarray as xr
import zarr

ds_nc = xr.open_dataset('file.nc')
print(ds_nc.attrs)  # display file-level metadata
print(ds_nc.foo.attrs)  # display variable-level metadata
# consolidated metadata option makes reading faster
ds_nc.to_zarr('file.zarr', consolidated=True)  # could also be an s3 / gs path
ds_zarr = xr.open_zarr('file.zarr', consolidated=True)
print(ds_zarr.attrs)  # it's all there
print(ds_zarr.foo.attrs)

For even faster access to the metadata, bypass xarray completely:

zgroup = zarr.open_consolidated('file.zarr')  # reads only the consolidated metadata
print(dict(zgroup.attrs))  # file-level metadata
print(dict(zgroup['foo'].attrs))  # variable-level metadata
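
In case it helps to see why this is cheap: as far as I know, in a Zarr v2 store the consolidated metadata is a single JSON file named .zmetadata at the root of the store, so in principle you can read every attribute with nothing but a JSON parser. Here is a rough sketch under that assumption. The helper name read_consolidated_attrs and the hand-built demo store are made up for illustration (they are not part of zarr or xarray), and I have not checked this against other Zarr format versions:

```python
import json
import tempfile
from pathlib import Path


def read_consolidated_attrs(store_path):
    """Return {path: attrs} from a Zarr v2 consolidated-metadata file.

    Hypothetical helper, not a zarr API. The empty string "" stands
    for the root group, i.e. the file-level metadata.
    """
    with open(Path(store_path) / ".zmetadata") as f:
        consolidated = json.load(f)

    attrs = {}
    for key, value in consolidated["metadata"].items():
        # Keys look like ".zattrs" (root group) or "foo/.zattrs" (variable foo).
        if key == ".zattrs":
            attrs[""] = value
        elif key.endswith("/.zattrs"):
            attrs[key[: -len("/.zattrs")]] = value
    return attrs


# Hand-built stand-in for a real store, so the sketch runs without zarr installed.
demo_store = Path(tempfile.mkdtemp()) / "file.zarr"
demo_store.mkdir()
(demo_store / ".zmetadata").write_text(json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {"title": "demo dataset"},
        "foo/.zarray": {"shape": [4], "chunks": [4], "dtype": "<f8"},
        "foo/.zattrs": {"units": "K"},
    },
}))

all_attrs = read_consolidated_attrs(demo_store)
print(all_attrs[""])     # file-level metadata
print(all_attrs["foo"])  # variable-level metadata
```

For a store on s3 / gs the idea would be the same, a single small object to fetch, but I would normally let zarr.open_consolidated handle that rather than parsing the file by hand.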

You could also play around with different cloud-optimized formats (e.g. TileDB) or try out the just-released Zarr-enabled netCDF library.
