How to identify the array addresses in NetCDF/HDF files (for fsspec-reference-maker)?

Hi @martindurant,

I watched your excellent video about the ReferenceFileSystem and took a look at:

Do you have any code/docs/tips on how to find out where the arrays (or compressed arrays) are kept inside a NetCDF/HDF file? Can you use the netCDF4/HDF5 libraries to interrogate the internal structure in order to find these?

Thanks, Ag

Yes indeed, and this is exactly what the hdf module in that repo does, using h5py. You can run it pretty much like a script (on a pangeo-forge recipe) for individual files, and then combine the results into aggregate datasets with the combine module.
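
For a rough idea of the kind of interrogation h5py allows, here is a minimal sketch. It is not the code of the hdf module itself: `chunk_locations` and the demo file are hypothetical names made up for illustration, and `get_chunk_info` assumes an h5py build against HDF5 1.10.5 or later.

```python
import os
import tempfile

import h5py
import numpy as np


def chunk_locations(path):
    """Map each dataset name to a list of (chunk index, byte offset, byte length).

    Hypothetical helper for illustration only.
    """
    found = {}

    def visit(name, obj):
        if not isinstance(obj, h5py.Dataset):
            return  # skip groups
        dsid = obj.id  # low-level DatasetID
        if obj.chunks is None:
            # Contiguous (or compact) storage: at most one block of bytes.
            offset = dsid.get_offset()  # None if nothing is allocated in the file
            if offset is None:
                found[name] = []
            else:
                found[name] = [((0,) * obj.ndim, offset, dsid.get_storage_size())]
        else:
            # Chunked storage: one entry per stored (possibly compressed) chunk.
            entries = []
            for i in range(dsid.get_num_chunks()):
                info = dsid.get_chunk_info(i)
                entries.append((info.chunk_offset, info.byte_offset, info.size))
            found[name] = entries

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found


# Build a small gzip-compressed demo file so the sketch runs without real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
demo_data = np.arange(100, dtype="f4").reshape(10, 10)
with h5py.File(demo_path, "w") as f:
    f.create_dataset("t2m", data=demo_data, chunks=(5, 10), compression="gzip")

locations = chunk_locations(demo_path)
for index, offset, length in locations["t2m"]:
    print(f"chunk {index}: {length} bytes at offset {offset}")
```

Each (offset, length) pair is the byte range of one stored chunk, which is the information a reference file needs in order to fetch that chunk directly.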

Thanks @martindurant, that’s great.

Can it do all this by only reading the NetCDF Header?

Correct, it only reads the metadata (which is generally spread throughout the file - it’s only all in the header if you are lucky).

Thanks @martindurant

It worked first time on a random ERA5 file I tested. Looks really promising.

Great! Keep us informed or open an issue at reference-maker with problems or success 🙂

I recommend using cache_type="first" when using fsspec to open files to pass to h5py, since most of the metadata tends to be at the start of the file, and other metadata pieces are randomly scattered, so read-ahead doesn’t help.
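
A minimal sketch of that pattern, assuming a local demo file so it runs without network access (the file name and dataset are made up for illustration). Local files are not buffered by fsspec, so `cache_type` has no effect here; the same `fs.open` call on a remote filesystem such as S3 or HTTP is where the first-chunk cache matters.

```python
import os
import tempfile

import fsspec
import h5py
import numpy as np

# Hypothetical demo file standing in for a remote NetCDF4/HDF5 file.
local_path = os.path.join(tempfile.mkdtemp(), "demo_first.h5")
with h5py.File(local_path, "w") as f:
    f.create_dataset("x", data=np.arange(12).reshape(3, 4))

# For remote data this would be e.g. fsspec.filesystem("s3") or "https".
fs = fsspec.filesystem("file")

# cache_type="first" keeps only the first block of the file cached;
# other reads are fetched individually instead of triggering read-ahead.
with fs.open(local_path, "rb", cache_type="first") as fobj:
    with h5py.File(fobj, "r") as h5:
        names = sorted(h5.keys())
        shape = h5["x"].shape

print(names, shape)
```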
