Avoid metadata reads when loading many similar NetCDF files

I’m working with someone who has lots of NetCDF files on S3. They’ve got fsspec file objects pointing at those files and they’re passing them into xarray.open_mfdataset:

import xarray

fileobjs = [...]  # fsspec open file objects for the NetCDF files on S3
ds = xarray.open_mfdataset(fileobjs)

Reading the metadata here is quite slow (several minutes). What do people typically do in this case? I could imagine a couple options:

  1. Is there some way to say “Hey, these files all have the same dimensions, except for time, which is offset by one day per file”? Xarray might then read the first few files to verify my claim, but extrapolate from there.
  2. Is this what the parallel=True keyword is for? I’m guessing that it loads the metadata remotely on Dask workers?
  3. Something else even more clever?

Thanks for any advice

It does sound like you might be looking for kerchunk. With that, reading the metadata will not be any faster the first time, but you’re only doing it once: any subsequent open only needs to read the reference file (JSON or Parquet), which is usually much faster.
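If it helps, here is a minimal sketch of that one-time reference-building step with kerchunk, assuming the files are NetCDF4/HDF5 on S3 and concatenate along time (the bucket path, anon=True, and the output filename are placeholders, not from your setup):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# One-time scan: pull each file's metadata out as Zarr-style references.
fs = fsspec.filesystem("s3", anon=True)  # assumed protocol/credentials
urls = ["s3://" + p for p in fs.glob("s3://my-bucket/data/*.nc")]  # placeholder path

refs = []
for url in urls:
    with fs.open(url) as f:
        refs.append(SingleHdf5ToZarr(f, url).translate())

# Combine the per-file references along the time dimension.
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# Save the result; later opens only need this one small file.
with open("combined.json", "w") as f:
    json.dump(combined, f)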

As for 2: yes, that’s exactly what it does. It wraps the open_dataset calls (and the preprocess function, if any) in dask.delayed (or something equivalent if you configured a different chunk manager), computes them, and combines the resulting datasets.
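For completeness, a minimal sketch of option 2 (the local cluster here is just an illustration; any Dask scheduler works):

import xarray as xr
from dask.distributed import Client

client = Client()  # workers will run the per-file open_dataset calls

ds = xr.open_mfdataset(
    fileobjs,
    parallel=True,        # wrap open_dataset (and preprocess) in dask.delayed
    combine="by_coords",  # the default; combines the per-file datasets afterwards
)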


kerchunk is the way. In the meantime, you can optimize open_mfdataset a little bit: Reading and writing files (see the Note), but it will still open every single file.

There are a number of options in open_mfdataset, e.g. compat="override", that might save you some time. However, xarray will still need to read at least the concat coordinate from each file, which means opening each file and reading its basic metadata to find where those coords live. Kerchunk is faster because that work is done once beforehand, and afterwards you only need to load the kerchunk metadata in a single fetch.
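A sketch of those options, following the note in the xarray docs; the concat dimension name and the assumption that the files are homogeneous and already ordered are mine, not something xarray can verify cheaply for you:

import xarray as xr

ds = xr.open_mfdataset(
    fileobjs,
    combine="nested",
    concat_dim="time",       # assumes the files are already ordered along time
    data_vars="minimal",     # only concatenate variables that contain the concat dim
    coords="minimal",
    compat="override",       # skip equality checks; take values from the first file
    join="override",
    parallel=True,
)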

Alternatively, if you already know everything about the arrays and your coordinates, you can construct the combined dask arrays yourself and wrap them in a Dataset. I assume this is not what you are after.
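For completeness, here is roughly what that manual route looks like; the variable and dimension names, shapes, dtype, and start date are all assumptions standing in for what you would already know about your files:

import dask
import dask.array as da
import pandas as pd
import xarray as xr

# Read metadata from a single file to use as a template (assumed layout:
# one daily time step of a variable "t2m" on a lat/lon grid per file).
template = xr.open_dataset(fileobjs[0])
ny, nx = template.sizes["lat"], template.sizes["lon"]

def load_one(fileobj):
    # Runs only when the dask graph is computed.
    with xr.open_dataset(fileobj) as ds:
        return ds["t2m"].values

# One lazy (1, ny, nx) block per file, concatenated along time.
blocks = [
    da.from_delayed(dask.delayed(load_one)(f), shape=(1, ny, nx), dtype="float32")
    for f in fileobjs
]
data = da.concatenate(blocks, axis=0)

# The "one day per file" offset from the original question; start date assumed.
time = pd.date_range("2020-01-01", periods=len(fileobjs), freq="D")

ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), data)},
    coords={"time": time, "lat": template["lat"], "lon": template["lon"]},
)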

Another thing to consider about why this is slow over HTTP is how the metadata is structured in the files. Most of these datasets use default layouts intended for local access; e.g. some NASA datasets like ICESAT-2 or data from MERRA (just to mention some) have this issue. If we run h5stat on them, we’ll notice that the File space page size for metadata is set to 4 KiB!

When the client libraries (h5netcdf, h5py, etc.) read these files from disk this doesn’t really matter, but when each of those reads becomes an HTTP range request, things get really slow. Kerchunk will definitely help; perhaps it could even be shipped with xarray as an optional operation for open_mfdataset(), e.g.


ds = xr.open_mfdataset(
    files,
    kerchunk=True,
    store_consolidated_metadata="./some_dir/",
)

Then the next time we use these files it won’t be so slow.
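To make that concrete with what exists today (no hypothetical keyword needed), reopening from a previously written reference file looks roughly like this; the filename and the S3 options are assumptions:

import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",      # the reference file written by kerchunk
            "remote_protocol": "s3",    # where the actual NetCDF bytes live
            "remote_options": {"anon": True},
        },
    },
)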

Lastly, HDF5 and NetCDF files could be optimized for cloud access too, but that would require data providers to reprocess entire datasets, and that’s not so simple because reasons…