Passing file-specific arguments to the xr.open_mfdataset preprocess function

Hi all, I was wondering if anyone had advice on passing file-specific arguments to the preprocess function of xarray's xr.open_mfdataset. I know you can use functools.partial() to pass arguments to the preprocess function, but as far as I can tell those arguments are the same for every file rather than specific to each one.

The workflow I often run into with model data is ensemble analysis, which usually means creating a new ensemble dimension to take advantage of many of xarray's functions (such as weighted averages). With nice, clean CMIP data you could inspect the global NetCDF metadata for CMOR tags like source_id, variant_label, and experiment_id and easily write a preprocess function that does a ds.expand_dims(). However, the data I work with often isn’t clean and doesn’t include global attributes that are useful for building an ensemble dimension.
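For the clean CMIP case, a minimal sketch of such a preprocess function might look like the following. The function names, the "member" dimension name, and the choice to join the three attributes with dots are all assumptions for illustration; it also assumes every file carries all three attributes.

```python
import xarray as xr


def add_member_dim(ds: xr.Dataset) -> xr.Dataset:
    """Add a length-1 'member' dimension built from CMOR global attributes."""
    # Assumption: all three attributes exist; a missing one raises KeyError.
    keys = ("source_id", "experiment_id", "variant_label")
    member = ".".join(str(ds.attrs[key]) for key in keys)
    # expand_dims with a sequence value also creates the matching coordinate.
    return ds.expand_dims(member=[member])


def open_cmip_ensemble(paths):
    """Open many files at once, stacking them along the new 'member' dimension."""
    return xr.open_mfdataset(
        paths,
        preprocess=add_member_dim,
        combine="nested",
        concat_dim="member",
        parallel=True,  # requires dask
    )
```

Each file then contributes one slice along "member", so something like ds.mean("member") works on the combined result.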

What I tend to do in these circumstances is write a simple for loop over each model, add the metadata I need (say from a DataFrame), put these xr.Datasets in a list, and combine them with xr.merge or xr.combine_by_coords. It works, but it isn’t very efficient, since the files are loaded serially rather than opened in parallel with open_mfdataset. It seems like a clever use of the preprocess function with arguments specific to each file would be better than a for loop.

Thoughts?

There are only 2 ways to pass file-specific context to the preprocess function:

  1. Use information that is contained within the Dataset itself (e.g. file-specific metadata),
  2. Use the file-name, which the docs tell you how to access:

    You can find the file-name from which each dataset was loaded in ds.encoding["source"]

If your file names follow some pattern (for example, they are numbered), your preprocess function can read the name from ds.encoding["source"] and, with your DataFrame bound via functools.partial, pull out only the metadata you want for that file.
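As a rough sketch of the second option: the same DataFrame is bound to the preprocess function with functools.partial, and the per-file lookup happens inside the function using ds.encoding["source"]. The function names, the "member" column, and the assumption that the DataFrame is indexed by bare file name are all hypothetical; adapt them to however your table is laid out.

```python
from functools import partial
from pathlib import Path

import pandas as pd
import xarray as xr


def add_member_from_table(ds: xr.Dataset, table: pd.DataFrame) -> xr.Dataset:
    """Label one file's dataset using the row of `table` that matches its file name."""
    # xarray stores the path the dataset was loaded from in ds.encoding["source"].
    filename = Path(ds.encoding["source"]).name
    # Assumption: `table` is indexed by file name and has a "member" column.
    row = table.loc[filename]  # raises KeyError if the file is not in the table
    return ds.expand_dims(member=[row["member"]])


def open_ensemble_from_table(paths, table: pd.DataFrame) -> xr.Dataset:
    """Open all files in a single open_mfdataset call instead of a for loop."""
    return xr.open_mfdataset(
        paths,
        # The table is shared by every file; the file-specific part is the lookup.
        preprocess=partial(add_member_from_table, table=table),
        combine="nested",
        concat_dim="member",
        parallel=True,  # requires dask
    )
```

So the argument passed through partial is not file-specific, but it doesn't need to be: the file name acts as the key into whatever shared lookup you pass in.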

However, if you’re doing something this complicated over and over for the same set of files, you might want to consider using VirtualiZarr to effectively cache the result of your open_mfdataset call, so that every subsequent access can go through xr.open_zarr instead.
