Problems with operations encompassing multiple times in CM 2.6 data

Hi @andrewbrettin – welcome to Pangeo and thanks for the excellent question! You’ve stumbled onto one of the most challenging and confusing aspects of working with xarray / dask on large datasets.

You are correct in your intuition that this is related to chunking. However, the fundamental problem is not really dask itself; it’s how the data is laid out on disk. You didn’t specify how you are reading the data, but I’ll assume you are reading from netCDF files. The standard way these files are written is with all the spatial points in one file, but with the time dimension spread over many files. Even though you only want a timeseries at a single point, you still have to open and read all of those files.
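To make that concrete, here is a minimal sketch (the file pattern and variable name `temp` are just placeholders, not the actual CM 2.6 names):

import glob
import xarray as xr

# Hypothetical layout: one file per year, each containing the full spatial grid
files = sorted(glob.glob('ocean_temp_*.nc'))
ds = xr.open_mfdataset(files)

# This selection is lazy, but computing the timeseries still has to
# open and read a piece of every single file in `files`
point_ts = ds['temp'].isel(xt_ocean=500, yt_ocean=300)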

The ability to use Xarray and Dask to open all the files as a single object obscures this fundamental fact and makes it seem simpler than it really is to get a timeseries. This problem is discussed at length in this post:

I assume here you mean that you did something like

ds_rechunked = ds.chunk({'time': -1, 'xt_ocean': 1, 'yt_ocean': 1})

It’s not surprising that this had no effect on performance: .chunk is a “lazy” operation; it just provides a different view of the same data on disk. In order to actually load the chunks of your “rechunked” dataset, you still have to open and read all of the netCDF files.
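You can see this by inspecting the dask chunks before and after (again just a sketch, with `temp` as a placeholder variable name):

print(ds['temp'].data.chunks)  # chunk structure determined by how the files were opened

ds_rechunked = ds.chunk({'time': -1, 'xt_ocean': 1, 'yt_ocean': 1})
print(ds_rechunked['temp'].data.chunks)  # new chunk structure, but nothing has been read yet

# Any actual computation, like pulling out one point, still has to open
# and read data from every netCDF file along the time axis
ds_rechunked['temp'].isel(xt_ocean=0, yt_ocean=0).load()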

One possible workaround, which applies only if your netCDF files do not use internal netCDF4 chunking, is to specify smaller spatial chunks when you open the dataset:

ds = xr.open_mfdataset('files_*.nc', chunks={'xt_ocean': 100, 'yt_ocean': 100})

This way, extracting your timeseries no longer forces you to load the full spatial grid into memory; you only load the 100x100 chunks containing your point, which may be a bit faster. The downside is that you will create many, many more chunks.
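With that chunking, pulling out the timeseries looks something like this (the index values and variable name are arbitrary placeholders):

# Only the 100x100 chunk containing this point is read from each file,
# rather than the full spatial grid
point_ts = ds['temp'].isel(xt_ocean=500, yt_ocean=300)
point_ts.load()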

If you really want just a single point, you could instead use a preprocess function to select only the point you need from each file:

def select_point(ds):
    # select a single grid point from each file before they are combined
    return ds.isel(yt_ocean=0, xt_ocean=0)

ds = xr.open_mfdataset('files_*.nc', preprocess=select_point)

However, if the underlying data files do have internal chunking, this approach won’t really speed things up, since each internal chunk still has to be read in full even if you only keep a single point from it.
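You can check whether your files use internal chunking with the netCDF4 library (the file and variable names here are just placeholders):

import netCDF4

with netCDF4.Dataset('files_001.nc') as nc:
    # returns 'contiguous' if there is no internal chunking,
    # otherwise a list of chunk sizes for each dimension
    print(nc.variables['temp'].chunking())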

A more robust solution, if you need to do lots of timeseries analysis on a dataset that is chunked in time, is to create a rechunked copy of your data on disk. That’s what was suggested in the post linked above. The outcome of that discussion was the creation of a new package for this purpose:

It’s not quite ready for use but should be released very soon.
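Just to give a flavor, a rechunking workflow with it might look roughly like this. Treat it as a sketch: the API is still evolving, and the store names, chunk sizes, and memory limit are placeholders.

import zarr
from rechunker import rechunk

# Hypothetical source: a Zarr copy of the data with the same layout as the
# netCDF files, i.e. full spatial grid per chunk, split along time
source = zarr.open('cm26_temp.zarr')

# Target layout: contiguous in time, small chunks in space,
# which is what you want for timeseries analysis
target_chunks = (7300, 100, 100)  # (time, yt_ocean, xt_ocean)

plan = rechunk(
    source,
    target_chunks,
    max_mem='4GB',
    target_store='cm26_temp_rechunked.zarr',
    temp_store='cm26_temp_tmp.zarr',
)
plan.execute()  # runs the rechunking with dask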

I hope this helps!