Problems with operations encompassing multiple times in CM 2.6 data

Hi Pangeo,

I am trying to work with GFDL CM 2.6 sea surface heights, and I have been having trouble performing some basic operations on this data. I am running Python 3.7 on my local machine. The data array of interest contains 3600x2700 lat/long points over 5844 times.

I have noticed that operations across space take virtually no time:

In[3] %time sea_level_data.isel(time=0).plot()
CPU times: user 1.73 s, sys: 836 ms, total: 2.56 s
Wall time: 6.44 s
Out[3]: <matplotlib.collections.QuadMesh at 0x7ffb64569110>

But operations across several times (such as the following snippet) take forever, even for a single location:

single_location = sea_level_data.sel(yt_ocean=0, xt_ocean=0, method='nearest')
single_location.mean(dim='time').compute()

I originally thought this was because of how Dask chunks the dataset into separate times, but rechunking the data doesn’t make the computation run any faster. It might have to do with how the data is stored.

Does anyone have any suggestions about how I can quickly perform operations involving multiple times? For instance, I would eventually like to have plots of sea level anomaly (removing time mean) and histograms of sea level at particular spatial coordinates.

Best,

Andrew

Hi @andrewbrettin – welcome to Pangeo and thanks for the excellent question! You’ve stumbled onto one of the most challenging and confusing aspects of working with xarray / dask on large datasets.

You are correct in your intuition that this is related to chunking. However, the fundamental problem is not really dask itself; it’s how the data is laid out on disk. You didn’t specify how you are reading the data, but I’ll assume you are reading from netCDF files. The standard way these files are written is with all the spatial points in one file, but with the time dimension spread over many files. So even though you only want a timeseries at a single point, you still have to open and read all of the files.
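
To make that concrete, here is a minimal sketch (the file pattern and variable name are placeholders, not the actual CM2.6 files) of how the chunk layout ends up mirroring the file layout:

import xarray as xr

# Hypothetical layout: one netCDF file per time slice, each holding the full spatial field
ds = xr.open_mfdataset('ocean_*.nc', combine='by_coords', parallel=True)
sla = ds['sea_level']            # placeholder variable name

print(sla.chunks)                # typically one chunk per file along time,
                                 # each spanning all of xt_ocean / yt_ocean

# Selecting a single point is lazy and looks cheap...
point = sla.sel(xt_ocean=0, yt_ocean=0, method='nearest')

# ...but computing it still opens and reads every file,
# because each time step lives in a different file/chunk
timeseries = point.compute()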

The ability to use Xarray and Dask to open all the files as a single object obscures this fundamental fact and makes extracting a timeseries look simpler than it really is. This problem is discussed at length in this post:

I assume here you mean that you did something like

ds_rechunked = ds.chunk({'time': -1, 'xt_ocean': 1, 'yt_ocean': 1})

It’s not surprising that this had no effect on performance. It’s a “lazy” operation. It provides a different view of the same data on disk. In order to actually load the chunks of your “rechunked” dataset, you still have to open and read all of the netCDF files.
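
As a sketch (reusing the placeholder dataset from above), the rechunk shows up immediately in the metadata but costs nothing until you compute:

ds_rechunked = ds.chunk({'time': -1, 'xt_ocean': 1, 'yt_ocean': 1})

print(ds_rechunked['sea_level'].chunks)   # new logical layout, nothing read yet

# Computing still has to load the original on-disk chunks (full spatial field
# per time step) before dask can reassemble them into the new layout
ts = ds_rechunked['sea_level'].sel(xt_ocean=0, yt_ocean=0, method='nearest').compute()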

One possible workaround, which applies only if your netCDF files do not use netCDF4 chunks, is to specify smaller spatial chunks when you create the dataset:

ds = xr.open_mfdataset('files_*.nc', chunks={'xt_ocean': 100, 'yt_ocean': 100})

This avoids having to load the full spatial field into memory while extracting your timeseries; instead, you only load the relevant 100x100 chunks, which may be a bit faster. The downside is that you create many, many more chunks.
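
For a sense of scale: with the 3600x2700 grid here, 100x100 chunks give 36 x 27 = 972 spatial chunks per time chunk, so if the time axis comes out as one chunk per time step you end up with roughly 972 x 5844 ≈ 5.7 million chunks, and dask’s per-chunk scheduling overhead starts to matter.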

If you really want just one single point, you could try to use a preprocessing function to select only the point you need:

def select_point(ds):
    # Pull out a single grid point from each file before they are combined
    return ds.isel(yt_ocean=0, xt_ocean=0)

ds = xr.open_mfdataset('files_*.nc', preprocess=select_point)

However, if the underlying data files do have internal chunking, this approach won’t really speed things up, since each chunk still has to be read in full.

A more robust solution, if you need to do lots of timeseries analysis on a dataset that is chunked in time, is to create a rechunked copy of your data on disk. That’s what was suggested in the post linked above. The outcome of that discussion was the creation of a new package for this purpose:

It’s not quite ready for use but should be released very soon.
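
As a rough sketch of what rechunking on disk means (written here with plain xarray + Zarr rather than that package, with made-up chunk sizes, and ignoring the memory pressure this naive approach can create):

# Write a copy of the data chunked along space instead of time
rechunked = ds['sea_level'].chunk({'time': -1, 'xt_ocean': 100, 'yt_ocean': 100})
rechunked.to_dataset(name='sea_level').to_zarr('sea_level_rechunked.zarr', mode='w')

# Timeseries extraction is then cheap, because a single chunk now holds
# the full time axis for a 100x100 patch of ocean
ds2 = xr.open_zarr('sea_level_rechunked.zarr')
ts = ds2['sea_level'].sel(xt_ocean=0, yt_ocean=0, method='nearest').compute()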

I hope this helps!

Hi @rabernat - thanks for your prompt and detailed response!

I’m using intake.open_catalog() to open the catalog at https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/GFDL_CM2.6.yaml. I’m not using xarray’s open_dataset function, so I’m not sure if I can use the workaround that you discussed:

One possible workaround, which applies only if your netCDF files do not use netCDF4 chunks, is to specify smaller spatial chunks when you create the dataset: ds = xr.open_mfdataset('files_*.nc', chunks={'xt_ocean': 100, 'yt_ocean': 100})

I’m not very familiar with intake catalogs yet so I don’t know if this applies here. I suppose the data is stored in zarr format, not netCDF?

Correct. The data are stored in Zarr, and the chunk sizes are already set and fixed. So there is no option other than rechunking.
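
If it helps to see the fixed chunking directly, something along these lines will show it (the catalog entry name below is a guess; check list(cat) for the real ones):

import intake

# Catalog URL from the original post
url = ('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/'
       'master/intake-catalogs/ocean/GFDL_CM2.6.yaml')
cat = intake.open_catalog(url)
print(list(cat))                                         # the entries the catalog actually provides

ds = cat['GFDL_CM2_6_control_ocean_surface'].to_dask()   # hypothetical entry name
print(ds.chunks)                                         # the fixed on-disk Zarr chunk layout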

Since you’re working on Google-Cloud-based data, it would be most efficient to work from within the cloud, i.e. via ocean.pangeo.io or another compute instance in Google Cloud.

Okay, great! And just to spell things out for me: it sounds from this post like access to the free deployment at ocean.pangeo.io is going down over the next few weeks. So the best way for me to do this would be to set up my own Google Cloud account and follow the instructions here to create my own deployment, correct?

Ocean will only be down for a few days.

That is kind of our default answer. But to be honest, it doesn’t make a lot of sense to set up a deployment for one person. It’s a lot of work, and those instructions are out of date. We don’t really have a good answer right now for a single person who wants to use Pangeo in the cloud on their own cloud account.

Sorry for all of the uncertain and incomplete answers… We still have a lot of work to do to really make this vision of on-demand big data analysis for everyone a reality.
