Error re-chunking MUR SST Zarr dataset on Pangeo hub using shared filesystem

I am looking to rechunk the MUR SST dataset, which is currently published to AWS public datasets in Zarr format with a chunk configuration of {'time': 5, 'lat': 1799, 'lon': 3600}, so that it has larger chunks in the time dimension. I am planning to test 60 x 1799 x 3600, 180 x 1023 x 2047 and 379 x 439 x 360 (time x lat x lon).
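For a rough sense of scale, here is a minimal sketch that compares the element count and approximate uncompressed size of the current chunks and the three candidates. The 4-byte item size is an assumption for illustration only, not something taken from the dataset, so swap in the real dtype size before relying on the numbers.

```python
from math import prod

# Assumption: 4 bytes per value. Replace with the dataset's actual dtype size.
ITEMSIZE_BYTES = 4

# Chunk shapes as (time, lat, lon), taken from the post above.
CHUNK_SHAPES = {
    "current": (5, 1799, 3600),
    "candidate A": (60, 1799, 3600),
    "candidate B": (180, 1023, 2047),
    "candidate C": (379, 439, 360),
}


def chunk_megabytes(shape, itemsize=ITEMSIZE_BYTES):
    """Approximate uncompressed size of one chunk in megabytes (10**6 bytes)."""
    return prod(shape) * itemsize / 1e6


for name, shape in CHUNK_SHAPES.items():
    print(f"{name}: {prod(shape):,} elements, ~{chunk_megabytes(shape):.1f} MB per chunk")
```

Each worker has to hold at least one source chunk and one target chunk in memory at a time, so this kind of estimate is mainly useful for checking the candidates against the per-worker memory limit.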

If possible, I would like to do this using an existing Pangeo hub, since these resources are already part of the Pangeo efforts. However, I have run into an error using the shared file system which appears on at least 1 of N dask workers:

ValueError: array not found at path 'mask'

When I inspect the worker logs, it's usually just 1 of N workers that sees this error, which makes me think there is latency in the NFS mount consistency. When I look at the files on each worker afterwards, though, they look the same. Could this be a race condition?
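One way to test the consistency theory would be to ask every worker, at the same moment, whether it can see the array's metadata file. This is only a sketch: it assumes the store is a Zarr v2 directory store (where each array has a `.zarray` file), the helper name `zarr_array_visible` is made up for illustration, and the store path in the usage comment is a placeholder.

```python
import os


def zarr_array_visible(store_path, array_name):
    """Return True if this process can see the array's Zarr v2 metadata file.

    Assumes a Zarr v2 directory store, where an array named `array_name`
    is described by `<store_path>/<array_name>/.zarray`.
    """
    return os.path.exists(os.path.join(store_path, array_name, ".zarray"))


# Usage on a running cluster (not executed here; the path is a placeholder):
#
#   from dask.distributed import Client
#   client = Client(cluster)  # `cluster` is your existing Dask cluster object
#   client.run(zarr_array_visible, "/path/to/shared/store.zarr", "mask")
#
# Client.run calls the function on every worker and returns a dict keyed by
# worker address, so a single False entry would point at the lagging mount.
```

Running the check right after the store is created, and again a few seconds later, would show whether a worker that initially reports False eventually catches up.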

I’m using https://aws-uswest2.pangeo.io/, with 4 workers configured to share the volume mount and each allocated 62GB memory (since 64 is the max per instance available, as I understand)

The notebook with the error and corresponding dask config file are here: https://gist.github.com/abarciauskas-bgse/4a6cdda3bbaa29da80aa4e10d5532b45

cc @rsignell @scottyhq

Any ideas welcome.

Responded here: https://github.com/pangeo-data/pangeo/issues/765