I am looking to rechunk the MUR SST dataset, which is currently published to the AWS public datasets in Zarr format with a chunk configuration of {'time': 5, 'lat': 1799, 'lon': 3600}, into chunks that are larger in the time dimension. I am planning to test (time x lat x lon) chunk shapes of 60 x 1799 x 3600, 180 x 1023 x 2047 and 379 x 439 x 360.
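For reference, here is a minimal sketch of how the three candidate shapes could be expressed as a per-variable target_chunks dict of the kind rechunker accepts for an xarray Dataset. The helper name build_target_chunks and the example variable layout are hypothetical; in practice the layout would be read from the opened dataset. Following the pattern of rechunker's xarray example, 3-D (time, lat, lon) variables get a chunk dict and everything else (e.g. 1-D coordinates) maps to None.

```python
# Candidate (time, lat, lon) chunk shapes from the post.
CANDIDATE_SHAPES = [(60, 1799, 3600), (180, 1023, 2047), (379, 439, 360)]
SPACE_TIME_DIMS = ("time", "lat", "lon")


def build_target_chunks(variable_dims, shape):
    """Build a rechunker-style target_chunks dict (hypothetical helper).

    variable_dims: mapping of variable name -> tuple of dimension names,
        e.g. {name: var.dims for name, var in ds.variables.items()}.
    shape: (time, lat, lon) chunk sizes to apply to the 3-D variables.
    """
    dim_chunks = dict(zip(SPACE_TIME_DIMS, shape))
    target = {}
    for name, dims in variable_dims.items():
        if tuple(dims) == SPACE_TIME_DIMS:
            # 3-D data variable: rechunk to the candidate shape.
            target[name] = dict(dim_chunks)
        else:
            # Coordinates and other variables map to None, as in rechunker's xarray example.
            target[name] = None
    return target


if __name__ == "__main__":
    # Hypothetical variable layout; read the real one from the opened dataset.
    example_vars = {
        "analysed_sst": ("time", "lat", "lon"),
        "mask": ("time", "lat", "lon"),
        "time": ("time",),
        "lat": ("lat",),
        "lon": ("lon",),
    }
    for shape in CANDIDATE_SHAPES:
        print(shape, build_target_chunks(example_vars, shape))
```

The resulting dict could then be passed as target_chunks to rechunker.rechunk(ds, target_chunks, max_mem, target_store, temp_store=...), with .execute() called on the returned plan.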
If possible, I would like to do this using an existing Pangeo hub, since these resources are already part of the Pangeo efforts. However, I have run into an error when using the shared file system; it appears on at least 1 of the N Dask workers:
ValueError: array not found at path 'mask'
When I inspect the worker logs, it’s usually just 1 of the N workers that sees this error, which makes me think there is a delay before writes become consistent across the NFS mount. Yet when I look at the files from each worker, they look the same. Could this be a race condition?
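One way to probe the race-condition hypothesis is to have every worker check whether the array's metadata is visible, retrying with exponential backoff. This is a minimal diagnostic sketch, assuming a Zarr v2 directory store, where each array's metadata lives in a .zarray file inside the array's directory; the function name and timeout values are hypothetical.

```python
import os
import time


def wait_for_zarr_array(store_path, array_name, timeout=30.0, initial_delay=0.5):
    """Return True once <store_path>/<array_name>/.zarray is visible, False on timeout.

    Assumes a Zarr v2 directory store (hypothetical diagnostic helper).
    """
    meta = os.path.join(store_path, array_name, ".zarray")
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        if os.path.exists(meta):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep no longer than the time left, then double the delay (exponential backoff).
        time.sleep(min(delay, remaining))
        delay *= 2
```

On the cluster, something like client.run(wait_for_zarr_array, "/path/to/store.zarr", "mask", timeout=10) would run the check on every worker and return a dict keyed by worker address (the store path here is a placeholder). A False entry for only some workers would be consistent with delayed visibility on the NFS mount.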
I’m using https://aws-uswest2.pangeo.io/, with 4 workers configured to share the volume mount, each allocated 62 GB of memory (since 64 GB is the maximum available per instance, as I understand it).
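As a rough sanity check on memory, the uncompressed size of one chunk for each shape can be compared with the 62 GB per worker. This sketch assumes 4-byte (float32) data variables, which is an assumption rather than something confirmed for every variable in the store, and treats 62 GB as decimal gigabytes.

```python
from math import prod

# Assumption: 4-byte (float32) data variables; adjust ITEMSIZE for other dtypes.
ITEMSIZE = 4
# 62 GB per worker, from the post, treated as decimal gigabytes.
WORKER_MEMORY_BYTES = 62 * 10**9

# (time, lat, lon) chunk shapes: the published layout plus the three candidates.
CHUNK_SHAPES = {
    "published": (5, 1799, 3600),
    "candidate 1": (60, 1799, 3600),
    "candidate 2": (180, 1023, 2047),
    "candidate 3": (379, 439, 360),
}


def chunk_nbytes(shape, itemsize=ITEMSIZE):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(shape) * itemsize


if __name__ == "__main__":
    for label, shape in CHUNK_SHAPES.items():
        nbytes = chunk_nbytes(shape)
        share = nbytes / WORKER_MEMORY_BYTES
        print(f"{label}: {shape} -> {nbytes / 1e9:.2f} GB per chunk ({share:.1%} of one worker)")
```

Under the float32 assumption the two largest candidates come to roughly 1.5 GB per chunk, small relative to 62 GB per worker. Rechunker also expects each source and target chunk to fit within its max_mem argument, so these sizes bound how small max_mem can be set.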
The notebook with the error and corresponding dask config file are here: https://gist.github.com/abarciauskas-bgse/4a6cdda3bbaa29da80aa4e10d5532b45
Any ideas welcome.