Thank you for the motivation and support from the Pangeo community.
I did some preliminary performance testing with Zarr/Dask on the GFDL post-processing and analysis cluster, and I was hoping to get some feedback on the following.
Zarr shows remarkable performance compared with NetCDF. However, when I scaled up to about 100 workers, I expected throughput to keep increasing, and it did not. I can think of a few possible reasons why that happened in this particular instance, but I would like to hear your thoughts and suggestions so I can improve the setup and keep learning. More information is below.
Total size of the dataset: 46 GB
Zarr.info
<bound method Dataset.info of <xarray.Dataset>
Dimensions:   (bnds: 2, lat: 360, lon: 576, time: 105192)
Coordinates:
    height    float64 ...
  * lat       (lat) float64 -89.75 -89.25 -88.75 -88.25 ... 88.75 89.25 89.75
  * lon       (lon) float64 0.3125 0.9375 1.562 2.188 ... 358.4 359.1 359.7
  * time      (time) object 2015-01-01 03:00:00 ... 2051-01-01 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds  (time, lat, bnds) float64 dask.array<chunksize=(58440, 360, 2), meta=np.ndarray>
    lon_bnds  (time, lon, bnds) float64 dask.array<chunksize=(58440, 576, 2), meta=np.ndarray>
    tas       (time, lat, lon) float32 dask.array<chunksize=(10, 360, 576), meta=np.ndarray>
Attributes:
    ...
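To make the test concrete, the read pattern is roughly the following; the store path and the particular reduction shown here are placeholders (the actual computation is in the notebook linked at the end), and it assumes a Dask distributed client is already connected to the cluster described below.

```python
import time
import xarray as xr

# Hypothetical path to the Zarr store (placeholder for illustration).
ds = xr.open_zarr("tas_store.zarr", consolidated=True)

# Simple throughput test: reduce the tas variable over time and
# measure wall-clock time on the current Dask cluster.
start = time.perf_counter()
result = ds["tas"].mean(dim="time").compute()
elapsed = time.perf_counter() - start

nbytes = ds["tas"].nbytes
print(f"read + reduced {nbytes / 1e9:.1f} GB in {elapsed:.1f} s "
      f"(~{nbytes / 1e9 / elapsed:.2f} GB/s)")
```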
SLURMCluster(queue='batch', memory='48GB', project='xx', cores=6, walltime='2:60:00') ...
The cluster size is scaled from 40 to 100 workers.
The size of the dataset is held fixed as we scale up the cluster.
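For context, the cluster setup is along these lines; treat this as a sketch, since the exact call sequence (e.g. whether `cluster.scale()` or adaptive scaling is used) is in the notebook, and the project name is redacted as 'xx'.

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Cluster configuration as described above.
cluster = SLURMCluster(
    queue="batch",
    memory="48GB",      # per-job memory request
    project="xx",       # redacted project/account name
    cores=6,
    walltime="2:60:00",
)

# Scale from 40 up to 100 workers; the dataset size stays fixed while scaling.
cluster.scale(100)

client = Client(cluster)
print(client)
```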
An image showing the chunk structure and the computation time is attached.
The notebook can be found here.