Feedback on Zarr performance benchmarking

Thank you for the motivations and support from the Pangeo community.
I did some preliminary performance testing using Zarr/dask, on GFDL post-processing and analysis cluster…and I was hoping to get some feedback on the following.

There is a remarkable performance with zarr (versus NetCDF). When I used about 100 workers, I expected the throughput to increase further though. While I think there can be a few reasons as to why this did not happen in this particular instance… I wanted to know your thoughts and suggestions so I can improve and continue learning this. Please find below more info.

Total size of the dataset: 46GB

<bound method of <xarray.Dataset>

Dimensions: (bnds: 2, lat: 360, lon: 576, time: 105192)


height    float64 ...
  • lat (lat) float64 -89.75 -89.25 -88.75 -88.25 … 88.75 89.25 89.75

  • lon (lon) float64 0.3125 0.9375 1.562 2.188 … 358.4 359.1 359.7

  • time (time) object 2015-01-01 03:00:00 … 2051-01-01 00:00:00

Dimensions without coordinates: bnds

Data variables:

lat_bnds  (time, lat, bnds) float64 dask.array<chunksize=(58440, 360, 2), meta=np.ndarray>

lon_bnds  (time, lon, bnds) float64 dask.array<chunksize=(58440, 576, 2), meta=np.ndarray>

tas       (time, lat, lon) float32 dask.array<chunksize=(10, 360, 576), meta=np.ndarray>


cluster size is scaled from 40 to 100.
The size of the dataset is fixed when we scale up the cluster.

Image attached for the chunk structure, computation time.

The notebook can be found here.

1 Like

Hi @aradhakrishnanGFDL; thanks for your post and welcome to Pangeo!

I think the basic problem is that your dataset is too small to benefit from further scaling. It order to verify this, it would be useful to see the dask performance report for some of your runs. But that’s my hunch.

You can think of your total time as the sum of two main parts:

  1. The time it takes Dask to process your task graph. This is roughly proportional to the number of tasks (probably 5-10 s in your case) and does not improve with more workers. (Read the dask docs on best practices for more info about this.)
  2. The time it takes to actually read the data from disk. If you filesystem can deliver parallel scaling, then this should decrease with the number of workers.

You want 2 to be much greater than 1 in order to see scaling behavior. However, this is probably not the case. In order to overcome this, you want to make a much bigger dataset. For comparison, in the cloud storage benchmarks, we make the dataset larger as we increase the number of workers. For 100 workers, we are commonly reading 1-2 TB of data.

However, with 8 MB chunks, you will have a problem if you try to get to these levels: your graph will be too big. So I also recommend you use much larger chunks: 100-300 MB would work well.

So to summarize, try making a test dataset that is 1TB, with 250MB chunks, and try your experiment again. Please report your findings back to us.