Extremely slow xarray/zarr writes

I’m trying to store a ~64 MiB xarray Dataset to a Zarr store via Dataset.to_zarr and the write is taking about 10 minutes. The Dataset has latitude, longitude, and forecast day (f_day) dimensions, plus a mean absolute error variable measuring the difference between actual and forecasted temperature on each f_day; the chunking I’m using is shown in the snippet below.

I expected much faster write times, but I’m somewhat new to dask and not sure whether I’m doing something wrong. I tried switching up the chunks to no avail. Any thoughts?

For reference, here is how I’m creating the data:

        monthly_means = []
        for f_day in forecast_days:
            logger.info("Processing forecast day %s", f_day)
            # Retrieve data from zarr store in AWS
            month_data = self.get_single_month_data(year, month, forecast_day=f_day)
            # Compare to in-memory ERA5 data also retrieved from zarr store in AWS and compute mean
            # Select only overlapping times between monthly and ERA5 data
            monthly_mean = (
                month_data - era5_data.sel(time=month_data.time.values)
            ).mean(dim="time")

            monthly_mean["f_day"] = f_day
            monthly_means.append(monthly_mean)
        # Concatenate means from all forecast days
        ds = xr.concat(monthly_means, dim="f_day").set_coords("f_day").rename_vars(
            {self.variable: "mean_abs_err"}
        )
        # Re-chunk data
        return ds.chunk({"f_day": -1, "latitude": 360, "longitude": 480})

Hey @vbalza :wave:

Are you writing your zarr store to local storage or to cloud storage? There is usually quite a difference in speed between the two.

Here is an older pangeo post about xarray & zarr write speeds. Tons of info here on timing strategies, latency etc.
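If it is going to object storage, you can time the cloud write the same way as the local example further down. A minimal sketch (the bucket name is a placeholder, and this assumes s3fs is installed with AWS credentials configured), with ds being your dataset:

import fsspec

# Placeholder S3 location; assumes s3fs is installed and AWS credentials are set up
store = fsspec.get_mapper("s3://my-bucket/mean_abs_err.zarr")
ds.to_zarr(store, mode="w", consolidated=True)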

FWIW, I tried recreating a similar dataset to yours to get some timings:

import xarray as xr
import numpy as np

time = np.arange(16)
lat = np.linspace(-90, 90, 721)
lon = np.linspace(-180, 180, 1440)

mean_abs_error = np.random.randint(0, 100, size=(time.size, lat.size, lon.size), dtype=np.int32)

ds = xr.Dataset(
    data_vars={
        "mean_abs_error": (("time", "lat", "lon"), mean_abs_error)
    },
    coords={
        "time": time,
        "lat": lat,
        "lon": lon
    }
)
ds

This mock dataset is about 66MB.
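(If you want to double-check the size of your own dataset, ds.nbytes reports the total in-memory bytes.)

print(ds.nbytes / 1e6)  # ~66 MB for this mock dataset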

Saving this dataset to local disk takes about 1.2 seconds:

%%time
ds.to_zarr('tmp.zarr', mode='w', consolidated=True)

Since your dataset is chunked, you could try using dask to speed-up your write:

from distributed import Client

num_workers = 8  # adjust as needed
client = Client(n_workers=num_workers)
client

...
ds.to_zarr(<path.zarr>, consolidated=True)
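With the client running, the dashboard is handy for seeing whether the time is spent computing chunks or writing them:

# The dashboard shows the task stream, so you can see compute vs. store time
print(client.dashboard_link)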

Hope it helps!

10 minutes for 64MB is absurdly slow - are you sure it’s the write step that is actually taking up the time, and not the opening/loading/compute step?

See also this comment
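For example, something along these lines would separate the two (ds being the lazily-built dataset from your snippet):

%%time
# Forces opening the inputs and computing the means; nothing is written yet
ds_in_memory = ds.compute()

%%time
# With the data already in memory, this times only the zarr write itself
ds_in_memory.to_zarr('tmp.zarr', mode='w', consolidated=True)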

@TomNicholas You’re right, it might actually be the compute step: once I call .compute() after taking the mean, the write step speeds up. I’m still trying to figure out how to speed up the mean computation, though.
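Roughly, the .compute() ends up inside the loop, right after the mean:

monthly_mean = (
    month_data - era5_data.sel(time=month_data.time.values)
).mean(dim="time").compute()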

How is the ERA5 data loaded? Maybe that is somehow limiting your performance. Also, I would try again without the last re-chunk step to see if that makes a difference.
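Concretely (reusing the names from the snippet above), that would look something like:

# Check how the ERA5 data is chunked when it comes in
print(era5_data.chunks)

# Return the concatenated result directly, skipping the final re-chunk
ds = xr.concat(monthly_means, dim="f_day").set_coords("f_day").rename_vars(
    {self.variable: "mean_abs_err"}
)
return ds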

Interesting, can you show us all the code please (perhaps upload a reproducible notebook somewhere)?