I’m trying to store a 64 MiB xarray Dataset to a zarr store via Dataset.to_zarr, and the write takes roughly 10 minutes with the chunking shown below. The Dataset has latitude, longitude, and forecast day (f_day) dimensions, plus a mean absolute error variable measuring the difference between the actual temperature and the forecasted temperature on each f_day.
I expected much faster write times, but I’m not sure whether I’m doing something wrong, and I’m fairly new to dask. I tried switching up the chunks to no avail. Any thoughts?
For reference, here is how I’m creating the data:
monthly_means = []
for f_day in forecast_days:
    logger.info("Processing forecast day %s", f_day)
    # Retrieve data from zarr store in AWS
    month_data = self.get_single_month_data(year, month, forecast_day=f_day)
    # Compare to in-memory ERA5 data also retrieved from zarr store in AWS and compute mean
    # Select only overlapping times between monthly and ERA5 data
    monthly_mean = (
        month_data - era5_data.sel(time=month_data.time.values)
    ).mean(dim="time")
    monthly_mean["f_day"] = f_day
    monthly_means.append(monthly_mean)

# Concatenate means from all forecast days
ds = xr.concat(monthly_means, dim="f_day").set_coords("f_day").rename_vars(
    {self.variable: "mean_abs_err"}
)

# Re-chunk data
return ds.chunk({"f_day": -1, "latitude": 360, "longitude": 480})
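
The write itself is roughly just the following call, where ds is the re-chunked Dataset returned above (the store path and keyword options here are placeholders, not the real target):

# `ds` is the re-chunked Dataset returned by the method above.
# The path below is a placeholder for the actual zarr store location.
ds.to_zarr("monthly_mean_abs_err.zarr", mode="w", consolidated=True)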