Hello,
I am running into extremely slow runtimes when writing an xarray.Dataset to an S3 bucket in Zarr format. I can reproduce the problem with the code snippet below, which creates an xr.Dataset containing 20 xr.DataArrays. Am I doing something wrong?
The Zarr documentation mentions that the choice of compressor can significantly affect runtime. I tried Blosc's lz4 compression instead, but the write took even longer.
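For reference, switching Blosc to lz4 looks roughly like this (the clevel and shuffle values here are only illustrative, matching the zlib version further below):

# Illustrative lz4 variant of the Blosc compressor; clevel/shuffle values are examples only
compressor_lz4 = zarr.Blosc(cname="lz4", clevel=2, shuffle=1)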
Do I need to convert my arrays to Dask arrays before I write to Zarr?
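If so, I assume that would look something like the sketch below; the chunk sizes are placeholders, and presumably they would have to line up with the 'chunks' given in the encoding settings:

# Hypothetical sketch of what I mean: rechunk into Dask-backed arrays first,
# then call to_zarr as before. The chunk sizes here are placeholders and would
# presumably need to be consistent with the 'chunks' in encoding_settings.
ds_dask = ds.chunk({'t': 250, 'x': 10, 'y': 10})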
Is there a recommendation for selecting the compressor's blocksize? I don't provide one, so it is chosen automatically, but I wonder whether I should set it explicitly and whether it should be consistent with the chunk sizes I specify in the Zarr encoding settings.
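In other words, should I be passing blocksize explicitly, along these lines (the value below is arbitrary, just to show the parameter I mean)?

# Hypothetical: set the Blosc blocksize explicitly (in bytes); 0 means choose automatically.
# The value here is arbitrary and only illustrates the parameter in question.
compressor = zarr.Blosc(cname="zlib", clevel=2, shuffle=1, blocksize=2**18)

For completeness, here is the full snippet that reproduces the slow write: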
import xarray as xr
import numpy as np
import s3fs
import zarr

# Create dataset with a bunch of arrays
ds = xr.Dataset()
compressor = zarr.Blosc(cname="zlib", clevel=2, shuffle=1)
encoding_settings = {}

for index in range(20):
    da_name = f'foo_{index}'
    da = xr.DataArray(
        np.arange(250 * 800 * 800).reshape((250, 800, 800)),
        name=da_name,
        dims=('t', 'x', 'y')
    )
    ds[da_name] = da
    encoding_settings[da_name] = {
        '_FillValue': -32767.0,
        'compressor': compressor,
        'dtype': 'short',
        'chunks': (25000, 10, 10)
    }

# Write to Zarr
s3_out = s3fs.S3FileSystem(anon=False)
store_out = s3fs.S3Map(root='s3://its-live-data/test_datacubes/ds_s3.zarr', s3=s3_out, check=False)
ds.to_zarr(store_out, mode='w', encoding=encoding_settings, consolidated=True)
This takes about an hour to run on an EC2 instance in the same us-west-2 region where the S3 bucket resides.
Thank you for any input or suggestions,
Masha