@sotosoul thanks for the simple example. I thought I’d point out a couple of quick suggestions that might help, but getting to the bottom of the to_zarr() slowness will require digging into logs…
If you can use a local scratch disk, writing to disk first and then copying the store with s3fs gives reasonable speed (~1s versus your ~100s).
Important Caveat: Directly copying the layout of a NetCDF file to Zarr and putting it in object storage might not be ideal. In most cases you’ll want to change the chunking, perhaps the compression, or consider other formats entirely. More here: https://guide.cloudnativegeo.org. Nevertheless, I’m going for a simple format conversion below:
step1: inspect data and encoding
import s3fs  # used for the upload in step 3
import xarray as xr

# NOTE: Load each variable as a single chunk so that Zarr's heuristics do not chunk the arrays automatically.
# Each array is ~1MB.
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks=-1)
ds
# NOTE: inspecting ds.tcw.encoding shows no on-disk chunking:
{'source': '/tmp/ECMWF_ERA-40_subset.nc',
'original_shape': (62, 73, 144),
'dtype': dtype('int16'),
'missing_value': -32767,
'_FillValue': -32767,
'scale_factor': 0.0013500981745480953,
'add_offset': 44.3250482744756}
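As a quick check on the numbers above, here is a minimal sketch in plain Python (no xarray needed) of two things the encoding implies: the raw size of one array, and how a packed int16 value maps back to a physical value. It assumes the usual CF packing convention, decoded = raw * scale_factor + add_offset, with the fill value treated as missing. The function names `raw_nbytes` and `unpack` are made up for illustration and are not xarray APIs.

```python
import math

# Values copied from ds.tcw.encoding above.
SHAPE = (62, 73, 144)
ITEMSIZE = 2  # bytes per int16 element
SCALE_FACTOR = 0.0013500981745480953
ADD_OFFSET = 44.3250482744756
FILL_VALUE = -32767


def raw_nbytes(shape, itemsize):
    """Size in bytes of an uncompressed array with this shape."""
    return math.prod(shape) * itemsize


def unpack(raw):
    """Decode one packed integer using the CF convention (assumed here)."""
    if raw == FILL_VALUE:
        return math.nan  # missing data
    return raw * SCALE_FACTOR + ADD_OFFSET


print(raw_nbytes(SHAPE, ITEMSIZE))  # 1303488 bytes, i.e. roughly 1.3 MB
print(unpack(0))                    # a raw value of 0 decodes to add_offset
```

This is where the "~1MB per array" figure comes from: with a single chunk per variable and no compression, each variable ends up as one chunk of about that size.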
step2: write to local disk
# NOTE: Avoid Zarr's default blosc compression to match original uncompressed arrays
# https://github.com/pydata/xarray/discussions/5798
# https://zarr.readthedocs.io/en/stable/tutorial.html#compressors
for data_var in ds.data_vars:
    ds[data_var].encoding['compressor'] = None
%%time
ds.to_zarr(store='zarr_uncompressed.zarr')
# CPU times: user 259 ms, sys: 60.4 ms, total: 320 ms
# Wall time: 186 ms
step3: upload to bucket
%%time
s3 = s3fs.S3FileSystem()
lpath = 'zarr_uncompressed.zarr'
rpath = 's3://nasa-cryo-scratch/scottyhq/zarr_uncompressed.zarr'
s3.put(lpath, rpath, recursive=True)
#CPU times: user 293 ms, sys: 62.8 ms, total: 356 ms
#Wall time: 957 ms
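With the Zarr v2 directory layout, each chunk and each metadata document is a separate file, so a recursive upload creates one object per file. Here is a rough sketch, using only the standard library, of listing what would be uploaded and under which key. `plan_upload` is a hypothetical helper, not part of s3fs, and the bucket name is a placeholder. The exact keys s3fs produces may differ in edge cases such as trailing slashes.

```python
import os
import tempfile
from pathlib import Path


def plan_upload(lpath, rpath):
    """Map every file under lpath to (local file, remote key, size in bytes)."""
    root = Path(lpath)
    prefix = rpath.rstrip("/")
    plan = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            local = Path(dirpath) / name
            rel = local.relative_to(root).as_posix()
            plan.append((str(local), f"{prefix}/{rel}", local.stat().st_size))
    # Sort by remote key so the listing is stable across platforms.
    return sorted(plan, key=lambda item: item[1])


# Tiny stand-in for a store so the sketch runs anywhere; a real store has more files.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "demo.zarr"
    (store / "tcw").mkdir(parents=True)
    (store / ".zgroup").write_text("{}")
    (store / "tcw" / "0.0.0").write_bytes(b"\x00" * 16)
    for local, key, size in plan_upload(store, "s3://my-bucket/demo.zarr"):
        print(key, size)
```

Pointing `plan_upload` at 'zarr_uncompressed.zarr' gives a feel for how many objects the upload creates, which is one reason the chunking choice matters for object storage.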
(I ran this experiment on https://hub.cryointhecloud.com, AWS us-west-2, on an r5.xlarge machine, uploading to a bucket in the same region. xarray=2023.10.1, s3fs=2023.10.0. Edit: I also created an issue to investigate the to_zarr() slowness further: "Slow writes of Zarr files using S3Map", Issue #820 in fsspec/s3fs on GitHub.)