Extremely slow write to S3 bucket with xarray.Dataset.to_zarr

I’m facing a similar issue. I’m testing the potential of zarr on S3. For my tests, I’m using a 22.2 MB NetCDF file (ECMWF_ERA-40_subset.nc) obtained from https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc.

I open the file with:

ds = xr.open_dataset('ECMWF_ERA-40_subset.nc')

then I dump it as zarr to the local disk with:

ds.to_zarr(store='zarr_example.zarr')

Writing takes approximately 0.46 seconds.

However, if I set the store as an S3 destination and write it as follows:

s3 = s3fs.S3FileSystem(key=..., secret=..., default_fill_cache=False)
store = s3fs.S3Map(root='s3://.../zarr_example.zarr', s3=s3, check=False)
ds.to_zarr(store=store)

…the process takes around 195.94 seconds (!)

The behaviour is similar with a simple GeoTIFF of around 20 MB, loaded via rioxarray. This leads me to believe that the issue lies somewhere in the interface with S3.

It’s worth noting that I’m located in Denmark with a Gigabit connection.

I’d be very interested in your method to achieve such levels of performance.

I’m a bit disappointed in S3FS at the moment because of the abysmal numbers I’m getting.

Hi sotosoul,

I have no experience with S3. The I/O speeds I quote are on a Lustre architecture or on local storage. I tested your ERA subset on my laptop and get a write time on the order of 150 ms to convert the file from netCDF into the format we’re using.

The idea behind our format is quite different from Zarr: instead of having many small files, which puts a burden on the OS and increases the number of inodes, we put all the data into a single file that is as big as possible (500 GB for our experiment).

I could elaborate on that if it’s relevant. So far it’s a very handy data format, well adapted to our volume of data (several PB from a gridded model), but we haven’t advertised it, so it remains little known.
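
To make the idea concrete, here is a minimal, simplified sketch of the single-big-file approach (just an illustration of the idea, not our actual format): append each array’s raw bytes to one large file and keep a small JSON index of offsets, shapes, and dtypes, so reads can seek straight to the data instead of touching many small files.

import json
import numpy as np

def append_array(data_path, index_path, name, arr):
    # Load (or create) the index describing what is already in the big file.
    try:
        with open(index_path) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}
    # Append the raw bytes and record where they start.
    with open(data_path, "ab") as f:
        offset = f.tell()
        f.write(np.ascontiguousarray(arr).tobytes())
    index[name] = {"offset": offset, "shape": list(arr.shape), "dtype": str(arr.dtype)}
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_array(data_path, index_path, name):
    with open(index_path) as f:
        meta = json.load(f)[name]
    # memmap lets us seek directly to the array without reading the whole file.
    return np.array(np.memmap(data_path, mode="r", dtype=meta["dtype"],
                              offset=meta["offset"], shape=tuple(meta["shape"])))

# Example: store and read back an array shaped like the ERA subset variables.
append_array("big_store.bin", "index.json", "tcw",
             np.random.rand(62, 73, 144).astype("float32"))
print(read_array("big_store.bin", "index.json", "tcw").shape)  # (62, 73, 144)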

Nice, interested to hear more. I saw an HPC talk the other day that used Lustre to simulate a POSIX file system for netCDF on S3 buckets. I don’t think that use case was as compelling as yours (someone needs to fix an old C lib), but I’m very interested to make the connection across different spheres; the HPC users were praising the Lustre solution :pray:

It’s important to emphasize that the characteristics of a supercomputer filesystem like Lustre and an object store like S3 are extremely different. What works well on Lustre will not work on S3. For example, an object store does not support “partial writes” to a region of an object.
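
To illustrate the “partial writes” point with a quick sketch (bucket and key names below are placeholders):

import boto3

# A POSIX filesystem (like Lustre) lets you overwrite a byte range in place.
with open("local_file.bin", "wb") as f:
    f.write(b"\xff" * 4096)
with open("local_file.bin", "r+b") as f:
    f.seek(1024)
    f.write(b"\x00" * 16)  # in-place partial write: fine on a filesystem

# S3 has no "write 16 bytes at offset 1024" operation; updating a region of
# an object means re-uploading the whole object.
s3 = boto3.client("s3")
with open("local_file.bin", "rb") as f:
    s3.put_object(Bucket="MY_BUCKET", Key="remote_file.bin", Body=f.read())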

I agree there is plenty of performance remaining on the table with xarray + zarr + s3fs. But let’s not mix apples and oranges by comparing with Lustre file storage. If everyone had Lustre, HDF5 would be just fine and we never would have even started experimenting with Zarr.


That’s indeed a super-cool approach. Thanks for sharing.

I think that existing object storage services such as Amazon’s or Google’s are a good fit for smaller organizations without the resources to host an on-prem system or to invest in something not as common as S3.

My colleagues are making fun of me saying that I have a beef with s3fs. Let’s prove them right :smiley:

@sotosoul thanks for the simple example. I thought I’d point out a couple of quick suggestions that might help. But getting to the bottom of the to_zarr() slowness will require digging into logs…

If you can use a local scratch disk, writing first to disk and then copying the store with s3fs gives reasonable speed (~1 s versus your ~100 s).

Important Caveat: Directly copying the layout of a NetCDF file to Zarr and putting it in object storage might not be ideal. In most cases you’ll want to change the chunking, perhaps the compression, or consider other formats entirely. More here: https://guide.cloudnativegeo.org. Nevertheless, I’m going for a simple format conversion below:

step1: inspect data and encoding

# NOTE: Specify single chunk per variable array to avoid Zarr heuristics to automatically chunk. 
# Each array is ~1MB.
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks=-1)
ds

# NOTE: no on-disk chunking, from inspecting ds.tcw.encoding:
{'source': '/tmp/ECMWF_ERA-40_subset.nc',
 'original_shape': (62, 73, 144),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.0013500981745480953,
 'add_offset': 44.3250482744756}

step2: write to local disk

# NOTE: Avoid Zarr's default blosc compression to match original uncompressed arrays
# https://github.com/pydata/xarray/discussions/5798
# https://zarr.readthedocs.io/en/stable/tutorial.html#compressors
for data_var in ds.data_vars:
    ds[data_var].encoding['compressor'] = None
%%time

ds.to_zarr(store='zarr_uncompressed.zarr')
# CPU times: user 259 ms, sys: 60.4 ms, total: 320 ms
# Wall time: 186 ms

step3: upload to bucket

%%time 

s3 = s3fs.S3FileSystem()
lpath = 'zarr_uncompressed.zarr'
rpath = 's3://nasa-cryo-scratch/scottyhq/zarr_uncompressed.zarr'
s3.put(lpath, rpath, recursive=True)
#CPU times: user 293 ms, sys: 62.8 ms, total: 356 ms
#Wall time: 957 ms

(I ran this experiment on https://hub.cryointhecloud.com in AWS us-west-2 on an r5.xlarge machine, uploading to a bucket in the same region. xarray=2023.10.1, s3fs=2023.10.0. Edit: I also created an issue to investigate the to_zarr() slowness further: Slow writes of Zarr files using S3Map · Issue #820 · fsspec/s3fs · GitHub)


@sotosoul: I made a simple optimization to your setup that really sped things up for me. The key is to call .chunk() on your data in order to leverage concurrency with Dask.

import xarray as xr
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc')
ds = ds.chunk()

bucket_name = "MY_BUCKET"

# initialize dataset without storing data variables
%time delayed = ds.to_zarr(store=f"s3://{bucket_name}/scratch2.zarr", compute=False)
# -> Wall time: 11.7 s
# This is unacceptably slow, but we have a fix coming that should speed it up massively.

# now store the data
%time ds.to_zarr(store=f"s3://{bucket_name}/scratch2.zarr")
# -> Wall time: 501 ms

I’d be curious to see if you get similar results.

Thank you all for your suggestions :slight_smile:

@scottyhq Your method indeed works, and it has the benefit of separating the “Zarr things” from the I/O, meaning that different I/O methods can be used. My personal favorite is a combination of ThreadPoolExecutor + boto3, which has been a little more robust than s3fs in my applications.
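
For reference, a rough sketch of that pattern (bucket name, paths, and max_workers below are placeholders, not values from my production setup):

import os
from concurrent.futures import ThreadPoolExecutor
import boto3

def upload_store(local_root, bucket, prefix, max_workers=16):
    # Write the Zarr store to local disk first, then upload every file in parallel.
    s3 = boto3.client("s3")  # boto3 clients are thread-safe

    def upload_one(local_path):
        rel = os.path.relpath(local_path, local_root).replace(os.sep, "/")
        s3.upload_file(local_path, bucket, f"{prefix}/{rel}")

    files = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(local_root)
             for name in names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(upload_one, files))

# upload_store("zarr_uncompressed.zarr", "MY_BUCKET", "zarr_uncompressed.zarr")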

@rabernat Unfortunately it didn’t optimize things for me… Honestly, I don’t think it’s a chunking issue. We’re talking about 20 MB and something like 40-45 files in total for the entire Zarr store; it shouldn’t be a big deal. My guess is that s3fs is doing “its s3fs thing”, that is, treating cloud storage as a local filesystem, which has much higher IOPS capacity and lower latency than any cloud system an Atlantic Ocean away. We’ve had many issues with the s3fs dependency of Open Data Cube in the past. One thing I recall is how often it would trigger SlowDown errors from S3, and the fact that s3fs did not support deleting empty folders on S3 (empty folders can exist if they were manually created, for example).

I just got off a call with @sotosoul, and I’m documenting what we learned here for the community.

The case he is testing turns out to exacerbate a well-known performance problem with Xarray / Zarr / S3. Specifically, he is writing a dataset with many data variables over a very high-latency internet connection (writing to an S3 bucket in us-west-2 from a laptop in Copenhagen).

The performance problem is that Xarray initializes the variables in a serial loop, and each initialization involves one or more round trips to S3. So the time to initialize the dataset scales roughly like N * number_of_variables * latency, where N is the number of round trips per variable. If the latency is 50 ms and you have 20 variables, you get about 10 s (as in my example, run from the same cloud region). If your latency is 500 ms (not unreasonable for a transatlantic ping), you get about 200 s.
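
To make the scaling concrete, here is a back-of-the-envelope estimate. The number of round trips per variable is an assumption, not something measured directly; the timings in this thread suggest it is roughly 10-20 for this dataset.

def init_time_estimate(n_variables, latency_s, round_trips_per_variable=10):
    # Serial initialization: every round trip for every variable is paid in full.
    return n_variables * round_trips_per_variable * latency_s

print(init_time_estimate(20, 0.050))  # 10.0 s  -- same cloud region
print(init_time_estimate(20, 0.500))  # 100.0 s -- transatlantic laptop
# With ~20 round trips per variable, the transatlantic case reaches the
# ~200 s reported at the top of this thread.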

But there is good news! There is a PR in the works in Xarray (by @dcherian) which will do all of these initialization operations concurrently, meaning that we should eliminate the number_of_variables scaling and instead just get initialization that scales with latency alone!

You can already see some massive speedups for this scenario reported there. :tada: I just tried that PR on this dataset, and my initialization time went from 12 s to 368 ms.

My hope is that we can have this feature released in Xarray very soon.


With the help of @rabernat 's feedback, I proceeded with the following setup:

  1. Identify the lowest-latency data center with the help of cloudping.info (in my case eu-central-1). It may be worth noting that I’m connecting over a 100/100 business fiber line from Wageningen, Netherlands.
  2. Create a conventional S3 bucket to be used as Zarr store in that region
  3. Open a ~500 MB GeoTIFF with the help of rioxarray
  4. Convert the DataArray to a Dataset with one variable (to keep things consistent).
  5. Configure chunking (more on that later)
  6. Dump to S3

import rioxarray as rxr
from s3fs import S3FileSystem, S3Map

s3 = S3FileSystem(...)
store = S3Map(root=..., s3=s3, ...)

da = rxr.open_rasterio('path/to/sample500mbfile.tif')
ds = da.to_dataset(name='dummy_datacomponent')
ds = ds.chunk({'band': 1, 'x': 1000, 'y': 1000})
ds.to_zarr(store=store, chunk_store=None, mode="w-", compute=True)

In this scenario, the writing speed is mostly affected by the chosen chunking, which controls how many files are written to S3. Even with compute=True, performance is pretty good and in line with @rabernat 's comments when the x and y chunks are 1000 and relatively few files are created on S3. If I switch to 100, several thousand files get created, and this is where it gets quite slow (minutes versus seconds). Of course, the question is how efficient smaller chunks will be when it comes to consuming the Zarr data…
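
For a rough sense of how chunk size drives the object count, here is the arithmetic; the raster dimensions below are assumptions for illustration (a single ~11,000 x 11,000 float32 band is roughly 500 MB).

import math

bands, ny, nx = 1, 11_000, 11_000

def n_chunks(chunk_xy):
    # One Zarr chunk object per (band, y-block, x-block) combination.
    return bands * math.ceil(ny / chunk_xy) * math.ceil(nx / chunk_xy)

print(n_chunks(1000))  # 121 chunk objects to upload
print(n_chunks(100))   # 12,100 chunk objects: two orders of magnitude more PUT requests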

Amazon recently launched directory buckets. I haven’t played around with them enough to form an opinion on their performance, but in general they seem to sit between EFS and conventional object storage; Amazon advertises their low latency. They don’t work exactly the same way as normal S3, but I’d like to test their performance (and usability) and I’ll share my findings.

As a side comment, I believe that the value of Zarr is not only in its technology but also in the great support offered by this community. Thanks a lot, highly appreciated :slight_smile:


I’m also very interested in what the performance of these buckets is like; any results you find would, I think, be of interest beyond just this thread :slight_smile:


To add to this thread: “simplecache::s3://…” should better support write transaction contexts, so that you can write to your store (this writes some local files, though the metadata could equally be kept in memory) and expect everything to be written to the remote together, in parallel, when the context ends.
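
A sketch of how that pattern could look from the user side; the bucket path is a placeholder, and the exact write/transaction behaviour depends on your fsspec/s3fs versions.

import fsspec
import xarray as xr

ds = xr.open_dataset("ECMWF_ERA-40_subset.nc")

# "simplecache" wraps the remote filesystem and stages writes in local files.
fs = fsspec.filesystem("simplecache", target_protocol="s3")
store = fs.get_mapper("MY_BUCKET/zarr_example.zarr")

with fs.transaction:
    # Chunks and metadata land in the local cache first; the intent is that
    # they are pushed to S3 together when the transaction context exits.
    ds.to_zarr(store=store, mode="w")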
